I am not sure how to test for that. If you plotted the residuals into a QQ plot and it suggested normality, would that be a valid way to be sure the regression model had multivariate normality?

- Thread starter: noetsi
- Tags: multivariate normality, outliers


A quick Google search also shows there are R packages (PDF alert).

Maybe you could look into the theory behind these tests and come up with something satisfactory?

I know noetsi uses Mplus, and Mplus also gives you Mardia's test.
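For anyone stuck without Mplus, Mardia's multivariate skewness and kurtosis statistics can be computed in a few lines of base R directly from their textbook definitions. This is a rough sketch (the function name `mardia_sketch` is mine, not from any package), so check it against a dedicated implementation before relying on it:

```
# Mardia's multivariate skewness and kurtosis, computed from the definitions
mardia_sketch <- function(x) {
  x  <- as.matrix(x)
  n  <- nrow(x); p <- ncol(x)
  xc <- scale(x, center = TRUE, scale = FALSE)   # center each column
  S  <- crossprod(xc) / n                        # ML covariance estimate
  D  <- xc %*% solve(S) %*% t(xc)                # D[i, j] = Mahalanobis cross-products
  b1 <- mean(D^3)                                # multivariate skewness
  b2 <- mean(diag(D)^2)                          # multivariate kurtosis
  skew_stat <- n * b1 / 6                        # approx. chi-squared, df = p(p+1)(p+2)/6
  kurt_stat <- (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)  # approx. N(0, 1)
  list(
    skew_p = pchisq(skew_stat, df = p * (p + 1) * (p + 2) / 6, lower.tail = FALSE),
    kurt_p = 2 * pnorm(abs(kurt_stat), lower.tail = FALSE)
  )
}

set.seed(1)
# on genuinely multivariate-normal data, both p-values should usually be unremarkable
mardia_sketch(matrix(rnorm(300), ncol = 3))
```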

Now, for what reason in particular do you need to test for multivariate normality?

I cannot use Mplus at work (the state will not allow its purchase, nor let me purchase it personally and install it on the computer - don't ask why), and it will be a while before I learn R.

Is it legitimate to use the residuals of a regression in a QQ plot to test for normality?
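For what it's worth, the graphical check itself is a one-liner in R. One caveat: a QQ plot of residuals assesses the normality of the regression errors, which is the assumption that usually matters for inference, rather than full multivariate normality of all the variables. A minimal sketch on simulated data:

```
set.seed(42)
# simulated data: y depends linearly on x with normal errors
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

# QQ plot of the residuals against theoretical normal quantiles
qqnorm(resid(fit))
qqline(resid(fit))   # points hugging this reference line suggest normal errors
```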

Once you get a large enough sample size to detect departure from normality, then you have a large enough sample to not care about normality...

I can't answer for him, but I feel similarly. Typically these tests aren't that great with small sample sizes, and once you get a large enough sample size to detect departure from normality, you have a large enough sample to not care about normality...

I mean, look at what one outlier in 5000 does to a `shapiro.test`:


```
# make the simulation reproducible
set.seed(1)
# test once: 4999 standard normal draws plus a single outlier at -5.5
shapiro.test(c(rnorm(4999), -5.5))
# repeat the test 100 times and look at the distribution of p-values
pvals <- replicate(100, shapiro.test(c(rnorm(4999), -5.5))$p.value)
plot(density(pvals))
abline(v = 0.05, col = 'red')   # conventional significance threshold
```

> once you get a large enough sample size to detect departure from normality then you have a large enough sample to not care about normality...

Oh pff... that's just bad data practice, not to check (and dump) outliers before the analysis.

You should only ever "dump" outliers, kicking and screaming, being very certain they are errors.

You should certainly not dump them on a reflex!

Best thing I can do is quote my FAQ part on this;

Removing outliers can cause your data to become more normal, but contrary to what is sometimes perceived, outlier removal is subjective; there is no truly objective way of removing outliers.

Always remember that these points remain observations, and you should not just throw them out on a whim. Instead you should have good reasons to remove your outliers. There may be many truly valid reasons to remove data points, including outliers caused by measurement errors, incorrectly entered data points, or values that are impossible in real life. If you feel that any outliers are erroneous data points and you can validate this, then you should feel free to remove them.

On the other hand, if you see no reason why your outliers are erroneous measurements, then there is no truly objective way to remove them. They are true observations, and you may have to consider that the assumptions of your test do not correspond to the reality of your situation. You could always try a non-parametric test (which, in general, is less sensitive to outliers) or some other analysis that does not require the assumption that your data are normally distributed.
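To illustrate the non-parametric suggestion, here is a sketch (with made-up data) of how a single wild value affects a t-test versus the rank-based `wilcox.test`. The rank transformation means the outlier counts only as "the largest value", not as a value of 50:

```
set.seed(7)
a <- rnorm(30)                # group a: standard normal
b <- rnorm(30, mean = 1)      # group b: shifted by 1
b[1] <- 50                    # one wild outlier in group b

# the outlier inflates the variance and pulls the t-test around
t.test(a, b)$p.value
# the Wilcoxon test only sees ranks, so the outlier has little extra influence
wilcox.test(a, b)$p.value
```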

Deletion of outlier data is a controversial practice, frowned upon by many scientists and science instructors.


If dumping outliers is good enough for NASA, then it's good enough for me.

Nah, in all seriousness: I was really into this (and other good statistical practices) a few years ago... then when I started doing stats consulting for students and profs alike, I realised everybody was dumping them (or tinkering with their data in other unspeakable ways), and I kept on saying no... then the weeks became months and the months became years, and I started noticing it even after people had gone through the mandatory research methods courses where they were instructed otherwise...

I felt too tired to swim against the current and just lost interest. I know it shouldn't be done, and I guess I'm quite happy with that.

Sure the measurements are murky and the type of data you work with isn't particularly nice... but I don't see it as a good thing if people use a method that basically makes the statistical portion of the analysis... worthless. To me that doesn't sound like science - it's more like wishful thinking... "I want this theory to be right so I'll just manipulate the data to make it more clear that my theory is correct".

Throw them away?

Pretend that all the data are always correct?

I think that it is a basically good scientific attitude to be skeptical towards the data. Good scientists think "maybe we have done something wrong here" and they want to check!

One procedure I have heard of is that if they get a very high value, then they measure it again. And if the second measurement is also large, then they throw away the second value and keep the first, since it seemed to be okay. Otherwise, they keep the second.

What about that procedure?

> Throw them away?
>
> Pretend that all the data are always correct?

Look at how this guy deals with outliers (3 min): http://www.youtube.com/watch?v=GXy__kBVq1M

This certainly seems like a more "healthy" way to think about outliers.

Sure they may be wrong, but they may also be telling you something.

So, what do the above writers suggest we do about outliers?

> One procedure I have heard of is that if they get a very high value, then they measure it again. And if the second measurement is also large, then they throw away the second value and keep the first, since it seemed to be okay. Otherwise, they keep the second.

Practical: Not everyone has the resources to measure again.

Ethical: It may not be ethical to subject a patient or even an animal to a painful procedure a second time.

Outside the realm of reality: Deep-sea sonar measurements can't be repeated, as the time of measurement has passed. Was the Bloop event an outlier? We can't go back in time to measure it again, so who knows?

Now, for each of these examples there will be logical checks and methods to scrutinize, but it should be clear that there is no one-size-fits-all solution.

The procedure of removing everything more than 2 or 3 standard deviations away, or removing points you think lie too far off the line of fit because "they must be wrong", is however NOT good scientific practice and quite frankly borders on scientific misconduct. If you do this, you should clearly report it; otherwise it IS scientific misconduct.
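To see why the mechanical "remove everything beyond k standard deviations" rule is dangerous even when nothing is wrong with the data, apply it repeatedly to perfectly normal data. Every pass flags new "outliers", because the normal distribution always puts some mass in the tails, and the estimated spread shrinks with each round of deletion (a sketch):

```
set.seed(123)
x <- rnorm(10000)                        # perfectly clean standard normal data
for (pass in 1:5) {
  keep <- abs(x - mean(x)) < 2 * sd(x)   # the mechanical 2-SD rule
  cat("pass", pass, ": dropped", sum(!keep),
      "points, sd now", round(sd(x[keep]), 3), "\n")
  x <- x[keep]                           # ...and repeat on the "cleaned" data
}
```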

And that is my two cents,

TE

EDIT: This post was the embodiment of "I live the way I type; fast, with a lot of mistakes."
