Multivariate normality

noetsi

Fortran must die
I know various methods of determining if the bivariate relationship of two variables is normal. But, as Dason has drilled into me, it's multivariate normality (that is, the normality of the residuals) that actually matters.

I am not sure how to test for that. If you plotted the residuals into a QQ plot and it suggested normality, would that be a valid way to be sure the regression model had multivariate normality?

TheEcologist

R purist
Although I am not particularly a fan of normality tests, I know that tests like the Shapiro-Wilk Multivariate Normality Test exist.

A quick google search also shows there are R packages (PDF alert).

Maybe you could look into the theory behind these tests and come up with something satisfactory?

Dason

I think you still have a misunderstanding. The reason I kept mentioning multivariate normality is because you were talking about the response variable being normally distributed. Some authors might say that Y needs to be normal but if that's the case then they're talking about multivariate normality for Y where the mean vector is a function of X. If all we want to do is check the normality assumption we can stick to univariate normal tests because the previous statement is the same as asking if the residuals are univariately normally distributed...
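A minimal sketch of that point in R, using simulated data (the skewed predictor, the coefficients, and the seed are all assumptions made up for the demo): Y on its own looks non-normal because its mean moves with X, while the residuals from the fitted model are what the normality assumption actually concerns. The QQ plot of the residuals is the visual check asked about above.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
n <- 500
x <- rexp(n)                      # deliberately skewed predictor (demo assumption)
y <- 2 + 3 * x + rnorm(n)         # errors are normal by construction
fit <- lm(y ~ x)

# Y itself is far from normal because its mean shifts with the skewed x
p_y <- shapiro.test(y)$p.value

# The residuals are what the normality assumption refers to
p_res <- shapiro.test(resid(fit))$p.value

# Visual check: points close to the line suggest approximately normal residuals
qqnorm(resid(fit))
qqline(resid(fit))
```

A QQ plot like this is a diagnostic rather than proof, so it can suggest normality of the residuals but cannot make you "sure" of it.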

spunky

Smelly poop man with doo doo pants.
the psych package in R also has mardia's test of multivariate skewness/kurtosis which, if statistically significant, gives you evidence to suspect your distribution is not multivariate normal.

i know noetsi uses Mplus, and Mplus also gives you mardia's test.

now, for what reason in particular do you need to test for multivariate normality?
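For readers without those packages, here is a rough base-R sketch of Mardia's skewness and kurtosis statistics. The function name `mardia_sketch` is hypothetical and this is a hand-rolled illustration using the standard textbook formulas, not the psych or Mplus implementation, so small-sample corrections used by those tools may differ.

```r
# Hand-rolled Mardia statistics (hypothetical helper, not psych::mardia)
mardia_sketch <- function(x) {
  x  <- as.matrix(x)
  n  <- nrow(x)
  p  <- ncol(x)
  xc <- scale(x, center = TRUE, scale = FALSE)   # centre each column
  S  <- crossprod(xc) / n                        # covariance with divisor n
  D  <- xc %*% solve(S) %*% t(xc)                # pairwise Mahalanobis products

  b1 <- sum(D^3) / n^2                           # multivariate skewness
  b2 <- sum(diag(D)^2) / n                       # multivariate kurtosis

  skew_stat <- n * b1 / 6                        # approx. chi-square under normality
  skew_df   <- p * (p + 1) * (p + 2) / 6
  kurt_z    <- (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)  # approx. N(0, 1)

  list(b1 = b1, b2 = b2, skew_df = skew_df,
       skew_p = pchisq(skew_stat, skew_df, lower.tail = FALSE),
       kurt_p = 2 * pnorm(-abs(kurt_z)))
}

set.seed(2)                                       # arbitrary seed
res_norm <- mardia_sketch(matrix(rnorm(1000), ncol = 2))  # bivariate normal data
res_skew <- mardia_sketch(matrix(rexp(1000),  ncol = 2))  # clearly skewed data
```

Small p-values in `skew_p` or `kurt_p` are evidence against multivariate normality; large ones only mean the test did not detect a departure.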

Dason

.... because...?
I can't answer for him but I feel similarly. Typically they aren't that great with small sample sizes and once you get a large enough sample size to detect departure from normality then you have a large enough sample to not care about normality...
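A quick simulation sketch of that trade-off, under assumptions chosen only for illustration (a t distribution with 5 degrees of freedom as the "mildly non-normal" case, sample sizes of 20 and 5000, 200 replications): it estimates how often shapiro.test rejects at the 0.05 level for a small and a large sample drawn from the same distribution.

```r
set.seed(42)                      # arbitrary seed, only for reproducibility

# Proportion of Shapiro-Wilk rejections at the 0.05 level for samples of size n
# drawn from a t distribution with 5 df (heavier tails than the normal)
reject_rate <- function(n, reps = 200) {
  mean(replicate(reps, shapiro.test(rt(n, df = 5))$p.value < 0.05))
}

rate_small <- reject_rate(20)     # small sample: the departure is often missed
rate_large <- reject_rate(5000)   # large sample: the same departure is nearly always flagged

c(n20 = rate_small, n5000 = rate_large)
```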

noetsi

Fortran must die
I don't use them because they have a strong reputation for very weak power.

I cannot use Mplus at work (the state will not allow its purchase, nor let me purchase it personally and place it on the computer - don't ask why) and it will be a while before I learn R.

Is it legitimate to use the residuals of a regression in a QQ plot to test for normality?

once you get a large enough sample size to detect departure from normality then you have a large enough sample to not care about normality...
Why would that ever be true? I understand the CLT comes into play at a certain point, but I read treatments all the time in the literature about normality and I have almost never seen one argue that at a certain sample size normality does not matter.

TheEcologist

R purist
I can't answer for him but I feel similarly. Typically they aren't that great with small sample sizes and once you get a large enough sample size to detect departure from normality then you have a large enough sample to not care about normality...
Exactly, also once you have a large sample size these tests also tend to detect significant "non-normality" when departures from normality are meaningless.

I mean look what one outlier in 5000 does to a shapiro.test

Code:
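# test once: 4999 standard normal draws plus one outlier at -5.5
shapiro.test(c(rnorm(4999), -5.5))
# test 100 times and collect the p-values
pvals <- replicate(100, shapiro.test(c(rnorm(4999), -5.5))$p.value)
# density of the p-values, with the 0.05 threshold marked in red
plot(density(pvals))
abline(v = 0.05, col = 'red')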
You can bet your pretty pink panties that the "sampling distribution" of the above is normal.
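One way to look at that "sampling distribution" claim is a sketch like the following, which assumes the statistic of interest is the sample mean and uses 2000 replications (both are my assumptions, not something stated above). Each replicate redraws the 4999 normal values, keeps the same outlier at -5.5, and records the mean.

```r
set.seed(7)                       # arbitrary seed, only for reproducibility

# Sample mean of 4999 standard normal draws plus the fixed outlier, repeated 2000 times
means <- replicate(2000, mean(c(rnorm(4999), -5.5)))

# QQ plot of the simulated means against the normal distribution
qqnorm(means)
qqline(means)

# Centre and spread of the simulated sampling distribution
c(mean = mean(means), sd = sd(means))
```

The single outlier shifts the centre of the means only slightly (by about -5.5/5000) and leaves their shape essentially normal.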

noetsi

Fortran must die
Of course years ago I read a Stanford professor argue outliers could make the results of ANOVA invalid regardless of the sample size (that is asymptotic methods were no protection against bias - he argued the bias could actually get worse with larger samples given this issue).

trinker

ggplot2orBust
I don't have pretty pink panties. Did I not get my pair with TS membership?

trinker

ggplot2orBust
I can't remember who (link or vinux maybe) wrote a post on the old statspedia, wherever that got to, about multivariate normality.

spunky

Smelly poop man with doo doo pants.
large enough sample size to detect departure from normality then you have a large enough sample to not care about normality...
uhm... i can see how this is true in the case of least-squares but would it also apply for ML? i remember reading in Pawitan's classic book on everything-you-need-to-know-about-maximum-likelihood that the choice of likelihood could (or could not) create a whole bunch of problems in terms of parameter bias, etc. so i do think that multivariate normality should be satisfied (at least as much as possible) if one is choosing a normal likelihood model, or something similar to it.

spunky

Smelly poop man with doo doo pants.
Exactly, also once you have a large sample size these tests also tend detect significant "non-normality" when departures from normality are meaningless.
oh pff... that's just bad data practice not to check (and dump) outliers before the analysis

TheEcologist

R purist

oh pff... that's just bad data practice not to check (and dump) outliers before the analysis
I'm sorry but what you are describing is actually bad practice, IMO very bad practice. It's sad that this is still taught as "common statistical sense" in some fields.

You should only ever "dump" outliers, kicking and screaming, being very certain they are errors.
You should certainly not dump them on a reflex!

Best thing I can do is quote my FAQ part on this;

How do I remove or deal with outliers?

Removing outliers can cause your data to become more normal but contrary to what is sometimes perceived, outlier removal is subjective, there is no real objective way of removing outliers.

Always remember that these points remain observations and you should not just throw them out on a whim. Instead you should have good reasons to remove your outliers. There may be many truly valid reasons to remove data-points. These include outliers caused by measurement errors, incorrectly entered data-points or impossible values in real life. If you feel that any outliers are erroneous data points and you can validate this, then you should feel free to remove them.

On the other hand, if you see no reason why your outliers are erroneous measurements then there is no truly objective way to remove them. They are true observations and you may have to consider that the assumptions of your test do not correspond to the reality of your situation. You could always try a non-parametric test (which in general are less sensitive to outliers) or some other analysis that does not require the assumption that your data is normally distributed.
Or from Wikipedia
Deletion of outlier data is a controversial practice frowned on by many scientists and science instructors
Outliers are data. Dumping them is bad data practice and you should feel very dirty every time you do it without a very good reason.


spunky

Smelly poop man with doo doo pants.
oh, pff....

if dumping outliers is good enough for NASA then it's good enough for me

nah, in all seriousness. i was really into this (and other good statistical practices) like a few years ago... then when i started doing stats consulting for students and profs alike i realised everybody was dumping them (or tinkering with their data in other unspeakable ways) and kept on saying no... then the weeks became months and the months became years and i started noticing that even after people had gone through the mandatory research methods courses where they were instructed to not do it... they were still doing it.

i felt too tired to swim against the current and just lost interest. i now know it shouldn't be done and i guess i'm quite happy with that.

spunky

Smelly poop man with doo doo pants.
the world in my field is a lot more similar to what noetsi described than i thought it was a few years ago, particularly because of how closely it is tied to politics and how difficult it is to measure any kind of outcome (actually, of how difficult it is to measure anything). i think that's why you see so many misconceptions (like the perennial one of normality in OLS regression) arise over and over again in the social sciences. people don't see statistics as an important tool to learn to contribute towards science, but as a hurdle that needs to be jumped over in order to get published, and since our measurements are not nearly as precise as those in the natural sciences, things get really murky very quickly when trying to evaluate the relative merits of one theory over another.

Dason

I guess that's disheartening if apparently one of the few that knows that you shouldn't remove outliers for no good reason does it anyways.

Sure the measurements are murky and the type of data you work with isn't particularly nice... but I don't see it as a good thing if people use a method that basically makes the statistical portion of the analysis... worthless. To me that doesn't sound like science - it's more like wishful thinking... "I want this theory to be right so I'll just manipulate the data to make it more clear that my theory is correct".

GretaGarbo

Human
So, what do the above writers suggest to do about outliers?

Throw them away?

Pretend that all the data are always correct?

I think that it is a basically good scientific attitude to be skeptical towards the data. They know that “maybe we have done something wrong here” and they want to check!

One procedure I have heard of is that if they get a very high value, then they measure it again. And if the second measurement is also large, then they throw away the second value and keep the first, since it seemed to be okay. Else keep the second.

TheEcologist

R purist
Throw them away?
All I am saying is don't do it on a whim.

Pretend that all the data are always correct?
You can apply the same logic in reverse: "why pretend they are wrong?" The simple fact is that there are a ton of natural phenomena we humans can measure that will contain outliers, even if we measure perfectly and sample randomly. In some cases the outliers are actually the most important points. Here is a simple example: think about the income among families in the US. Apply the same outlier removal logic to these and you'll quickly throw out 50% of the total national private wealth, and you can then conclude all kinds of erroneous things about income equality in the States. That is a simple example, but how many times have theories been supported with data where high-influence outliers were removed and the consequences were not as clear as the above?
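A rough sketch of the mechanism, using simulated log-normal values as a stand-in for a heavy-tailed quantity like income. These are NOT real US data; the distribution, its parameters, and the "mean + 2 SD" cutoff are assumptions for illustration, so the exact share will differ from the figure quoted above.

```r
set.seed(123)                     # arbitrary seed, only for reproducibility

# Simulated heavy-tailed "incomes" (hypothetical parameters, not real data)
income <- rlnorm(1e5, meanlog = 10, sdlog = 1.5)

# Reflex rule: drop everything more than 2 standard deviations above the mean
cutoff  <- mean(income) + 2 * sd(income)
dropped <- income > cutoff

frac_dropped  <- mean(dropped)                       # share of observations removed
share_dropped <- sum(income[dropped]) / sum(income)  # share of the total they carried

c(observations = frac_dropped, total = share_dropped)
```

In a skewed distribution the few points removed by such a rule tend to carry a much larger share of the total than their count suggests.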

Look at how this guy deals with outliers (3 min): http://www.youtube.com/watch?v=GXy__kBVq1M

This certainly seems like a more "healthy" way to think about outliers.
Sure they may be wrong, but they may also be telling you something.

So, what do the above writers suggest to do about outliers?
Nobody ever said to not be skeptical. We are actually applying the same level of scepticism towards removal of outliers. If they are true observations and not erroneous, you should be adjusting your model (which often in practical situations of everyday research means adjusting the corresponding distribution with central measure). In most simple cases, if the mean is influenced too strongly by a few points (that are real observations), you should switch to a measure that is more appropriate for your data. Often a robust measure like the median will work much better, without you having to revert to adjusting reality to your model. Also, many distributions can happily account for real data points that may be considered outliers under the normal model.
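A tiny illustration of the mean versus median point, with made-up numbers (the values below are hypothetical and chosen only so the arithmetic is easy to follow):

```r
x     <- c(2, 3, 4, 5, 6)         # made-up observations
x_out <- c(x, 100)                # same data plus one large but genuine value

mean_before   <- mean(x)          # 4
median_before <- median(x)        # 4

mean_after    <- mean(x_out)      # 20: pulled far up by the single point
median_after  <- median(x_out)    # 4.5: barely moves
```

The observation stays in the data; only the summary changes to one that is not dominated by it.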

One procedure I have heard of is that if they get a very high value, then they measure it again. And if the second measurement is also large, then they throw away the second value and keep the first, since it seemed to be okay. Else keep the second.
The procedure you outline is an accepted and common way, but whether it is practical, ethical or even possible in the realm of reality depends on the subject under study.

Examples
Practical: Not everyone has the resources to measure again.
Ethical: It may not be ethical to subject a patient or even an animal to a painful procedure a second time.
Outside the realm of reality: Deep sea sonar measurements can't be repeated as the time of measurement has passed. Was the bloop event an outlier? We can't go back in time to measure it again, so who knows?

Now for each of these examples there will be logical checks and methods to scrutinize but it should be clear there are no one size fits all solutions.

The procedure of removing everything more than 2 or 3 standard deviations away, or removing points you think lie too far off the line of fit so "they must be wrong", is however NOT good science practice and quite frankly borders on scientific misconduct. If you do this, you should clearly report it, otherwise it IS scientific misconduct.

And that is my two cents,

TE

EDIT: This post was the embodiment of "I live the way I type; fast, with a lot of mistakes."
