# Thread: Linear regression with non-normal data?

1. ## Linear regression with non-normal data?

It is my understanding that for linear regression both data sets must be normally distributed.

One of my two sets of data (x) is normally distributed, but the second (y) has some values that could be outliers. When I make a histogram of my y data it looks something like this:

(If the image does not show in the message post you can see it at this link: http://tinypic.com/r/w14ldc/5)
When I remove the outliers (to the right), the histogram looks like a normal distribution (the data also pass other tests of normality).

Here is my conundrum.

There is very little difference in r squared and P between the regression with the outliers left in and the regression with them taken out. Linear regression with the outliers left in the data results in an r squared of 0.201 and P < 0.00001. Linear regression with the outliers removed results in an r squared of 0.198 and still P < 0.00001. The only difference is my resulting y = mx + b equation, and that equation works a lot better when the outliers are left in my analysis.

So, given the extremely significant P values and the very small difference in r squared, do I still have to remove the outliers? Or, under these circumstances, would linear regression still be appropriate even though my y data do not appear to come from a normal distribution?

2. ## Re: Linear regression with non-normal data?

Originally Posted by GrungyGoose
It is my understanding that for linear regression both data sets must be normally distributed.
Your understanding is incorrect. We don't make any assumptions on how the predictors are distributed. And we don't make any assumptions about the marginal distribution of the y values. What we want to be normally distributed are the error terms. You can check this assumption by running the regression and then checking the residuals you get from the model.

4. ## Re: Linear regression with non-normal data?

In practice, if not in theory, heavily skewed individual variables (particularly ones skewed in the opposite direction from the dependent variable they predict) create significant problems for many methods, such as SEM and some regressions (ANOVA less so if the sample size is large enough). It is common in well-known statistics texts to use univariate tests of normality even though this is not technically correct. That is because very few software packages offer multivariate tests (Mardia's multivariate test, for example, is not in SAS or SPSS).

There is an ongoing dispute in the statistical literature about how robust various methods are when multivariate normality is violated. Different authors say very different things on this topic (robustness here means whether the method still works when an assumption such as normality is violated).

5. ## Re: Linear regression with non-normal data?

Originally Posted by noetsi
It is common in well-known statistics texts to use univariate tests of normality even though this is not technically correct.
To test the normality of what exactly?

6. ## Re: Linear regression with non-normal data?

To test whether the method's assumption of normality is met. For example, if you run regression (which assumes multivariate normality), a commonly recommended test is a univariate skewness test. I know that this is not formally correct, and the authors state this, but they recommend it as better than the alternative, which is not to test this at all. Few know how to test multivariate skewness, etc.

7. ## Re: Linear regression with non-normal data?

Regression doesn't assume multivariate normality. It assumes that the dependent variable, conditioned on the independent variables, is normally distributed. In other words, the assumption of normality is on the error term.
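
As a sketch of what that assumption looks like written out (the symbols below are standard textbook notation, not taken from the posts above), the usual linear model with $p$ predictors is:

$$
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)
$$

which is the same as saying

$$
y_i \mid x_{i1}, \dots, x_{ip} \;\sim\; N\!\left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip},\; \sigma^2\right).
$$

Nothing here constrains the distribution of the $x$ values, and the marginal distribution of $y$ is a mixture over whatever the $x$ values happen to be, so it need not look normal even when the errors are.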

8. ## Re: Linear regression with non-normal data?

The basic assumptions of multivariate regression are 1) multivariate normality of the residuals, 2) homogenous variances of residuals conditional on predictors, 3) common covariance structure across observations, and 4) independent observations. Unfortunately, testing the first three assumptions is very difficult.

Regression assumes that variables have normal distributions. Non-normally distributed variables (highly skewed or kurtotic variables, or variables with substantial outliers) can distort relationships and significance tests. There are several pieces of information that are useful to the researcher in testing this assumption: visual inspection of data plots, skew, kurtosis, and P-P plots give researchers information about normality, and Kolmogorov-Smirnov tests provide inferential statistics on normality. Outliers can be identified either through visual inspection of histograms or frequency distributions, or by converting data to z-scores.

http://pareonline.net/getvn.asp?v=8&n=2

I still don't understand the distinction you make between multivariate normality and multivariate normality of the residuals, but in any case (perhaps as a short hand) it is common to say that regression requires multivariate normality and do the test I noted previously.

I am not questioning that you are right. I am just pointing out the way the method is commonly described, which is in terms of multivariate normality.

10. ## Re: Linear regression with non-normal data?

Originally Posted by noetsi

http://pareonline.net/getvn.asp?v=8&n=2

I still don't understand the distinction you make between multivariate normality and multivariate normality of the residuals, but in any case (perhaps as a short hand) it is common to say that regression requires multivariate normality and do the test I noted previously.

I am not questioning that you are right. I am just pointing out the way the method is commonly described, which is in terms of multivariate normality.
Why are you talking about multivariate techniques? It was my understanding we only had a single response variable in this situation...

I was just making the point that we typically don't give a **** what the actual response variable is distributed as. We make the assumption of normality on the error term - not the variables themselves. If for some reason you were in a situation where you assumed that everything in sight is multivariate normal (which isn't what is usually done in a typical regression setting), then you could check the marginal distribution of everything instead of the residuals (because I agree it's harder to assess multivariate normality - although there are decent techniques), because in that situation it doesn't matter. In the typical regression setting we don't require normality of either x or y.
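
A small simulation can make this point concrete. The sketch below is only an illustration under assumed values: the exponential predictor, the coefficients 2 and 3, the sample size, and the helper names `skewness` and `fit_line` are all made up for the example and do not come from the original poster's data. It generates a skewed x, builds y from a linear model with normal errors, and compares the skewness of y with the skewness of the residuals.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: mean cubed deviation over the cubed population SD."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)


def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical data: a right-skewed predictor and normally distributed errors.
rng = random.Random(0)
n = 5000
x = [rng.expovariate(1.0) for _ in range(n)]
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

skew_y = skewness(y)
skew_resid = skewness(residuals)
print(f"slope estimate:        {slope:.3f}")
print(f"skewness of y:         {skew_y:.3f}")
print(f"skewness of residuals: {skew_resid:.3f}")
```

In a run like this, y is clearly right-skewed because it inherits the shape of x, while the residuals are close to symmetric, which is the quantity the normality assumption is actually about.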

12. ## Re: Linear regression with non-normal data?

We make the assumption of normality on the error term - not the variables themselves.
This is the part that confuses me, as the commonly recommended test for multivariate normality in regression is to test the variables, not to look at the residuals of the regression. I have read and been taught this numerous times. But perhaps, as I suggested earlier, this is simply because commercial software does not support tests such as Mardia's test of multivariate skewness.

13. ## Re: Linear regression with non-normal data?

Yeah but once again you're talking about multivariate regression. That probably is the 'norm' for multivariate stuff - I don't deal with that very often. Like I mentioned I was under the impression we were talking about a single response variable.

14. ## Re: Linear regression with non-normal data?

Originally Posted by noetsi
This is the part that confuses me, as the commonly recommended test for multivariate normality in regression is to test the variables, not to look at the residuals of the regression. I have read and been taught this numerous times. But perhaps, as I suggested earlier, this is simply because commercial software does not support tests such as Mardia's test of multivariate skewness.
I think there is some confusion here. The topic is linear regression with one response variable/DV and one or more predictors. The first source you are citing refers to multivariate regression, a less-used technique with multiple response variables. The second source is probably talking about multiple linear regression, but is wrong. The fact that SPSS and SAS don't support Mardia's coefficient doesn't really matter in relation to multiple linear regression, because multiple linear regression does not assume multivariate normality. All the researcher has to do is run the regression, save the residuals, and then apply conventional assessments of normality (and other assumptions) to the residuals - e.g. skew, kurtosis, q-q plots, Shapiro-Wilk if he or she is that way inclined, etc.
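
The workflow described above (run the regression, save the residuals, assess the residuals) might look roughly like the following. This is a minimal standard-library sketch, not the output of any particular statistics package: the simulated data, the function names `ols_residuals`, `pearson`, and `residual_diagnostics`, and the choice of plotting positions `(i + 0.5) / n` are all assumptions made for illustration. It reports skew, excess kurtosis, and the correlation between the ordered residuals and normal quantiles, which is a numeric stand-in for eyeballing a q-q plot.

```python
import math
import random
import statistics


def ols_residuals(x, y):
    """Fit y = b0 + b1 * x by least squares and return the saved residuals."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = math.sqrt(sum((ai - ma) ** 2 for ai in a)
                    * sum((bi - mb) ** 2 for bi in b))
    return num / den


def residual_diagnostics(residuals):
    """Skew, excess kurtosis, and q-q correlation of the residuals."""
    n = len(residuals)
    mean = statistics.fmean(residuals)
    sd = statistics.pstdev(residuals)
    z = sorted((r - mean) / sd for r in residuals)  # standardized, ordered
    normal = statistics.NormalDist()
    # Theoretical normal quantiles at plotting positions (i + 0.5) / n.
    q = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return {
        "skew": sum(v ** 3 for v in z) / n,
        "excess_kurtosis": sum(v ** 4 for v in z) / n - 3.0,
        "qq_correlation": pearson(q, z),
    }


# Hypothetical data: same predictor, one response with normal errors and
# one with right-skewed (centered exponential) errors.
rng = random.Random(1)
n = 2000
x = [rng.gauss(50.0, 10.0) for _ in range(n)]
y_normal = [5.0 + 0.8 * xi + rng.gauss(0.0, 4.0) for xi in x]
y_skewed = [5.0 + 0.8 * xi + 4.0 * (rng.expovariate(1.0) - 1.0) for xi in x]

normal_diag = residual_diagnostics(ols_residuals(x, y_normal))
skewed_diag = residual_diagnostics(ols_residuals(x, y_skewed))
print("normal errors:", normal_diag)
print("skewed errors:", skewed_diag)
```

With normal errors the skew and excess kurtosis of the residuals sit near zero and the q-q correlation is very close to 1; with skewed errors all three move away from those values. A formal test such as Shapiro-Wilk would be applied to the same saved residuals, but it is left out here to keep the sketch dependency-free.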

15. ## The Following 2 Users Say Thank You to CowboyBear For This Useful Post:

jonathaned55 (11-13-2012), victorxstc (11-15-2011)

16. ## Re: Linear regression with non-normal data?

As Dason and I discussed in PMs, the term multivariate regression is often applied to regression where you have more than one variable, regardless of whether the variables are independent or dependent. I used it in that context.

My Sage monograph notes that while Gauss-Markov does not require normality, statistical tests do with smaller sample sizes. But, again, my point is that there is a lot of literature that raises normality in the context of linear regression (by which I mean regression with only one dependent variable), particularly skewness. So if skewness of variables in the model is not an issue, there are a lot of confused people.

17. ## Re: Linear regression with non-normal data?

But once again note that the normality assumption is on the error term and not on the variables themselves...

18. ## Re: Linear regression with non-normal data?

Originally Posted by noetsi
But, again, my point is that there is a lot of literature that raises normality in the context of linear regression (by which I mean regression with only one dependent variable), particularly skewness. So if skewness of variables in the model is not an issue, there are a lot of confused people.
You may be right that there are sources out there that say (in error) that multiple linear regression requires the response and/or predictors to be normally distributed. I think part of the problem is that a lot of "statistics" texts aimed at people in various social and life sciences are written by non-statisticians. People (like me) with some applied research knowledge but without the knowledge of statistical theory to back it up may find it harder to fully explicate what the actual specific assumptions of particular tests are and what the consequences of their violation are. But anyway, I think it's important that on this forum we try to give information that is as accurate as possible (regardless of what myths happen to be widespread in entry-level texts).

20. ## Re: Linear regression with non-normal data?

I am in a graduate statistics program in education, and testing variables (not the residuals) for skewness is an important part of the regression course. Indeed, we did not learn a way to analyze multivariate skewness from the residuals. I will also cite Tabachnick and Fidell from their statistics text "Using Multivariate Statistics", which has gone through 5 editions (suggesting it is pretty popular).

"Screening continuous variables for normality is an important early step in almost every multivariate analysis....Although normality of the variables is not always required for analysis...The solution is degraded if the variables are not normally distributed...."

However, it is true that in their multiple regression section they note that one can analyze the residuals as an alternative to analyzing the variables.
