# Thread: Linear regression with non-normal data?

1. ## Re: Linear regression with non-normal data?

I enjoyed every bit of this discussion and learned (hopefully) something. Thanks.
+1

Are the errors all pebbles or are they a mix of grains, pebbles, boulders, and mountains? This is the best illustration of the problem I came across when reading through the history of regression, which was originally developed independently by Gauss and Adrien-Marie Legendre. Gauss was trying to figure out the most likely path of an orbit.

I'm a little confused how the belief that "predictors must be normally distributed" can be reconciled with the simple observation that we often include categorical predictors in multiple regression. It's hard to think of a more non-normal variable than a binary variable! So do people think that those variables are just exempt from the assumption or what?
It seems that the important thing when deciding whether to use regression on a binary variable or not is whether your error rate in assigning 0's and 1's will be normal. If you had a group of people who either earn $10 or $100000 and you 'assign' a 0 (earns $10) to a true 1 (earns $100000), then regression may not be the best tool to use, as your errors will not be normal.

2. ## Re: Linear regression with non-normal data?

But I feel a little guilty if I think I might be messing up the original poster by raising these issues. I take consolation from the fact that they probably forgot it long ago since they have not posted here in a while.
The discussion (stimulating as it is) has not messed me up - in all honesty I have ignored most of it because to be frank, since I am not a statistician, most of it has been over my head. And come on, my absence has only been about 6 days!

What I find amazing is how often I have read, and been taught, that normality is important for linear regression.
I do find it a common misconception that the variables in a linear regression model have to be normally distributed.
The overwhelming consensus is that my belief that the variables must be normally distributed is incorrect. So I thank everyone for at least setting me straight on that.

3. ## Re: Linear regression with non-normal data?

It was my belief as well (that it was required for linear regression). And, remarkably, I have been taught that by professors and read it in many statistical links.

One minor thing to remember: while the consensus is that it is not required, there is an exception for statistical tests with small sample sizes. In that case you need it for the tests (that is, to generate accurate p values). But you probably won't run regression very often with that few cases anyway.

4. ## Re: Linear regression with non-normal data?

I hope you're not saying you need the predictor variables to be normally distributed for small sample sizes (because that isn't true either). We need the error terms to be normally distributed for small sample sizes. For larger sample sizes we can get away with departures from normality, but at no point do we ever require the predictors to be normally distributed.
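To see the point about predictors numerically, here is a minimal Python sketch using simulated data only. The model `y = 2 + 3*x + e`, the sample size, and the helper `ols_fit` are all made up for illustration. The predictor is binary (about as non-normal as it gets), the errors are normal, and least squares still recovers the coefficients with well-behaved residuals.

```python
import random
import statistics

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(0)
n = 5000

# A binary predictor: clearly not normally distributed.
x = [rng.choice([0, 1]) for _ in range(n)]

# Hypothetical true model: y = 2 + 3*x + e, with e ~ Normal(0, 1).
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

intercept, slope = ols_fit(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
print(f"residual mean = {statistics.fmean(residuals):.3f}, "
      f"residual sd = {statistics.stdev(residuals):.2f}")
```

The estimates should land close to 2 and 3, and the residual standard deviation close to 1, even though `x` only takes the values 0 and 1.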

5. ## Re: Linear regression with non-normal data?

"For larger sample sizes we can get away with departures from normality but at no point do we ever require the predictors to be normally distributed."

A reference, or something? I ask because in my regression analyses the residuals are generally not normally distributed, with the P-P plot showing a bow-shaped curve, which is really annoying.

Thanks.

6. ## Re: Linear regression with non-normal data?

We get the "for larger sample sizes" part from the asymptotic normality of OLS estimates of the regression coefficients: http://en.wikipedia.org/wiki/Proofs_...c_normality_of

Since the OLS estimate is the same as the maximum likelihood estimate in the case of regression you can also use the asymptotic normality of the MLEs if you want.

The part about not requiring normality of the predictors comes from the fact that we don't require any assumptions about the distribution of the predictor when deriving the distribution of the parameter estimates. It's literally just something that we don't require to derive the theory - so there really isn't a proof that we don't need the predictors to be normally distributed other than the fact that we don't need the predictors to be normally distributed to derive all of the properties of the estimates.

With that said we typically want normally distributed errors. And depending on your situation there might be other more appropriate methods to use to analyze the data.
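If a simulation is more convincing than a proof, the following Python sketch illustrates the large-sample behaviour. Everything in it is an assumption chosen for illustration: the true line `y = 1 + 0.5*x`, the uniform predictor, the centred exponential errors (mean 0, strongly right-skewed), and the helper names `ols_slope` and `simulate_slopes`. It refits the regression on many simulated datasets and checks that the slope estimates behave roughly like a normal sample.

```python
import random
import statistics

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def simulate_slopes(n, reps, rng, true_slope=0.5):
    """Slope estimates from `reps` simulated datasets with skewed errors."""
    slopes = []
    for _ in range(reps):
        x = [rng.uniform(0.0, 10.0) for _ in range(n)]
        # Exponential(1) minus 1: mean 0, variance 1, clearly non-normal.
        y = [1.0 + true_slope * xi + (rng.expovariate(1.0) - 1.0) for xi in x]
        slopes.append(ols_slope(x, y))
    return slopes

rng = random.Random(42)
slopes = simulate_slopes(n=200, reps=1000, rng=rng)

mean_slope = statistics.fmean(slopes)
sd_slope = statistics.stdev(slopes)

# For a normal distribution, about 95% of values lie within 1.96 SDs of the mean.
within = sum(abs(s - mean_slope) <= 1.96 * sd_slope for s in slopes) / len(slopes)

print(f"mean of slope estimates = {mean_slope:.3f} (true slope 0.5)")
print(f"share within 1.96 SDs   = {within:.3f} (about 0.95 if roughly normal)")
```

This only illustrates the result for one error distribution and one sample size; it is not a substitute for the asymptotic argument linked above.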

7. ## The Following User Says Thank You to Dason For This Useful Post:

Donald (03-18-2013)

8. ## Re: Linear regression with non-normal data?

Thank you so much!

9. ## Re: Linear regression with non-normal data?

This has been a great thread. I just want to confirm what I'm reading.

If we have the model

$y = \beta_0 + \beta_1 X + \varepsilon$

We are saying that $\varepsilon$ is assumed to be normally distributed. X does not have to be; y does not need to be (though it might help).

Or said another way, $\varepsilon$ being normally distributed implies that y is normally distributed conditional on X.

Correct?

10. ## Re: Linear regression with non-normal data?

Right. Except I think you can even drop this part:
(though it might help)
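Written out, as a sketch under the usual textbook assumptions (the symbols $\beta_0$, $\beta_1$, $\sigma^2$ and the constant error variance are notation and assumptions added here, not taken from the posts above):

```latex
\begin{aligned}
y &= \beta_0 + \beta_1 X + \varepsilon,
  \qquad \varepsilon \mid X \sim \mathcal{N}(0, \sigma^2) \\
\Longrightarrow \quad y \mid X = x &\sim \mathcal{N}(\beta_0 + \beta_1 x,\ \sigma^2)
\end{aligned}
```

The marginal distribution of y is then a mixture of these conditional normals over whatever distribution X has, so y on its own is normal only in special cases (for example, when X is itself normal). With a binary X, y is a mixture of two normals and can look bimodal while the model assumptions hold perfectly.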

11. ## Re: Linear regression with non-normal data?

If I understand correctly, normality is critical in linear regression to whether the p value is valid or not. And commonly among data analysts the p value is the most important thing people care about (the specific level of the IV is rarely critical if it's significant, since a substantive interpretation of whether an effect size is large is very difficult to do and comparisons of relative importance are not simple to do in linear regression, or at least not agreed on).

Is it the normality of the error terms or the raw data that would determine the validity of p (or is p not influenced as I assume by normality)?

12. ## Re: Linear regression with non-normal data?

The error term.

13. ## The Following User Says Thank You to Jake For This Useful Post:

noetsi (07-15-2013)

14. ## Re: Linear regression with non-normal data?

So you would want to see if the residual distribution is normal I assume.

15. ## Re: Linear regression with non-normal data?

Originally Posted by noetsi
So you would want to see if the residual distribution is normal I assume.
Yes - that is the typical way to assess that assumption.
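As a rough sketch of what checking the residuals can look like without any plotting library, the Python below fits a line, takes the residuals, and computes the correlation between the sorted residuals and standard normal quantiles (the straightness of a normal Q-Q plot, summarised as one number). The data are simulated, and the helper names `ols_residuals`, `pearson`, and `qq_correlation` plus the plotting positions `(i - 0.5) / n` are choices made here for illustration. In practice most people would simply look at the Q-Q plot.

```python
import random
import statistics
from statistics import NormalDist

def ols_residuals(x, y):
    """Residuals from the least-squares line of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return sab / (saa * sbb) ** 0.5

def qq_correlation(residuals):
    """Correlation of sorted residuals with standard normal quantiles.

    Values very close to 1 mean the normal Q-Q plot is nearly a straight line.
    """
    n = len(residuals)
    ordered = sorted(residuals)
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return pearson(ordered, theoretical)

rng = random.Random(1)
n = 500
x = [rng.uniform(0.0, 10.0) for _ in range(n)]

# Same hypothetical line, two different error distributions.
y_normal = [1.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
y_skewed = [1.0 + 0.5 * xi + (rng.expovariate(1.0) - 1.0) for xi in x]

qq_normal = qq_correlation(ols_residuals(x, y_normal))
qq_skewed = qq_correlation(ols_residuals(x, y_skewed))

print(f"Q-Q correlation, normal errors: {qq_normal:.4f}")
print(f"Q-Q correlation, skewed errors: {qq_skewed:.4f}")
```

The skewed-error residuals should give a visibly lower value, which corresponds to the bowed curve people see in P-P or Q-Q plots.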

16. ## Re: Linear regression with non-normal data?

Originally Posted by noetsi
And commonly among data analysts the p value is the most important thing people care about (the specific level of the IV is rarely critical if it's significant, since a substantive interpretation of whether an effect size is large is very difficult to do and comparisons of relative importance are not simple to do in linear regression, or at least not agreed on).
I guess it's hard to determine what people care about most, but a lot of people would argue that the p value is not very important (think of the whole practical vs statistical significance issue).

At the end of the day, all the p value tells you is the probability of observing a coefficient at least as large in magnitude as the one observed, given that the true population parameter is exactly zero. A lot of the time, the idea that the true parameter is exactly zero is really implausible anyway. Personally I'm usually a lot more interested in point and interval estimates for the parameter.
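To make the contrast concrete, here is a small Python sketch that reports both a p value and an interval estimate for a slope. Two things in it are assumptions made for illustration: the data are simulated with a deliberately tiny true slope (0.02), and the inference uses a large-sample normal approximation rather than the exact t distribution that regression software normally uses. The function name `slope_inference` is also made up.

```python
import math
import random
import statistics
from statistics import NormalDist

def slope_inference(x, y, level=0.95):
    """Slope, standard error, two-sided p value, and confidence interval.

    Uses a normal approximation, which is reasonable only for large samples.
    """
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = slope / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return slope, se, p_value, (slope - z_crit * se, slope + z_crit * se)

rng = random.Random(7)
n = 20000
x = [rng.uniform(0.0, 10.0) for _ in range(n)]
# Hypothetical model with a very small true slope and unit-variance noise.
y = [5.0 + 0.02 * xi + rng.gauss(0.0, 1.0) for xi in x]

slope, se, p_value, (ci_low, ci_high) = slope_inference(x, y)

print(f"slope = {slope:.4f} (se {se:.4f})")
print(f"p value = {p_value:.2g}")
print(f"95% interval = ({ci_low:.4f}, {ci_high:.4f})")
```

With this many observations the p value is tiny, but the interval shows the slope itself is small: the effect is clearly nonzero and may still be practically negligible.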

a substantive interpretation of whether an effect size is large is very difficult to do
I think interpreting coefficients is hard when the variables are measured on arbitrary scales. E.g. regressing score on some psychometric test on score on some other psychometric test. Then we often need to convert coefficients to standardised form, and interpret them in terms of t-shirt sizes (e.g. correlation of 0.5 = "large").

But when the scaling of variables carries some substantive meaning, things aren't so bad.

E.g., consider a regression of income on height in the US given by Baguley, 2010, adapted from Gelman and Hill, 2007.

earnings = 60515 + 1256 x height in inches

(I think earnings are per annum, but am not 100% sure)

Knowing that an extra inch of height is associated with an extra USD 1256 of earnings is an interesting piece of information whose importance we can grasp without any extra standardization or complicated interpretive scheme. Heck, standardizing this into a correlation (r = .24) hides the magnitude of the effect, if anything. Furthermore, having a best estimate of the quantity of earnings associated with a one-inch increase in height is a lot more informative than simply knowing that the data would be unlikely if the true relationship were zero (per a p value).
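The relationship between the raw slope and the correlation can be sketched in a few lines of Python. The data below are purely simulated and are not the data from Baguley or Gelman and Hill. The generating values (mean height 66 inches, slope 1250, noise standard deviation 19000) are hypothetical numbers picked only so the output is on a familiar scale, and `slope_and_r` is a made-up helper name.

```python
import math
import random
import statistics

def slope_and_r(x, y):
    """Return (least-squares slope of y on x, Pearson correlation)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

rng = random.Random(3)
n = 2000

# Simulated heights (inches) and earnings; all parameters are hypothetical.
height_in = [rng.gauss(66.0, 4.0) for _ in range(n)]
earnings = [-60000.0 + 1250.0 * h + rng.gauss(0.0, 19000.0) for h in height_in]

# The same heights expressed in centimetres.
height_cm = [2.54 * h for h in height_in]

slope_in, r_in = slope_and_r(height_in, earnings)
slope_cm, r_cm = slope_and_r(height_cm, earnings)

# Standardizing: r equals the raw slope times sd(x) / sd(y).
sd_ratio = statistics.stdev(height_in) / statistics.stdev(earnings)

print(f"slope per inch = {slope_in:.1f}, slope per cm = {slope_cm:.1f}")
print(f"r (inches) = {r_in:.3f}, r (cm) = {r_cm:.3f}")
print(f"slope_in * sd(x)/sd(y) = {slope_in * sd_ratio:.3f}")
```

The raw slope carries its units (dollars per inch or per centimetre), while the correlation is identical in both versions because the units have been divided out. That is the sense in which standardizing discards the magnitude information.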

Baguley, Thom. When Correlations Go Bad. The Psychologist 23, no. 2 (2010): 122-123.
