Linear regression with non-normal data?

#1
It is my understanding that for linear regression both data sets must be normally distributed.

One of my two sets of data (x) is normally distributed, but the second (y) has some values that could be outliers. When I make a histogram of my y data it looks something like this:


[histogram image]
(If the image does not show in the message post you can see it at this link: http://tinypic.com/r/w14ldc/5)
When I remove the outliers (to the right) the histogram looks like a normal distribution (the data also pass other tests of normality).

Here is my conundrum.

There is very, very little difference in r squared and P from the linear regression between leaving the outliers in and taking them out. Linear regression with the outliers left in the data results in an r squared of 0.201 and a P < 0.00001. Linear regression with the outliers removed results in an r squared of 0.198 and still a P < 0.00001. The only difference is my resulting y = mx + b equation, and the y equation works a lot better when the outliers are left in my analysis.

So, given the extremely significant P values and the very small difference in r squared, do I still have to remove the outliers? Or under these circumstances would using linear regression still be appropriate even though my y data does not appear to come from a normal distribution?
 

Dason

Ambassador to the humans
#2
It is my understanding that for linear regression both data sets must be normally distributed.
Your understanding is incorrect. We don't make any assumptions on how the predictors are distributed. And we don't make any assumptions about the marginal distribution of the y values. What we want to be normally distributed are the error terms. You can check this assumption by running the regression and then checking the residuals you get from the model.
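Not from the original thread, but to make "run the regression, then check the residuals" concrete, here is a minimal sketch in Python with statsmodels; the simulated x and y are stand-ins for the poster's data, and any of the usual diagnostics can be swapped in.

```python
# Minimal sketch, assuming numpy/scipy/statsmodels are available.
# x and y are simulated stand-ins for the poster's two data sets.
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)                  # predictor
y = 2.0 + 0.5 * x + rng.normal(size=200)  # response with normal errors

X = sm.add_constant(x)       # design matrix with an intercept column
fit = sm.OLS(y, X).fit()     # ordinary least squares fit
resid = fit.resid            # these, not y itself, should look normal

print(stats.shapiro(resid))  # Shapiro-Wilk test applied to the residuals
sm.qqplot(resid, line="s")   # Q-Q plot of the residuals against a normal
```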
 

noetsi

No cake for spunky
#3
In practice, if not in theory, if the univariate variables are heavily skewed (particularly if they are skewed in the opposite direction from the dependent variable they predict) it will create significant problems for many methods such as SEM and some regressions (ANOVA less so if the sample size is high enough). It is common in well-known statistical texts to use univariate tests of normality even though this is not technically correct. That is because multivariate tests exist in very few software packages (Mardia's multivariate test, for example, is not in SAS or SPSS).
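(Not from the thread: Mardia's skewness statistic is actually straightforward to compute by hand where software lacks it. A sketch in Python under the usual Mardia (1970) formulation; the function name is mine.)

```python
# Sketch of Mardia's multivariate skewness and its asymptotic chi-square
# test, assuming the standard Mardia (1970) formulation.
import numpy as np
from scipy.stats import chi2

def mardia_skewness(X):
    """Return Mardia's skewness b1p, the test statistic, and a p-value
    for an (n, p) data matrix X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)                                    # center columns
    S_inv = np.linalg.inv(np.cov(X, rowvar=False, bias=True))  # MLE covariance
    D = Xc @ S_inv @ Xc.T            # matrix of Mahalanobis cross-products
    b1p = (D ** 3).sum() / n ** 2    # Mardia's skewness
    stat = n * b1p / 6.0             # ~ chi-square under multivariate normality
    df = p * (p + 1) * (p + 2) / 6.0
    return b1p, stat, chi2.sf(stat, df)

rng = np.random.default_rng(0)
print(mardia_skewness(rng.normal(size=(100, 3))))  # demo on normal data
```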

There is an ongoing dispute in the statistical literature on how robust various methods are when multivariate normality is violated. Different authors say very different things on this topic (robustness refers to whether a method will still work when an assumption such as normality is violated).
 

noetsi

No cake for spunky
#5
To test if the assumption of normality in the method is met. For example, if you run regression (which assumes multivariate normality) the tests often recommended are univariate skewness tests. I know that this is not formally correct, and the authors state this, but they recommend it as better than the alternative, which is not to test this at all. Few know how to test multivariate skewness etc.
 

Dason

Ambassador to the humans
#6
Regression doesn't assume multivariate normality. It assumes that the dependent variable conditioned on the independent variables is normally distributed. In other words, the assumption of normality is on the error term.
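Written out (a standard textbook statement of the model being described, not text from the thread):

```latex
% Simple linear regression: the normality assumption is on the errors.
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)
% Equivalently, the response is normal conditionally on the predictor:
y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)
% Nothing is assumed about the distribution of the x_i, nor about the
% marginal (unconditional) distribution of the y_i.
```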
 

noetsi

No cake for spunky
#7
The basic assumptions of multivariate regression are 1) multivariate normality of the residuals, 2) homogenous variances of residuals conditional on predictors, 3) common covariance structure across observations, and 4) independent observations. Unfortunately, testing the first three assumptions is very difficult.
http://www.ats.ucla.edu/stat/sas/library/multivariate_regrssn.htm

Regression assumes that variables have normal distributions. Non-normally distributed variables (highly skewed or kurtotic variables, or variables with substantial outliers) can distort relationships and significance tests. There are several pieces of information that are useful to the researcher in testing this assumption: visual inspection of data plots, skew, kurtosis, and P-P plots give researchers information about normality, and Kolmogorov-Smirnov tests provide inferential statistics on normality. Outliers can be identified either through visual inspection of histograms or frequency distributions, or by converting data to z-scores.
http://pareonline.net/getvn.asp?v=8&n=2

I still don't understand the distinction you make between multivariate normality and multivariate normality of the residuals, but in any case (perhaps as a shorthand) it is common to say that regression requires multivariate normality and to do the test I noted previously.

I am not questioning that you are right. Just pointing out the way the method is commonly described, which is in terms of multivariate normality.
 

Dason

Ambassador to the humans
#8
I still don't understand the distinction you make between multivariate normality and multivariate normality of the residuals, but in any case (perhaps as a shorthand) it is common to say that regression requires multivariate normality and to do the test I noted previously.

I am not questioning that you are right. Just pointing out the way the method is commonly described, which is in terms of multivariate normality.
Why are you talking about multivariate techniques? It was my understanding we only had a single response variable in this situation...

I was just making the point that we typically don't give a **** what the actual response variable is distributed as. We make the assumption of normality on the error term - not the variables themselves. If for some reason you were in a situation where you assumed that everything in sight is multivariate normal (which isn't what is typically done in a regression setting) then you could check the marginal distribution of everything (because I agree it's harder to assess multivariate normality - although there are decent techniques) instead of the residuals, because in that situation it doesn't matter. In the typical regression setting we don't require normality of either x or y.
 

noetsi

No cake for spunky
#9
We make the assumption of normality on the error term - not the variables themselves.
This is the part that confuses me, as the recommended test for multivariate normality in regression is commonly to test the variables, not to look at the residuals of the regression. I have read and been taught this numerous times. But perhaps, as I suggested earlier, this is simply because commercial software does not support tests such as Mardia's test of multivariate skewness.
 

Dason

Ambassador to the humans
#10
Yeah but once again you're talking about multivariate regression. That probably is the 'norm' for multivariate stuff - I don't deal with that very often. Like I mentioned I was under the impression we were talking about a single response variable.
 

CB

Super Moderator
#11
This is the part that confuses me, as the recommended test for multivariate normality in regression is commonly to test the variables, not to look at the residuals of the regression. I have read and been taught this numerous times. But perhaps, as I suggested earlier, this is simply because commercial software does not support tests such as Mardia's test of multivariate skewness.
I think there is some confusion here. The topic is linear regression with one response variable/DV and one or more predictors. The first source you are citing refers to multivariate regression, a less-used technique with multiple response variables. The second source is probably talking about multiple linear regression, but is wrong. The fact that SPSS and SAS don't support Mardia's coefficient doesn't really matter in relation to multiple linear regression, because multiple linear regression does not assume multivariate normality. All the researcher has to do is run the regression, save the residuals, and then apply conventional assessments of normality (and other assumptions) to the residuals - e.g. skew, kurtosis, q-q plots, Shapiro-Wilk if he or she is that way inclined, etc.
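(Again not from the thread: a sketch in Python/scipy of the battery CB lists, where resid stands in for the residuals saved from the fitted model, as in the example under post #2.)

```python
# Sketch of the checks listed above, assuming 'resid' holds the saved
# residuals from the fitted regression (simulated here so the snippet runs).
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
resid = rng.normal(size=200)  # stand-in for residuals saved from the model

print("skew:           ", stats.skew(resid))      # ~0 for normal residuals
print("excess kurtosis:", stats.kurtosis(resid))  # ~0 for normal residuals
print("Shapiro-Wilk:   ", stats.shapiro(resid))   # formal test, if inclined
stats.probplot(resid, dist="norm", plot=plt)      # q-q plot of the residuals
plt.show()
```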
 

noetsi

No cake for spunky
#12
As Dason and I discussed in PMs, the term multivariate regression is often applied to any regression with more than one variable, regardless of whether those variables are independent or dependent. I used it in that context.

My Sage monograph notes that while Gauss-Markov does not require normality, statistical tests do with smaller sample sizes. But, again, my point is that there is a lot of literature that raises normality in the context of linear regression (by which I mean regression with only one dependent variable). Particularly skewness. So if skewness of the variables in the model is not an issue, there are a lot of confused people :)
 

CB

Super Moderator
#14
But, again, my point is that there is a lot of literature that raises normality in the context of linear regression (by which I mean regression with only one dependent variable). Particularly skewness. So if skewness of the variables in the model is not an issue, there are a lot of confused people :)
You may be right that there are sources out there that say (in error) that multiple linear regression requires that the response and/or predictors be normally distributed. I think part of the problem is that a lot of "statistics" texts aimed at people in various social and life sciences are written by non-statisticians. People (like me) with some applied research knowledge but without the knowledge of statistical theory to back it up may find it harder to fully explicate what the actual specific assumptions of particular tests are and what the consequences of their violation are. But anyway, I think it's important that on this forum we try to give information that is as accurate as possible (regardless of what myths happen to be widespread in entry-level texts).
 

noetsi

No cake for spunky
#15
I am in a graduate statistics program in education, and testing variables (not the residuals) for skewness is an important part of the regression course. Indeed we did not learn a way to analyze multivariate skewness from the residuals. I will also cite Tabachnick and Fidell from their statistics text "Using Multivariate Statistics" which has gone through 5 editions (suggesting it is pretty popular).

"Screening continuous variables for normality is an important early steop in almost every multivariate analysis....Although normality of the variables is not always required for analysis...The solution is degraded if the the variables are not normally distributed...."

However, it is true that in their multiple regression section they note that one can analyze the residuals as an alternative to analyzing the variables.
 

CB

Super Moderator
#16
Indeed we did not learn a way to analyze multivariate skewness from the residuals.
I did say earlier that you don't need to analyze multivariate skewness of the residuals for multiple linear regression. You only need to look at the univariate distribution of the residuals, which can be assessed in much the same way you have learnt to assess the normality of individual variables.

I will also cite Tabachnick and Fidell from their statistics text "Using Multivariate Statistics" which has gone through 5 editions (suggesting it is pretty popular).

"Screening continuous variables for normality is an important early steop in almost every multivariate analysis....Although normality of the variables is not always required for analysis...The solution is degraded if the the variables are not normally distributed...."
I think this reinforces my point earlier. T&F is a very popular and very readable entry-level text, but both of the authors are psychologists (not statisticians). In this case they're making a claim that I think is too vague to have any useful meaning. Different multivariate analyses have different assumptions, most of which don't refer to the distributions of individual variables (or at least not exclusively). To make a general claim that the "solution is degraded" (what does this even mean?) when non-normal variables are used in multivariate analysis, regardless of analysis type, doesn't sound very sensible to me - comments, Dason?
 

noetsi

No cake for spunky
#17
My point, as it has been all along, is that formally they are probably wrong. But what they are saying (and plenty of other people say it in classes and statistical texts) is the norm, not the exception, in what the 99 percent of non-academics (and virtually all academics who are not statisticians) encounter.

I would not call their text "introductory" (except perhaps to true statisticians). Most academics, let alone the rest of the population, will never reach the level in the book. I did not for most of four graduate programs, including a PhD :) Indeed what they say is said in my graduate statistics program.

I took this way off the track despite discussing this with Dason by pm. My point is not what is factually correct. It is what most people who don't have doctorates in statistics encounter and believe.
 

CB

Super Moderator
#18
I would not call their text "introductory" (except perhaps to true statisticians). Most academics, let alone the rest of the population, will never reach the level in the book. I did not for most of four graduate programs, including a PhD :) Indeed what they say is said in my graduate statistics program.
Interesting. I guess it's a fuzzy definition. I would see it as being an introduction to multivariate data analysis, though perhaps not appropriate as a very first statistics text. Amazon calls it "an introduction to the most commonly encountered statistical and multivariate techniques".

My point, as it has been all along, is that formally they are probably wrong. But what they are saying (and plenty of other people say it in classes and statistical texts) is the norm, not the exception, in what the 99 percent of non-academics (and virtually all academics who are not statisticians) encounter.
Yep: this is a theme you talked about in the MLE thread too, IIRC. I don't necessarily disagree that many academics hold these views (I dunno about 99!) What's probably more important, though, is what the significance of that is. I get the feeling that your view is that if a majority of academics use a particular technique or have a particular belief, then that perhaps is something of a justification for us or other researchers to use that technique or believe that thing too. That's not something I agree with. A false belief doesn't become true through repetition; a test that gives the wrong answers won't start giving the right ones just because we like it a lot. If that isn't what you're getting at, I apologise!

I took this way off the track despite discussing this with Dason by pm. My point is not what is factually correct. It is what most people who don't have doctorates in statistics encounter and believe.
Meh... I think the boards should be here for interesting discussions, not just quick, dry answers to posters' questions :)
 

noetsi

No cake for spunky
#19
99 +- 20 percent :)

To me the significance of it is:

1) Statisticians need to let academics who are not statisticians know they are doing this stuff wrong. Perhaps they need a stats help line (that would never work).
2) Much of the literature on statistics is (as best I can judge) incorrect. But the people using it, practitioners and academics alike, are unlikely to know this. I find that very concerning, which is why I bring it up.
3) Statisticians disagree a lot among themselves. I am not sure if this is one of those areas or not.

I like, obviously, to discuss interesting stuff as well. But I feel a little guilty if I think I might be messing up the original poster by raising these issues. I take consolation from the fact that they probably forgot it long ago since they have not posted here in a while :)

I get the feeling that your view is that if a majority of academics use a particular technique or have a particular belief, then that perhaps is something of a justification for us or other researchers to use that technique or believe that thing too.
Actually I don't believe that at all. I really am concerned, given the amount of time I spend reading stats and taking classes in it, just how often things I swore were true may not be. It makes me wonder just how much analysis by academics and practitioners alike is simply invalid, because it is based on false information about statistics.
 

Dason

Ambassador to the humans
#20
1) Statisticians need to let academics who are not statisticians know they are doing this stuff wrong. Perhaps they need a stats help line (that would never work).
I actually (semi) provided this service last year. Consulting was available to anybody in the Ag department who wanted it. We used to help the engineers too but then they stopped supplementing our funding so we stopped helping them. So it wasn't free but the people that actually came to us didn't have to pay a direct fee.
3) Statisticians disagree a lot among themselves. I am not sure if this is one of those areas or not.
This particular topic isn't really one that is disagreed on. A linear model has the form Y = XB + e where E[e] = 0, blah blah blah; typically we add the assumption e ~ N(0, sigma^2). Not much to disagree on. It's easy to get messed up on, but I don't know of any disagreement because it's easy enough to just point to the theory in this case and be like "nope - we only need normality on the error term" (and actually we don't even need that for OLS to "work" - only if we care about small-sample inference).
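(A quick simulation, not from the thread, illustrating the point in Python: make the predictor badly skewed while keeping the errors normal, and the residual diagnostics still come out clean.)

```python
# Simulation sketch: a heavily skewed predictor does not violate the
# linear-model assumptions, because normality is assumed for e alone.
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.exponential(size=500)             # strongly right-skewed predictor
y = 1.0 + 2.0 * x + rng.normal(size=500)  # but the errors are N(0, 1)

fit = sm.OLS(y, sm.add_constant(x)).fit()

print("skew of x:        ", stats.skew(x))          # large: x is not normal
print("skew of residuals:", stats.skew(fit.resid))  # near 0: assumption holds
print("Shapiro-Wilk on residuals:", stats.shapiro(fit.resid))
```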

I like, obviously, to discuss interesting stuff as well. But I feel a little guilty if I think I might be messing up the original poster by raising these issues. I take consolation from the fact that they probably forgot it long ago since they have not posted here in a while :)
If we get far enough off topic and it seems like the OP still cares about their topic one of us mods will merge the off topic stuff into a separate thread (like I did with the MLE thread) so don't worry about it.