Regression Analysis - what to do if population data is not normally distributed?

#1
Hi all

I'd like to run a regression analysis on data for the entire population, which I have at my disposal. However, the dependent (Y) variable is not normally distributed (I ran a chi-squared test for normality to confirm this). How do I deal with this situation? Since it cannot be a case of sampling error (the data represent the entire population), I'm not sure what to do. I have eliminated any outliers, but that hasn't made a difference.

Can I go ahead and run the regression even though the Y variable isn't normally distributed? I have of course already done this, out of curiosity, and found that the residuals from that analysis were also not normally distributed, which I guess is not surprising.

Any advice at all would be appreciated.

Thanks!
 
#2
That last fact is the most important one for regression: the residuals were not normally distributed. If that is the case, simple linear regression may not be the best choice. The dependent variable itself does not need to be normally distributed, and neither do the independent variables.
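To make the distinction between the distribution of Y and the distribution of the residuals concrete, here is a minimal sketch on simulated data (not the poster's data). Everything in it is an illustrative assumption: the exponential X, the true line y = 2 + 3x, the normal noise, and the helper names `skewness` and `ols_fit`. It only shows that Y can be strongly skewed while the residuals of a correctly specified linear fit are roughly symmetric.

```python
import random
import statistics


def skewness(values):
    """Population skewness: third central moment divided by sd cubed."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)


def ols_fit(x, y):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed setup: a skewed predictor and normally distributed noise.
x = [rng.expovariate(1.0) for _ in range(5000)]
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

intercept, slope = ols_fit(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

skew_y = skewness(y)
skew_resid = skewness(residuals)

print(f"slope estimate:        {slope:.3f}")
print(f"skewness of Y:         {skew_y:.3f}")
print(f"skewness of residuals: {skew_resid:.3f}")
```

In this setup Y inherits the skew of X, yet the residuals are close to symmetric, which is why the residuals rather than Y are the thing to check.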

It is interesting that you said the data represent the entire population. In that case you are not estimating any parameters, such as the correlation coefficient. Since you have data from the entire population, you know the population correlation coefficient: no confidence interval, no test, it just is what it is. Put that way, by doing a regression you are simply finding the line that best fits the data. Typically people then find a confidence interval for the coefficients. In your case this is not necessary, because the coefficient you come up with is the population coefficient. I would go with the regression if it tells the story you want to tell about the data, but keep the above discussion in mind when you need to explain it.

Disclaimer: Others may disagree with my assessment, including statistics instructors. Perhaps one of the moderators of this forum will comment.

~Matt
 
#3
Hi Matt

Thanks very much for your response. I was somehow under the impression that SLR required normally distributed X and Y variables, but now that you mention it, I can't find anything to that effect in any of the textbooks. Thanks for pointing that out!

That leaves me with the residual analysis. I have heard some people say that the residuals don't need to be normally distributed, as long as they have a mean of zero (or thereabouts) and the scatter plot passes visual inspection (homoscedasticity, etc.). However, others seem to say that if the residuals are not normally distributed, then there's a problem with the model itself. Can anyone comment on this?

Analysis of the residuals is shown below:

Mean -0.00000000000000064595
Standard Error 0.628497835
Median -0.626936348
Mode #N/A
Standard Deviation 3.610445184
Sample Variance 13.03531443
Kurtosis 0.48906549
Skewness 1.053273862
Range 13.1174862
Minimum -4.487686737
Maximum 8.62979946
Sum -2.13163E-14
Count 33
Confidence Level(95.0%) 1.28020819
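As a quick consistency check on the summary above, the standard error, variance, and confidence half-width can be related to each other with a few lines of Python. This is only a sketch that re-uses the reported figures; the variable names are illustrative, and the reading of "Confidence Level(95.0%)" as a t-based half-width for the mean is an assumption about how the software defines that row.

```python
import math

# Figures copied from the residual summary above.
sd = 3.610445184
n = 33
reported_se = 0.628497835
reported_var = 13.03531443
reported_half_width = 1.28020819

# Standard error of the mean is sd / sqrt(n); variance is sd squared.
se = sd / math.sqrt(n)
var = sd ** 2

# If the half-width is (critical value) * (standard error), this ratio
# recovers the critical value the software used.
implied_multiplier = reported_half_width / reported_se

print(f"sd / sqrt(n)       = {se:.9f} (reported {reported_se})")
print(f"sd squared         = {var:.8f} (reported {reported_var})")
print(f"implied multiplier = {implied_multiplier:.4f}")
```

The implied multiplier comes out at about 2.037, which matches the two-sided 95% t critical value for 32 degrees of freedom, so the summary rows are consistent with one another.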

Chi-Squared Test of Normality
Interval         Probability   Expected    Observed
(z <= -1)        0.158655      5.235615    4
(-1 < z <= 0)    0.341345      11.264385   15
(0 < z <= 1)     0.341345      11.264385   8
(z > 1)          0.158655      5.235615    6

chi-squared Stat 2.5881
df 1
p-value 0.1077
chi-squared Critical 3.8415
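The chi-squared figures above can be reproduced from the observed counts alone. The sketch below assumes the test bins the standardized residuals at z = -1, 0, 1 and takes df = bins - 1 - 2 (two parameters, mean and standard deviation, estimated from the data), which agrees with the reported df of 1. The helper name `normal_cdf` is illustrative, and the closed-form p-value used here is valid only for 1 degree of freedom.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


observed = [4, 15, 8, 6]  # counts from the table above
n = sum(observed)         # 33 residuals

# Probability of each interval under a standard normal distribution.
probs = [
    normal_cdf(-1.0),
    normal_cdf(0.0) - normal_cdf(-1.0),
    normal_cdf(1.0) - normal_cdf(0.0),
    1.0 - normal_cdf(1.0),
]
expected = [n * p for p in probs]

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumed: 4 bins, minus 1, minus 2 estimated parameters.
df = len(observed) - 1 - 2

# For df = 1 only, the upper-tail probability is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2_stat / 2))

print(f"chi-squared = {chi2_stat:.4f}, df = {df}, p = {p_value:.4f}")
```

Run as written, this should land very close to the reported statistic of 2.5881 and p-value of 0.1077. Note that with only four bins and 33 observations the test is coarse, so a p-value above 0.05 is weak evidence of normality rather than proof of it.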


Many thanks!
 
#4
Hmm, what software did you use to produce these statistics? It appears that the test for normality is some sort of goodness of fit test. Since the p-value of this test was > 0.05,

chi-squared Stat 2.5881
df 1
p-value 0.1077
chi-squared Critical 3.8415

it would indicate that we should fail to reject the null hypothesis, that is, fail to reject the hypothesis that these data are normally distributed. That means the regression may be OK. Why did you conclude that the residuals are not normally distributed?

~Matt