Thread: Regression Analysis - what to do if population data is not normally distributed?

  1. #1
    Hi all

    I'd like to run a regression analysis using data for the entire population, which I have at my disposal. However, the dependent (Y) variable is not normally distributed (I ran a chi-squared test for normality to confirm this). How do I deal with this situation? Since it cannot be a case of sampling error (the data represent the entire population), I'm not sure what to do. I have eliminated any outliers, but that hasn't made a difference.

    Can I go ahead and run the regression analysis even though the Y variable isn't normally distributed? I have of course done this already, out of curiosity, and I found that the residuals from that analysis were also not normally distributed, which is not surprising, I guess.

    Any advice at all would be appreciated.

    Thanks!

  2. #2
    TS Contributor
    That last fact is the most important one for regression: the residuals were not normally distributed. If that is the case, simple linear regression may not be the best choice. The values of the dependent variable do not need to be normally distributed, nor do those of the independent variables.
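
To illustrate the point that Y itself need not be normal, here is a minimal sketch on synthetic data (an assumption for illustration, not the poster's data). X takes only two values, so Y is strongly bimodal, yet the residuals around the fitted line look normal. The helper names `ols_fit` and `excess_kurtosis` are hypothetical.

```python
import random

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def excess_kurtosis(values):
    """Population excess kurtosis: near 0 for normal data, near -2 for two tight clusters."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

rng = random.Random(0)
# Synthetic data: X is either 0 or 10, so Y forms two separate clusters,
# while the errors around the true line y = 2 + 3x are normal.
x = [rng.choice([0.0, 10.0]) for _ in range(2000)]
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

intercept, slope = ols_fit(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

kurt_y = excess_kurtosis(y)
kurt_resid = excess_kurtosis(residuals)
print(f"excess kurtosis of Y:         {kurt_y:.2f}")
print(f"excess kurtosis of residuals: {kurt_resid:.2f}")
```

In this sketch Y is far from normal while the residuals are not, which is why the residuals are the thing to check.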

    It is interesting that you said the data represent the entire population. In that case you are not estimating any parameters, such as the correlation coefficient. Since you have data from the entire population, you know the population correlation coefficient: no confidence interval, no test, it just is what it is. In those terms, I feel that by doing a regression you are simply finding the line that best fits the data. Typically people would then find a confidence interval for the coefficients. In your case this is not necessary, because the coefficient you come up with is the population coefficient. I would go with the regression if it tells the story you want to tell about the data, but keep the above discussion in mind when you need to explain it.

    Disclaimer: Others may disagree with my assessment, including statistics instructors. Perhaps one of the moderators of this forum will comment.

    ~Matt

  3. #3
    Hi Matt

    Thanks very much for your response. I was somehow under the impression that SLR required normally distributed X and Y variables, but now that you mention it, I can't find anything to that effect in any of the textbooks. Thanks for pointing that out!

    That leaves me with the residual analysis. I have heard some people say that the residuals don't need to be normally distributed, as long as they have a mean of zero (or thereabouts) and the scatter plot passes visual inspection (homoscedasticity, etc.). However, others seem to say that if the residuals are not normally distributed, then there's a problem with the model itself. Can anyone comment on this?

    Analysis of the residuals is shown below:

    Statistic                  Value
    Mean                       -0.00000000000000064595
    Standard Error             0.628497835
    Median                     -0.626936348
    Mode                       #N/A
    Standard Deviation         3.610445184
    Sample Variance            13.03531443
    Kurtosis                   0.48906549
    Skewness                   1.053273862
    Range                      13.1174862
    Minimum                    -4.487686737
    Maximum                    8.62979946
    Sum                        -2.13163E-14
    Count                      33
    Confidence Level (95.0%)   1.28020819
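
A side note on the near-zero mean in the summary above: for a least-squares line fitted with an intercept, the residuals sum to zero by construction, so a mean of zero says little about whether the model is appropriate. Here is a minimal sketch on hypothetical, deliberately curved data (not the poster's data); `ols_fit` is an illustrative helper.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data following y = x^2, so a straight line is the wrong model.
x = [float(i) for i in range(1, 21)]
y = [xi ** 2 for xi in x]

intercept, slope = ols_fit(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

mean_resid = sum(residuals) / len(residuals)
print(f"mean residual: {mean_resid:.2e}")

# The residuals still show a clear pattern: positive at both ends,
# negative in the middle, which a residual plot would reveal.
ends_positive = residuals[0] > 0 and residuals[-1] > 0
middle_negative = residuals[len(x) // 2] < 0
print(ends_positive, middle_negative)
```

So the mean is worth a glance as a sanity check, but the residual plot and the shape of the distribution carry the real information.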

    Chi-Squared Test of Normality
    Interval         Probability   Expected     Observed
    (z <= -1)        0.158655      5.235615     4
    (-1 < z <= 0)    0.341345      11.264385    15
    (0 < z <= 1)     0.341345      11.264385    8
    (z > 1)          0.158655      5.235615     6

    chi-squared Stat       2.5881
    df                     1
    p-value                0.1077
    chi-squared Critical   3.8415


    Many thanks!

  4. #4
    TS Contributor
    Hmm, what software did you use to produce these statistics? It appears that the test for normality is some sort of goodness-of-fit test. Since the p-value of this test was > 0.05,

    chi-squared Stat 2.5881
    df 1
    p-value 0.1077
    chi-squared Critical 3.8415

    it would indicate that we should fail to reject the null hypothesis, that is, fail to reject the hypothesis that these data are normally distributed. That means the regression may be OK. Why did you conclude that they are not normally distributed?
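
For anyone who wants to check the numbers, here is a minimal sketch that recomputes the statistic and p-value from the observed counts and bin probabilities posted in the table. The df breakdown (4 bins, minus 1, minus 2 estimated parameters) is an assumption that happens to match the reported df of 1; the thread does not say how the software arrived at it.

```python
import math

# Observed counts and standard-normal bin probabilities from the posted table.
observed = [4, 15, 8, 6]
probabilities = [0.158655, 0.341345, 0.341345, 0.158655]
n = sum(observed)  # 33 residuals

expected = [p * n for p in probabilities]
chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumed: df = bins - 1 - estimated parameters (mean and SD) = 4 - 1 - 2 = 1.
df = len(observed) - 1 - 2

# For df = 1 only, the chi-squared upper-tail probability has the
# closed form erfc(sqrt(x / 2)), so no statistics library is needed.
p_value = math.erfc(math.sqrt(chi2_stat / 2.0))

print(f"chi-squared = {chi2_stat:.4f}, df = {df}, p = {p_value:.4f}")
```

The result agrees with the posted output to four decimal places, and since 0.1077 > 0.05 the test does not reject normality at the 5% level.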

    ~Matt

  5. #5
    What about taking the log of the variable? Would that help to make it normally distributed?
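
As a rough illustration of what a log transform can do, here is a minimal sketch on synthetic, strictly positive, right-skewed data (log-normal draws, an assumption chosen because the log then gives exactly normal values; it is not the poster's data). The `skewness` helper is hypothetical.

```python
import math
import random

def skewness(values):
    """Population skewness: about 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(1)
# Synthetic right-skewed data; every value is positive, so log() is defined.
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

Whether this helps with real data depends on the shape of the variable. The log only applies to positive values, and it changes the interpretation of the regression coefficients, so the residuals of the refitted model would still need checking.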
