
Thread: Some IVs are normally distributed but some aren't, how to deal with it?

  1. #1

    Some IVs are normally distributed but some aren't, how to deal with it?




    Hi there,

    In my study I had two groups of people (depressed patients vs. healthy controls), and I compared their neurophysiological activity (EEG power), cognitive ability (TMT task), and total physical activity. Afterwards, I wanted to see how well these dependent variables could predict which group (patient or control) each subject belongs to.

    So here are my questions. Only the EEG power and cognitive ability data are normally distributed; the physical activity data are not. Should I then use a non-parametric test for all DVs? Or should I mix tests, using the independent-samples t-test for the normally distributed DVs and, say, the Mann-Whitney test for the rest?

    Also, for the classification I can opt for either linear discriminant analysis (LDA) or logistic regression (LR). I'm aware that normally distributed IVs are one of the prerequisites for using LDA, but in my case some IVs are normal and some aren't. Should I then consider LR, or should I run the analyses separately?

    Is there any better way to deal with a mixture of normally and non-normally distributed DVs and IVs?

    Thanks in advance.

  2. #2
    CowboyBear

    Re: Some IVs are normally distributed but some aren't, how to deal with it?

    OLS regression does not assume normally distributed IVs or DVs. It only assumes normally distributed errors, and it is robust to violations of that assumption with any reasonable sample size.

    See http://www.talkstats.com/showthread....y-TalkStatters! for a relevant resource.

    LDA assumes multivariate normality of all the IVs.
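
    To make the OLS point concrete, here is a minimal sketch on simulated data (the group means, spread, and seed are assumptions for illustration, not the OP's data). It shows that the pooled-variance t-test is the same thing as an OLS regression on a group dummy, and that the normality check belongs on the residuals rather than on the raw outcome.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical outcome (think "EEG power") for 19 patients and 20 controls.
patients = rng.normal(loc=9.0, scale=2.0, size=19)
controls = rng.normal(loc=10.0, scale=2.0, size=20)

# Classic pooled-variance independent-samples t-test.
t_stat, p_value = stats.ttest_ind(patients, controls, equal_var=True)

# The same comparison written as OLS: y = b0 + b1 * is_patient + error.
y = np.concatenate([patients, controls])
is_patient = np.concatenate([np.ones(19), np.zeros(20)])
X = np.column_stack([np.ones_like(is_patient), is_patient])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Standard error of the slope from the residual variance.
n, k = X.shape
sigma2 = residuals @ residuals / (n - k)
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
t_ols = beta[1] / np.sqrt(cov_beta[1, 1])

print(f"t-test: t = {t_stat:.4f}   OLS slope: t = {t_ols:.4f}")

# The assumption concerns the errors, so the check is run on the residuals
# (here: each value minus its own group mean), not on the raw outcome.
w_stat, w_p = stats.shapiro(residuals)
print(f"Shapiro-Wilk on residuals: W = {w_stat:.3f}, p = {w_p:.3f}")
```

    The two t statistics agree because the slope on the dummy is exactly the difference between the group means.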
    Matt aka CB | twitter.com/matthewmatix

  3. #3
    noetsi

    Re: Some IVs are normally distributed but some aren't, how to deal with it?

    Formally, it makes no difference at all what the distribution of the IVs is. Non-normal residuals will affect the statistical tests (but will not bias the slopes), but as CWB noted, with large sample sizes even this is not a major issue because of the central limit theorem.

    However, if 90-plus percent of the values are at one level of a dummy variable, this will attenuate the mean due to lack of variation. This is a different concept from normality.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  4. #4

    Re: Some IVs are normally distributed but some aren't, how to deal with it?

    @CowboyBear, thanks for your answer and the link! If I understood you correctly, you'd suggest LR for me, since multivariate normality of the IVs is violated in my case (not all IVs are normally distributed)?

    @noetsi, thanks for the response, but I don't quite follow. Perhaps my title was a bit misleading. My first question was: if my DVs are not all normally distributed, should I use the independent-samples t-test to compare the groups, or a non-parametric test? After testing the DVs between the groups, I'd like to see how well these DVs can classify my subjects, so I'd run a classification analysis. As far as I'm aware, LR and LDA are the options I can choose from. The DVs from the t-tests would now be the IVs for the classification. But LDA assumes multivariate normality, as CowboyBear said. Does that mean I should opt for LR?

    Another question: I have only 39 subjects (19 patients, 20 controls). From the resources I've read, sample size doesn't play a role in deciding which test to use. Is this true?
    Last edited by Ping Koo; 02-23-2017 at 11:24 AM. Reason: Needed to ask more questions

  5. #5
    noetsi

    Re: Some IVs are normally distributed but some aren't, how to deal with it?

    Your IV or your DV? If it's your IV, it does not matter at all. For your DV you should look at the residuals, not the DV itself, and see if they are normally distributed. If they are, then there is no problem. Multivariate normality is an assumption about the residuals, not the variables, which is something that seems to confuse people [and, in fairness, many texts talk about variables when they should talk about residuals].

    CWB and I are saying the same thing. Multivariate normality is a property of the residuals only. See if they are normal. If they are, then there is no problem. If they aren't, then you probably should not use a method that assumes normality, although in practice methods such as regression are very robust to this assumption if you have a large sample size. People disagree about what this means in practice. I have seen as few as 30 mentioned, although a hundred plus is better.

    Sample size does matter in practice, if not in theory, because with a larger sample size the central limit theorem means it does not matter much if your data are not normal. 39 would probably work; it would be better to have more, of course.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  6. #6
    CowboyBear

    Re: Some IVs are normally distributed but some aren't, how to deal with it?

    Quote Originally Posted by noetsi View Post
    Multivariate normality is an assumption about the residuals, not the variables, which is something that seems to confuse people [and, in fairness, many texts talk about variables when they should talk about residuals].

    CWB and I are saying the same thing. Multivariate normality is a property of the residuals only.
    Hold up...

    OLS regression assumes univariate normality of each of the error terms.

    Linear discriminant analysis assumes multivariate normality of the independent variables.

    (I think things are getting a bit muddled here because the OP's post mentioned two different analyses.)

    OP, as noetsi and I are saying, OLS regression (or a special case like the t-test) is fine for your first analysis (at least unless some other assumption violation is present). But yes, for the classification analysis, I would typically prefer logistic regression. It doesn't have as stringent a set of assumptions as discriminant analysis. It also produces a regression equation that clearly indicates how each of the IVs relates to the DV (when controlling for the others). DA doesn't produce an equation with a very intuitive meaning.
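
    As a rough illustration of the logistic regression route, here is a sketch that fits the model by Newton-Raphson in plain NumPy. Everything about the data is an assumption: three simulated predictors standing in for EEG power, TMT score, and physical activity (the last one deliberately skewed), with made-up group differences. In practice a library routine would normally be used instead of a hand-written fit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 19 patients (y = 1) and 20 controls (y = 0).
y = np.concatenate([np.ones(19), np.zeros(20)])
shift = 0.8 * y  # assumed group difference, purely illustrative

eeg = rng.normal(size=39) - shift   # roughly normal predictor
tmt = rng.normal(size=39) + shift   # roughly normal predictor
# Skewed predictor: lognormal, scaled down for patients.
activity = rng.lognormal(mean=0.0, sigma=0.75, size=39) * np.exp(-0.5 * y)

def standardize(x):
    """Center and scale so each slope is per standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

X = np.column_stack([
    np.ones(39),            # intercept
    standardize(eeg),
    standardize(tmt),
    standardize(activity),
])

# Newton-Raphson on the logistic log-likelihood (equivalent to IRLS).
beta = np.zeros(X.shape[1])
for _ in range(50):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    gradient = X.T @ (y - p)
    hessian = X.T @ (X * (p * (1.0 - p))[:, None])
    step = np.linalg.solve(hessian, gradient)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

# Fitted probabilities and in-sample classification at a 0.5 cutoff.
p = 1.0 / (1.0 + np.exp(-X @ beta))
score = X.T @ (y - p)  # should be ~0 at the maximum likelihood estimate
accuracy = np.mean((p > 0.5) == (y == 1))

print("coefficients (intercept, eeg, tmt, activity):", np.round(beta, 3))
print(f"in-sample accuracy: {accuracy:.2f}")
```

    Nothing in the fit requires the predictors to be normal, which is the point made above. The accuracy printed here is in-sample and therefore optimistic; with only 39 subjects, some form of cross-validation would give a more honest figure.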
    Matt aka CB | twitter.com/matthewmatix

  7. The Following User Says Thank You to CowboyBear For This Useful Post:

    Ping Koo (02-24-2017)

  8. #7
    noetsi

    Re: Some IVs are normally distributed but some aren't, how to deal with it?

    I actually thought all GLM statistical models assumed normality of the residuals. I did not realize this was tied specifically to OLS.

    If the OP uses logistic regression, they would be wise, IMHO, to pay attention to the odds ratios rather than the slopes, the latter being hard for most people to interpret.
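
    For anyone unsure how the two relate: the odds ratio is just the exponentiated slope. A small sketch with a hypothetical intercept and slope (not estimates from any real data) shows that a one-unit increase in the predictor multiplies the odds by the same factor wherever you start.

```python
import math

# Hypothetical fitted model: log-odds of being a patient = intercept + slope * x.
intercept, slope = -1.0, 0.7

def probability(x):
    """Predicted probability from the logistic model."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

odds_ratio = math.exp(slope)  # about 2.01

# The ratio of odds for x + 1 versus x is the same at every starting x.
ratios = []
for x in (0.0, 1.0, 2.0):
    ratio = odds(probability(x + 1.0)) / odds(probability(x))
    ratios.append(ratio)
    print(f"x = {x:.0f} -> {x + 1:.0f}: odds multiplied by {ratio:.4f}")

print(f"exp(slope) = {odds_ratio:.4f}")
```

    The change in probability, by contrast, depends on the starting point, which is part of why the raw slopes are awkward to explain.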
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  9. #8

    Re: Some IVs are normally distributed but some aren't, how to deal with it?

    Quote Originally Posted by noetsi View Post
    Your IV or your DV? If it's your IV, it does not matter at all. For your DV you should look at the residuals, not the DV itself, and see if they are normally distributed. If they are, then there is no problem. Multivariate normality is an assumption about the residuals, not the variables, which is something that seems to confuse people [and, in fairness, many texts talk about variables when they should talk about residuals].

    CWB and I are saying the same thing. Multivariate normality is a property of the residuals only. See if they are normal. If they are, then there is no problem. If they aren't, then you probably should not use a method that assumes normality, although in practice methods such as regression are very robust to this assumption if you have a large sample size. People disagree about what this means in practice. I have seen as few as 30 mentioned, although a hundred plus is better.

    Sample size does matter in practice, if not in theory, because with a larger sample size the central limit theorem means it does not matter much if your data are not normal. 39 would probably work; it would be better to have more, of course.
    Two comments:
    1) A small point (a technicality): the assumption of normality doesn't apply to the residuals, as those are estimates of the errors. The errors are what the assumption applies to. Practically, this is an inconsequential remark about your post (aside from the MVN vs. univariate normality distinction, as CWB mentioned).
    2) With regard to the central limit theorem: the threshold of "sufficiently large" is generally around 30 for underlying distributions that don't deviate very far from normality. The more non-normal the underlying distribution is, the larger "sufficiently large" becomes, often in the thousands (the theorem can even fail to apply in some instances, such as the Cauchy distribution).
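
    The Cauchy case is easy to see in a quick simulation. This is only a sketch (replicate count, sample sizes, and seed are arbitrary choices): it compares how tightly sample means cluster for normal versus Cauchy data as n grows, using the interquartile range of the sample means as the measure of spread.

```python
import numpy as np

rng = np.random.default_rng(2)
n_reps = 5000  # number of simulated samples per setting

def iqr_of_sample_means(draw, n):
    """Interquartile range of n_reps sample means, each from a sample of size n."""
    means = draw((n_reps, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

results = {}
for n in (10, 1000):
    results[("normal", n)] = iqr_of_sample_means(rng.standard_normal, n)
    results[("cauchy", n)] = iqr_of_sample_means(rng.standard_cauchy, n)

for (name, n), iqr in results.items():
    print(f"{name:>6}, n = {n:>4}: IQR of sample means = {iqr:.3f}")
```

    For the normal data the spread of the sample means shrinks roughly like 1/sqrt(n). For the Cauchy data it does not shrink at all, because the mean of n standard Cauchy draws is itself standard Cauchy, so collecting more data never makes the sample mean better behaved.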

  10. #9

    Re: Some IVs are normally distributed but some aren't, how to deal with it?

    Quote Originally Posted by CowboyBear View Post

    Linear discriminant analysis assumes multivariate normality of the independent variables.
    I would only add that MVN is assumed for the independent variables within each group of the DV.
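
    A small sketch of why the "within each group" part matters, using simulated data with a deliberately exaggerated group difference (all numbers are assumptions). Each group is drawn from a normal distribution, yet the pooled sample is bimodal and looks badly non-normal. Note that univariate Shapiro-Wilk tests are used here only as a convenient check; passing them for each variable is necessary for multivariate normality but does not establish it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical predictor: normal within each group, group means far apart.
patients = rng.normal(loc=0.0, scale=1.0, size=19)
controls = rng.normal(loc=6.0, scale=1.0, size=20)
pooled = np.concatenate([patients, controls])

checks = {}
for label, values in [("patients", patients),
                      ("controls", controls),
                      ("pooled", pooled)]:
    w_stat, p_value = stats.shapiro(values)
    checks[label] = (w_stat, p_value)
    print(f"{label:>8}: W = {w_stat:.3f}, p = {p_value:.4f}")
```

    Testing the pooled predictor would flag a "violation" that has nothing to do with the LDA assumption, so the check should be run separately for patients and controls.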

  11. The Following User Says Thank You to ondansetron For This Useful Post:

    CowboyBear (02-23-2017)

  12. #10
    noetsi

    Re: Some IVs are normally distributed but some aren't, how to deal with it?

    Quote Originally Posted by ondansetron View Post
    Two comments:
    1) A small point (a technicality): the assumption of normality doesn't apply to the residuals, as those are estimates of the errors. The errors are what the assumption applies to. Practically, this is an inconsequential remark about your post (aside from the MVN vs. univariate normality distinction, as CWB mentioned).
    2) With regard to the central limit theorem: the threshold of "sufficiently large" is generally around 30 for underlying distributions that don't deviate very far from normality. The more non-normal the underlying distribution is, the larger "sufficiently large" becomes, often in the thousands (the theorem can even fail to apply in some instances, such as the Cauchy distribution).
    It's the residuals you actually look at. Since the errors are not actually known in practice, you can never analyze them. But in theory I am sure you are right.

    I am not sure the central limit theorem has a theoretical point at which it applies. The more data you have, the better.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  13. #11

    Re: Some IVs are normally distributed but some aren't, how to deal with it?


    Quote Originally Posted by noetsi View Post
    It's the residuals you actually look at. Since the errors are not actually known in practice, you can never analyze them. But in theory I am sure you are right.
    Right, the residuals are our observable estimates of the errors, which is why I mentioned that they (the residuals) are an estimate. It's similar to when we claim in H0 that mu = 60 but use the sample mean of 58 to assess the claim: our assumption pertains to mu, and it isn't saying that x-bar is 60 (but we look at x-bar to assess the "reasonableness" of the assumption pertaining to mu). Overall, my point was a small one: the theory applies to the true and often unknown values (the errors), rather than to the residuals themselves (the observable estimates). I was trying to provide clarity in case someone stumbles upon this while studying for an exam that might test the assumptions.
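
    The distinction can be seen directly in a simulation, where (unlike with real data) the true errors are known. This is a sketch with a made-up intercept, slope, and sample size; it shows that the residuals are a projection of the errors rather than the errors themselves, while still tracking them closely.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 39

# Simulated data, so the true errors are available for comparison.
x = rng.normal(size=n)
errors = rng.normal(scale=1.0, size=n)
y = 2.0 + 0.5 * x + errors  # hypothetical true intercept and slope

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

# Residuals equal (I - H) @ errors, where H is the hat matrix.
H = X @ np.linalg.inv(X.T @ X) @ X.T
projected_errors = (np.eye(n) - H) @ errors
correlation = np.corrcoef(errors, residuals)[0, 1]

print("residuals == (I - H) @ errors:", np.allclose(residuals, projected_errors))
print(f"sum of errors:    {errors.sum():.4f}")
print(f"sum of residuals: {residuals.sum():.4f}")
print(f"correlation between errors and residuals: {correlation:.3f}")
```

    The residuals sum to zero by construction (the model has an intercept), whereas the true errors generally do not. The two are nonetheless highly correlated, which is why inspecting residuals is a reasonable stand-in for the unobservable errors.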

    Quote Originally Posted by noetsi View Post
    I am not sure the central limit theorem has a theoretical point at which it applies. The more data you have, the better.
    As far as I know, there isn't a single theoretical cutoff. My comment was in regard to the rule of thumb about the number 30. There is a body of research (simulations, if I recall) suggesting that a sample size near 30 is where many non-normal, but close to normal, distributions fall in line with an approximately normal sampling distribution (and diminishing returns start to set in from a practical standpoint). Other research suggests, though, that highly skewed or odd distributions need sample sizes in the thousands before the sampling distribution of the sample statistic starts to look reasonably normal. I threw in the caveat that despite a large sample size, there are cases where the CLT just doesn't apply (Cauchy). As you were saying, though, in most practical situations "bigger is better" works when sampling (but we should be aware of when that won't be a reasonable fix, and when trying to force normality on the sampling distribution would be a waste of time and resources).
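
    A rough simulation along those lines, as a sketch only (the two distributions, the sample sizes, and the replicate count are arbitrary choices, not taken from the research mentioned above). It measures how skewed the sampling distribution of the mean still is at n = 30 and n = 1000 for a mildly skewed parent (exponential) and a more heavily skewed one (lognormal).

```python
import numpy as np

rng = np.random.default_rng(5)
n_reps = 5000  # number of simulated samples per setting

def skewness(values):
    """Sample skewness: third central moment over variance to the power 1.5."""
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

def draw_exponential(size):
    return rng.exponential(scale=1.0, size=size)

def draw_lognormal(size):
    return rng.lognormal(mean=0.0, sigma=1.0, size=size)

results = {}
for name, draw in [("exponential", draw_exponential),
                   ("lognormal", draw_lognormal)]:
    for n in (30, 1000):
        means = draw((n_reps, n)).mean(axis=1)
        results[(name, n)] = skewness(means)
        print(f"{name:>11}, n = {n:>4}: skewness of sample means = "
              f"{results[(name, n)]:.3f}")
```

    In theory the skewness of the mean of n independent draws is the parent skewness divided by sqrt(n), so a more skewed parent needs a proportionally larger n to reach the same degree of symmetry. A value near zero indicates an approximately symmetric sampling distribution.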
