Thread: Correlating A with B, half of subjects has a 0 value for A. Should I exclude them?

  1. #1

    [solved] Correlating A with B, half of subjects has a 0 value for A. Should




    (Sorry for the double post; I think I posted this in the wrong forum the first time. Feel free to move or delete the other thread; it has no replies.)

    Hey guys.

    I have a basic stats question. I have two variables I'd like to correlate (and eventually run other analyses on).

    The first one, "exposure level", is a continuous variable that can range from 0 to *** (figuratively, to infinity) and measures how much someone has been exposed to a certain substance (the higher the value, the more the person was exposed). The second one, "test score", is also a continuous variable: a Z-score derived from test performance that usually ranges from about -3.0 to +3.0 (although, being a Z-score, it can go lower or higher).

    Let's say I want to study the correlation between exposure level and test score. Should I remove all the cases that were never exposed, i.e. all those with an exposure level of 0? Or would excluding them be statistical heresy? As you can see in the scatterplot, the exposure distribution is heavily skewed toward 0.


    Thanks, and pardon my English (feel free to correct me, it helps).

    Regards
    [Attached image: scatterplot of exposure level (X axis) against test score]
    Last edited by nightale; 07-12-2014 at 03:26 PM.

  2. #2

    Re: Correlating A with B, half of subjects has a 0 value for A. Should I exclude the

    Unless you have good reason to suspect a measurement error or an experiment that went wrong, it's generally not appropriate to exclude observations just because they take a certain value. The scatterplot is a good first step: it gives you some idea of the relationship before you even calculate a correlation coefficient.
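    To make the advice above concrete, here is a minimal sketch that computes Pearson's r on all subjects and, purely as a sensitivity check, on the exposed subjects only. The data values and the helper name `pearson_r` are made up for illustration; they are not the original poster's data.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: half of the subjects have an exposure of 0.
exposure = [0, 0, 0, 0, 0, 2, 5, 8, 12, 20]
z_score = [1.2, 0.8, 1.5, 0.4, 1.0, 0.6, 0.1, -0.5, -0.9, -1.8]

# Correlation using every subject, zeros included.
r_all = pearson_r(exposure, z_score)

# Same statistic on the exposed subjects only, for comparison.
exposed = [(x, y) for x, y in zip(exposure, z_score) if x > 0]
r_exposed = pearson_r([x for x, _ in exposed], [y for _, y in exposed])

print(f"r (all subjects):  {r_all:.3f}")
print(f"r (exposed only):  {r_exposed:.3f}")
```

    Reporting both numbers side by side shows how much the unexposed group drives the coefficient without silently dropping anyone from the main analysis.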

  3. The Following User Says Thank You to Disvengeance For This Useful Post:

    nightale (07-12-2014)

  4. #3


    OK, thanks for the advice. I was wondering whether all the subjects who were not exposed (i.e. 0 on the X axis) skewed the relationship between the two variables so much that they hid the "real" relationship. In my study, it is entirely expected that almost half the subjects will never have been exposed to the toxin. My hypothesis is that subjects who were never exposed will perform better on the test score variable.

    I guess that if the unexposed subjects cluster in the upper left part of the scatterplot (as in this one), that is a good sign. With more data, if my hypothesis is confirmed, the exposed subjects will fall along or near a diagonal running from the upper left to the lower right. So in that sense, I guess you are right that it makes sense to include those subjects.

    But if I happen to see a "non-linear" trend, should I stay away from classic parametric analyses? (What kind of regression can be run on data with an underlying non-linear relationship?)

  5. #4


    It sounds like you are interested in whether the level of exposure is associated with a given outcome. One way to look at the association would be to calculate the mean, median, or some other summary statistic of the test score within ranges of exposure, such as 0, >0-5, >5-10, etc., though it looks like you don't have many observations as exposure increases.

    You could also create a categorical "dummy" variable coded 0 = unexposed and 1 = exposed, and include in the regression model this dummy, exposure, and an interaction between the two (dummy × exposure). Since exposure is always 0 for the unexposed, the interaction term coincides with exposure itself, so in practice the model comes down to the dummy plus exposure: the unexposed group gets its own mean, and the exposed group gets its own intercept and slope.

    Alternatively, if there is some meaningful way to categorize the test score, you could perform logistic regression to see how exposure is associated with the odds of receiving a certain score on the test.
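    The first two suggestions above can be sketched as follows. This is only an illustration on made-up data, with bin edges taken from the ranges mentioned in the post and helper names (`exposure_bin`, `fit_line`) that are hypothetical. Because exposure is 0 for every unexposed subject, the least-squares fit of the dummy-plus-exposure model splits into two parts: the mean of the unexposed group, and an ordinary straight-line fit through the exposed group.

```python
from statistics import mean, median

# Hypothetical data: half of the subjects have an exposure of 0.
exposure = [0, 0, 0, 0, 0, 2, 5, 8, 12, 20]
z_score = [1.2, 0.8, 1.5, 0.4, 1.0, 0.6, 0.1, -0.5, -0.9, -1.8]


def exposure_bin(x):
    """Assign an exposure value to one of the illustrative ranges."""
    if x == 0:
        return "0"
    if x <= 5:
        return ">0-5"
    if x <= 10:
        return ">5-10"
    return ">10"


# 1) Summary statistics of the test score within each exposure range.
groups = {}
for x, y in zip(exposure, z_score):
    groups.setdefault(exposure_bin(x), []).append(y)

for label in ("0", ">0-5", ">5-10", ">10"):
    scores = groups.get(label, [])
    if scores:
        print(f"{label:>6}: n={len(scores)}, "
              f"mean={mean(scores):.2f}, median={median(scores):.2f}")


def fit_line(xs, ys):
    """Least-squares intercept and slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


# 2) Model: score = b0 + b1 * exposed + b2 * exposure,
#    where exposed = 1 if exposure > 0 else 0.
unexposed_scores = [y for x, y in zip(exposure, z_score) if x == 0]
exposed_x = [x for x in exposure if x > 0]
exposed_y = [y for x, y in zip(exposure, z_score) if x > 0]

b0 = mean(unexposed_scores)                          # mean of the unexposed
exposed_intercept, b2 = fit_line(exposed_x, exposed_y)
b1 = exposed_intercept - b0                          # jump at "any exposure"

print(f"b0 (unexposed mean) = {b0:.3f}")
print(f"b1 (exposed offset) = {b1:.3f}")
print(f"b2 (slope per unit of exposure) = {b2:.3f}")
```

    Here `b1` captures the difference between being unexposed and being exposed at all, while `b2` captures the dose-response trend among the exposed, which are the two questions raised earlier in the thread.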

  6. The Following User Says Thank You to Disvengeance For This Useful Post:

    nightale (07-12-2014)

  7. #5



    Those are excellent suggestions, thank you. At first, I wanted to avoid making groups because I did not want to lose power (and had no theoretical reason to choose one threshold value over another). Thank you very much!
