Correlating A with B, half of subjects have a 0 value for A. Should I exclude them?

#1
[solved] Correlating A with B, half of subjects has a 0 value for A. Should

(Sorry for the double post. I think I posted this in the wrong forum the first time; feel free to move or delete the other thread, it has no replies.)

Hey guys.

I have a basic stats question. I have two variables I'd like to correlate (and eventually run other analyses on).

The first one, "exposition level", is a continuous variable that can range from 0 to *** (in principle, to infinity) and measures how much someone has been exposed to a certain substance (the higher the value, the greater the exposure). The second one, "Test score", is also a continuous variable: a Z-score derived from test performance that usually ranges from -3.00 to +3.00 (although it can go lower or higher, as Z-scores do).

Let's say I want to study the correlation between the level of exposure and the test score. Should I remove all the cases that were never exposed, i.e. all those with a value of 0 on the exposition level? Would it be statistical heresy to exclude them? As you can see in the scatterplot, the distribution is heavily concentrated at 0 for exposition.


Thanks, and pardon my English (feel free to correct me, it helps).

Regards
 
#2

Unless you have good reason to suspect an error or a failed experiment, it's generally not appropriate to exclude observations just because they take on a certain value. The scatterplot is a good first step: it gives you some idea of the relationship before you even calculate a correlation coefficient.
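To see what dropping the zeros would actually change, here is a minimal Python sketch that computes Pearson's r once on all subjects and once on the exposed subjects only. The data are simulated purely for illustration (they are not the poster's data), and the names `exposition`, `test_score` and `pearson_r` are assumptions made for this example.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Simulated, made-up data: half the subjects are unexposed (exposition == 0).
rng = random.Random(0)
exposition = [0.0] * 20 + [rng.uniform(1, 30) for _ in range(20)]
test_score = [0.5 - 0.05 * x + rng.gauss(0, 0.7) for x in exposition]

# Correlation using every subject, zeros included.
r_all = pearson_r(exposition, test_score)

# Correlation after excluding the unexposed subjects.
exposed_pairs = [(x, y) for x, y in zip(exposition, test_score) if x > 0]
r_exposed = pearson_r([x for x, _ in exposed_pairs],
                      [y for _, y in exposed_pairs])

print(f"r, all subjects:       {r_all:.3f} (n={len(exposition)})")
print(f"r, exposed subjects:   {r_exposed:.3f} (n={len(exposed_pairs)})")
```

The two coefficients answer different questions: the first describes the association across the whole sample, the second only the dose-response among people who were exposed at all.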
 
#3

OK, thanks for the advice. I was wondering whether all the subjects that were not exposed (i.e. 0 on the X axis) skewed the relationship between the two variables so much that they hid the "real" relationship. In my study, it is entirely expected that almost half the subjects will never have been exposed to the toxin. My hypothesis is that subjects who were never exposed will perform better on the test score variable.

I guess if there is a trend in which they are grouped in the upper left part of the scatterplot (like in this one), that is a good sign. With more data, if my hypothesis is confirmed, the exposed subjects should fall along or near a diagonal running from the upper left to the lower right. So in that sense, I guess you are right that it makes sense to include those subjects.

But if I happen to see a non-linear trend, should I stay away from classic parametric analyses? (What kind of regression can be run on data with an underlying non-linear relationship?)
 
#4

It sounds like you are interested in whether the level of exposure is associated with a given outcome. One way to look at the association would be to calculate the mean, median, or some other summary statistic of the test score for certain levels of exposure, such as 0, >0-5, >5-10, etc., though it looks like you don't have many observations as exposure increases.

You could also create a categorical "dummy" variable in which you define exposure as 0=unexposed or 1=exposed, then include in the regression model this variable, exposition, and an interaction between this variable and exposition (dummy × exposition). In general this allows for two separate slopes and intercepts depending on whether a subject is exposed or unexposed. Here, though, every unexposed subject has exposition = 0, so the interaction term is identical to exposition and can be dropped: the dummy then captures the jump between unexposed and exposed subjects, and the exposition coefficient is the slope among the exposed.

Alternatively, if there is some meaningful way to categorize the test score, you could perform logistic regression to see how exposure affects the odds of receiving a certain score on the test.
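The first two suggestions can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are simulated, the bin edges follow the 0, >0-5, >5-10 example with an extra >10 bin added here, and the helpers `bin_label`, `solve` and `ols` are hypothetical names written for this sketch (a statistics package would normally do the fitting and also give standard errors). The design matrix has three columns (intercept, exposed dummy, exposition); the dummy × exposition product is left out because exposition is 0 for every unexposed subject, so that column would duplicate exposition and make the system singular.

```python
import random
from statistics import mean

def bin_label(x):
    """Assign an exposure value to one of the illustrative bins."""
    if x == 0:
        return "0"
    if x <= 5:
        return ">0-5"
    if x <= 10:
        return ">5-10"
    return ">10"

def solve(a, b):
    """Solve the linear system a @ coef = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) coef = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Simulated, made-up data: half the subjects are unexposed.
rng = random.Random(1)
exposition = [0.0] * 20 + [rng.uniform(0.5, 20) for _ in range(20)]
test_score = [0.6 - 0.4 * (x > 0) - 0.05 * x + rng.gauss(0, 0.5)
              for x in exposition]

# 1) Summary statistics of the test score per exposure bin.
by_bin = {}
for x, y in zip(exposition, test_score):
    by_bin.setdefault(bin_label(x), []).append(y)
for label in ("0", ">0-5", ">5-10", ">10"):
    scores = by_bin.get(label, [])
    if scores:
        print(f"{label:>6}: n={len(scores):2d}, mean score={mean(scores):+.2f}")

# 2) Regression with an exposed/unexposed dummy plus the continuous exposure.
design = [[1.0, 1.0 if x > 0 else 0.0, x] for x in exposition]
intercept, jump, slope = ols(design, test_score)
print(f"unexposed mean (intercept): {intercept:+.3f}")
print(f"jump for being exposed:     {jump:+.3f}")
print(f"slope among the exposed:    {slope:+.3f}")
```

The binned summary needs no model at all but depends on where the cut points are placed; the regression keeps exposition continuous, so no threshold has to be chosen beyond "exposed or not".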
 
#5

Those are excellent suggestions, thank you. At first, I wanted to avoid making groups because I did not want to lose power (and had no theoretical reason to choose one threshold value over another). Thank you very much!