Correlate a continuous with a binary variable

Itai

New Member
#1
Hey everyone.

I have a simple problem and was hoping to get your advice on it. I am trying to correlate a continuous variable with a binary one. Let's say the binary variable represents a Yes/No option and the continuous one is just a number between 1 and 100.

How should I go about doing that? I don't think the standard correlation formula can be applied here (Pearson correlation).

Any help would be much appreciated. Thanks!
 

Dragan

Super Moderator
#2
Yes, the standard Pearson correlation applies here. What you have is referred to as a Point-biserial correlation. It's just a special case of the usual Pearson formulae. See here for more details:

http://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient
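To make the "special case" claim concrete, here is a minimal sketch using only the Python standard library. The data are made up, and the names `pearson`, `point_biserial`, `answer` and `score` are illustrative assumptions, not anything from the thread. It codes No/Yes as 0/1 and checks that the textbook point-biserial formula gives the same number as the ordinary Pearson formula.

```python
import math
import random
import statistics


def pearson(x, y):
    """Ordinary Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def point_biserial(binary, values):
    """Point-biserial formula for a variable coded 0/1."""
    g1 = [v for b, v in zip(binary, values) if b == 1]
    g0 = [v for b, v in zip(binary, values) if b == 0]
    n1, n0, n = len(g1), len(g0), len(values)
    s_n = statistics.pstdev(values)  # SD of all values, dividing by n
    mean_diff = statistics.fmean(g1) - statistics.fmean(g0)
    return mean_diff / s_n * math.sqrt(n1 * n0 / n**2)


random.seed(0)  # fixed seed so the made-up data are reproducible
answer = [i % 2 for i in range(40)]  # 0 = No, 1 = Yes
score = [50 + 10 * a + random.gauss(0, 15) for a in answer]

r_pearson = pearson(answer, score)
r_pb = point_biserial(answer, score)
print(r_pearson, r_pb)  # the two values agree up to rounding error
```

So there is no separate computation to learn: any routine that computes Pearson's r can be fed the 0/1 column directly.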
 

Itai

New Member
#3
Thanks, I saw that, but what I was not sure about is how to model Yes and No. Should I use 1 and 0? -1 and 1? I thought about using 1 and 0, but then if all of them are 0, the correlation is 0, which I am not sure makes sense (or does it?..:/).

Bonus question: I tried to look up point-biserial correlation in both Papoulis and DeGroot and could not find anything about it in those two books. How come? Where can I find a more formal / academic explanation of it?

Again, thanks a ton!
 

Jake

Cookie Scientist
#4
I, for one, am actually heartened that you could not easily find information about "point-biserial correlation." The reason is that it is exactly the same thing as a two-sample t-test in which you arbitrarily designate the continuous variable as the response variable. It is somewhat traditional to call this procedure a point-biserial correlation when it is not clear which variable represents the IV and which the DV, but that issue makes no difference either to the underlying mathematics or to the substantive interpretation of the test result. In my opinion, using a different name for the same procedure under arbitrary circumstances represents a needlessly confusing multiplication of terminology, and we would probably be better off just forgetting about the less often used term altogether.

If for some reason you are interested in presenting the results of your t-test as a "point-biserial correlation," you can simply run the (equal-variance) t-test as normal and then convert the resulting t to the point-biserial r coefficient using the following formula, where df = n - 2 and r takes the sign of t:
r = t / sqrt(t^2 + df), or equivalently r^2 = t^2/(t^2 + df)
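As a rough check of the t-to-r conversion, the sketch below computes a pooled-variance two-sample t statistic by hand on made-up data and converts it with r = t / sqrt(t^2 + df), which follows from r^2 = t^2/(t^2 + df) with df = n - 2. The variable names and the data are assumptions for illustration only. It uses just the Python standard library.

```python
import math
import random
import statistics


def pearson(x, y):
    """Ordinary Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(1)  # fixed seed, made-up data
group = [i % 2 for i in range(30)]  # binary variable coded 0/1
value = [40 + 8 * g + random.gauss(0, 12) for g in group]

g1 = [v for g, v in zip(group, value) if g == 1]
g0 = [v for g, v in zip(group, value) if g == 0]
n1, n0 = len(g1), len(g0)
df = n1 + n0 - 2

# Pooled (equal-variance) two-sample t statistic, group 1 minus group 0
pooled_var = ((n1 - 1) * statistics.variance(g1)
              + (n0 - 1) * statistics.variance(g0)) / df
t = (statistics.fmean(g1) - statistics.fmean(g0)) / math.sqrt(
    pooled_var * (1 / n1 + 1 / n0))

r_from_t = t / math.sqrt(t**2 + df)  # convert t to r, keeping the sign of t
r_direct = pearson(group, value)     # Pearson r on the 0/1 coding
print(t, r_from_t, r_direct)
```

The agreement holds for the pooled-variance t-test; a Welch (unequal-variance) t statistic would not convert exactly this way.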
 

Itai

New Member
#5
Thanks, Jake. I appreciate your help.

One thing that I still could not understand is the choice of values for the binary variable. Is a choice of 1 and 0 better than -1 and 1? Does it make any difference at all?

Thanks again.
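On the coding question, Pearson's r does not change under a shift or a positive rescaling of either variable, so 0/1 and -1/+1 give the same value as long as Yes gets the larger number in both; swapping which answer gets the larger number only flips the sign. A small sketch on made-up data (all names here are illustrative assumptions):

```python
import math
import random
import statistics


def pearson(x, y):
    """Ordinary Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(2)  # fixed seed, made-up data
said_yes = [i % 3 == 0 for i in range(60)]  # True = Yes, False = No
score = [45 + 12 * y + random.gauss(0, 10) for y in said_yes]

coded_01 = [1 if y else 0 for y in said_yes]        # No = 0,  Yes = 1
coded_pm = [1 if y else -1 for y in said_yes]       # No = -1, Yes = +1
coded_flipped = [-1 if y else 1 for y in said_yes]  # No = +1, Yes = -1

r_01 = pearson(coded_01, score)
r_pm = pearson(coded_pm, score)
r_flipped = pearson(coded_flipped, score)
print(r_01, r_pm, r_flipped)  # first two match; third has the opposite sign
```

If every answer is the same (all 0, say), the binary variable has zero variance and the correlation is undefined rather than 0; the helper above would fail with a division by zero in that case.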
 
#7
What an engaging discussion. Thanks for the information.

I am interested in a partial correlation between a continuous and a binary variable. Say, I want to correlate height and gender, controlling for weight. Can I still use the "traditional" partial correlation formula? (I calculate Pearson using the pcor.test command in R)

thanks!
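Since the point-biserial coefficient is an ordinary Pearson r on a 0/1 coding, the usual first-order partial correlation formula can be applied mechanically. The sketch below, on made-up height/gender/weight data, compares that formula with the equivalent "correlate the residuals" approach. It is a hand-rolled illustration in Python with hypothetical names, not a reproduction of what pcor.test does internally.

```python
import math
import random
import statistics


def pearson(x, y):
    """Ordinary Pearson correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residuals(y, z):
    """Residuals of a simple least-squares regression of y on z."""
    mz, my = statistics.fmean(z), statistics.fmean(y)
    slope = (sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
             / sum((zi - mz) ** 2 for zi in z))
    return [yi - (my + slope * (zi - mz)) for zi, yi in zip(z, y)]


random.seed(3)  # fixed seed, made-up data
gender = [i % 2 for i in range(80)]  # binary variable coded 0/1
weight = [60 + 12 * g + random.gauss(0, 8) for g in gender]
height = [150 + 0.3 * w + 6 * g + random.gauss(0, 5)
          for g, w in zip(gender, weight)]

# First-order partial correlation formula, controlling for weight
r_hg = pearson(height, gender)
r_hw = pearson(height, weight)
r_gw = pearson(gender, weight)
partial_formula = (r_hg - r_hw * r_gw) / math.sqrt(
    (1 - r_hw**2) * (1 - r_gw**2))

# Same quantity: correlate what is left of each variable after removing weight
partial_resid = pearson(residuals(height, weight), residuals(gender, weight))
print(partial_formula, partial_resid)
```

The coefficient itself is well defined with a binary variable; whether the accompanying significance test is appropriate depends on the usual assumptions about the residuals.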
 
#8
hmm, I just realized that my covariate is a nominal variable, not a number.

Overall, I want to find the association between a continuous and a binary variable, controlling for a nominal variable (association between grade and gender, controlling for teacher). Should I use a logistic regression? Thanks.