correlation: "raw binary outcome" versus "summary means + confidence intervals"

#1
My outcome is binary (success/fail) and I'm trying to correlate it with a continuous variable x. Part of my confusion is the distinction between correlating x with individual successes versus correlating x with the success *rate*. Here is the problem I run into. Suppose my data were:

x=[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5];
y=[0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1];

As x increases, the percentage of 1's in y increases.

Raw data perspective:
A biserial correlation of this raw data gives something around .68, with a significant p-value. If I copy x and y 100 times (more data), I get the same correlation with an even smaller p-value.
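
For reference, here is a minimal MATLAB/Octave sketch of that computation (my illustration, not verbatim from the analysis; with a 0/1 variable, the Pearson r that corrcoef returns is the point-biserial, and it lands right around .68 for these data):

% Raw data: 5 binary trials at each of 6 x levels
x = [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5];
y = [0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1];

[R, P] = corrcoef(x, y);          % Pearson r of x with 0/1 y (point-biserial)
fprintf('n = 30:   r = %.3f, p = %.5f\n', R(1,2), P(1,2));

% Copying the data 100 times leaves r unchanged but shrinks the p-value
xBig = repmat(x, 1, 100);
yBig = repmat(y, 1, 100);
[Rb, Pb] = corrcoef(xBig, yBig);
fprintf('n = 3000: r = %.3f, p = %g\n', Rb(1,2), Pb(1,2));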

Why is the correlation so low (why isn't it 1)? There is a strong correlation between success rate and x. Here is another way to look at this...

"Summary means" perspective:
The data is:
x=[0 1 2 3 4 5];
y_mean=[0 .2 .4 .6 .8 1];
I get a huge correlation (nearly 1).
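
As a sketch (assuming the raw x and y vectors from above are still in the workspace), the same collapse to per-level success rates takes only a couple of lines:

% Collapse the raw data to one success rate per x level
xs = 0:5;
ym = mean(reshape(y, 5, 6));      % column means: [0 .2 .4 .6 .8 1]
Rs = corrcoef(xs, ym);
fprintf('summary-means r = %.3f\n', Rs(1,2));   % essentially 1 for these means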

Using the 30 points from the raw data, the confidence intervals are large and I wouldn't necessarily think the correlation is significant.
However, using the 3000 points (from repeating the data) I get much smaller confidence intervals.
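
A rough back-of-the-envelope sketch of those interval widths, using the normal approximation for a binomial proportion (my own illustration; the approximation is crude with only 5 trials per level and degenerates at rates of 0 or 1):

% 95% CI half-widths for each success rate, before and after copying
p      = [0 .2 .4 .6 .8 1];                 % success rate at each x level
hw30   = 1.96 * sqrt(p .* (1 - p) / 5);     % 5 trials per level   (n = 30)
hw3000 = 1.96 * sqrt(p .* (1 - p) / 500);   % 500 trials per level (n = 3000)
[hw30; hw3000]    % half-widths shrink by a factor of sqrt(100) = 10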

People warn me against using summary means, but can I ever use them?
 

spunky

Doesn't actually exist
#3
Why is the correlation so low (why isn't it 1)? There is a strong correlation between success rate and x.
please, do keep in mind that the biserial correlation assumes that your binary-coded y variable (in this particular case) reflects a latent normal density, whose latent proportions are obtained from the proportions you observe. the real coefficient is calculated as the relationship between your numerical x and that latent, normally-distributed y. so, in your case, even though you have an even split of 0-coded and 1-coded observations and you made the success rate increase perfectly with x, that does not necessarily translate into the latent bivariate normal density also reflecting a perfect correlation.
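
to make that concrete: the textbook conversion from the point-biserial to the biserial under the latent-normal assumption is r_b = r_pb * sqrt(p*q) / phi(z), where z is the normal cut point at proportion p and phi is the standard normal density. a quick sketch with the numbers from this thread plugged in (my arithmetic, not from the original post):

% biserial r from point-biserial r under the latent-normal assumption
rpb = 0.683;                 % point-biserial r from the raw data
p = 0.5;  q = 1 - p;         % observed proportions of 1s and 0s
z = 0;                       % normal cut point at p = 0.5 (norminv(0.5) = 0)
phi = exp(-z^2/2) / sqrt(2*pi);    % standard normal density at the cut
rb = rpb * sqrt(p*q) / phi         % about 0.86 -- closer to 1, but not 1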

an interesting article i usually recommend to people who want to look a little more into the theory behind biserial correlations is: Kraemer, H. (1981). Modified biserial correlation coefficients. Psychometrika, 46, 275-282.


If I copy x and y 100 times (more data), I get the same correlation with an even smaller p-value.
well, that's kind of to be expected. it's the same data over and over again, so the relationship is the same; however, because you're increasing your sample size, you're reducing the standard error of your estimate and, hence, you obtain a smaller p-value.
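
you can see this directly from the usual t test for a correlation coefficient, t = r * sqrt((n-2)/(1-r^2)): with r held fixed, the statistic (and hence the p-value) is driven entirely by n. a quick sketch (this one assumes tcdf from the Statistics Toolbox):

r = 0.683;                                % same correlation in both cases
for n = [30 3000]
    t = r * sqrt((n - 2) / (1 - r^2));    % test statistic grows with n
    p = 2 * (1 - tcdf(t, n - 2));         % two-sided p-value
    fprintf('n = %4d: t = %6.2f, p = %g\n', n, t, p);
end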


Using the 30 points from the raw data, the confidence intervals are large and I wouldn't necessarily think the correlation is significant.
However, using the 3000 points (from repeating the data) I get much smaller confidence intervals.
same as the last response... larger sample size = smaller standard error of the estimate = narrower confidence intervals.

x=[0 1 2 3 4 5];
y_mean=[0 .2 .4 .6 .8 1];
I get a huge correlation (nearly 1).
well, by collapsing to the six means you're artificially stripping out the within-level variability, leaving a pattern that's quite convenient in terms of its linearity, so such a high correlation is more a statistical artifact than a true correlation coefficient.


People warn me against using summary means, but can I ever use them?
in all my years both studying and doing stats, i have never seen anyone in the literature recommend this, because you risk falling into an ecological fallacy. i've only seen people use summary means either for purely descriptive purposes or when they don't know any other way around them (and even then they acknowledge that the first drawback of their analysis is the use of summarised data instead of the complete data).
 
#4
I haven't deciphered that article yet, but it is by far the most relevant thing I've seen. And yes, the observations I shared about copying the sample points are to be expected - they were included for the sake of clarity.

Thank you!!!!!