x=[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5];

y=[0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 1 1];

As x increases, the percentage of 1's in y increases.

Raw data perspective:

A point-biserial correlation on this raw data gives about .68, with a significant p-value. If I copy x and y 100 times (more data), I get the same correlation with an even smaller p-value.

Why is the correlation so low (why isn't it 1)? There is clearly a strong relationship between the success rate and x. Here is another way to look at it...
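A quick sketch of the raw-data calculation (Python assumed here, even though the snippets above look like MATLAB; when y is 0/1, the point-biserial r is the same as the ordinary Pearson r):

```python
import numpy as np
from scipy.stats import pearsonr  # Pearson r = point-biserial r for a 0/1 variable

# Raw data: five Bernoulli trials at each x level 0..5
x = np.repeat(np.arange(6), 5)
y = np.array([0,0,0,0,0, 0,0,0,0,1, 0,0,0,1,1,
              0,0,1,1,1, 0,1,1,1,1, 1,1,1,1,1])

r, p = pearsonr(x, y)
print(r, p)  # r is about 0.68, p well below .05

# Replicating the data 100 times leaves r unchanged but shrinks the p-value,
# since r depends only on the means/variances while p depends on n
r100, p100 = pearsonr(np.tile(x, 100), np.tile(y, 100))
print(r100, p100)
```

The r can't reach 1 here because the individual y values are 0s and 1s: they can never fall exactly on a straight line in x, so some "scatter" around the trend is built into the raw encoding.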

"Summary means" perspective:

The data is:

x=[0 1 2 3 4 5];

y_mean=[0 .2 .4 .6 .8 1];

I get a huge correlation (in fact exactly 1 here, since the means fall exactly on the line y = 0.2x).
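The same check on the summary means (again a Python sketch):

```python
import numpy as np

x_levels = np.arange(6)                       # x = 0..5
y_mean = np.array([0, .2, .4, .6, .8, 1.0])   # observed success rate per level

# The means lie exactly on y = 0.2*x, so their Pearson correlation is 1
r = np.corrcoef(x_levels, y_mean)[0, 1]
print(r)
```

This is the heart of the discrepancy: averaging within each x level removes the Bernoulli noise, so the means track the trend perfectly even though the raw 0/1 points cannot.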

Using the means computed from the 30 raw points (only 5 trials behind each mean), the confidence intervals around each mean are large, and I wouldn't necessarily conclude the correlation is significant.

However, using the 3000 points (from repeating the data 100 times), I get much smaller confidence intervals.

People warn me against using summary means, but can I ever use them?