# Thread: High Correlations with Very Different Distributions..?

1. ## Discounting Correlations.. Alternative Descriptives?

I am comparing two methods of computing a similar quantity on the same sample. My goal is to show that the second measure provides richer, more complete data than the first, so that I can justify it as potentially the better measure. My problem is that the two measures are very highly correlated, which could suggest that the traditional measure is nearly the same as the new one.

Can someone give me arguments for why correlations are somewhat inappropriate for comparing these two measures, and suggest what other descriptive statistics would help me demonstrate that the second measure offers significantly more information than the first?

I don't understand how the two measures shown in the two ordinary histograms (and in the last, 3D histogram, which combines the other two) can be so highly correlated (see attachments for pictures). The correlation between the two measures is .92, so the first could be said to explain about 85% of the variance in the second. The second seems to offer a much richer set of values in its distribution: in my sample the first measure yields 135,000 zeros, for which the new method yields a continuous range of values from 0 to about .6.

Obviously the utility of the measures actually lies in their correlation with other variables of interest, but it seems that such a high correlation between these two measures would translate into similar correlations of each with other variables (thus defeating my argument).
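
One way to check that intuition is the constraint that any three pairwise correlations must form a valid (positive semidefinite) correlation matrix. The sketch below computes the range of correlations the new measure could have with a third variable, given its correlation with the old measure. The .92 comes from the post; the .30 correlation between the old measure and an outcome is a made-up value for illustration, and the function and variable names are hypothetical.

```python
import math


def third_variable_bounds(r_xy: float, r_xz: float) -> tuple:
    """Range of corr(y, z) compatible with given corr(x, y) and corr(x, z).

    Follows from requiring the 3x3 correlation matrix of (x, y, z)
    to have a non-negative determinant.
    """
    centre = r_xy * r_xz
    half_width = math.sqrt((1 - r_xy ** 2) * (1 - r_xz ** 2))
    return centre - half_width, centre + half_width


# r_xy = .92 is the old/new correlation from the post.
# r_xz = .30 is an assumed old-measure/outcome correlation (illustrative only).
low, high = third_variable_bounds(0.92, 0.30)
print(f"corr(new, outcome) can lie between {low:.2f} and {high:.2f}")
```

Under these assumptions the new measure's correlation with the outcome could fall anywhere from roughly -0.10 to 0.65, so a .92 correlation between the two measures does not by itself force them to correlate similarly with other variables.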

Thank you very much in advance for any help someone can give.

-Rob

2. Originally Posted by american_rob
My basic question: Why do I get such a high correlation between two measures with such different distributions?
-Rob
It can happen. For example, it can happen if one variable is derived from the other (by a transformation).
Could you provide a scatter plot of those variables?

3. ## Scatter Plots

The scatter plot is attached. The top scatter plot is from the same data I described above, which the earlier graphics were also based on (I believe this is effectively a top-down view of the 3D histogram).

I included the second scatter plot to maybe better demonstrate why this high correlation is a problem for me. This is the same thing as in my other example but with a different sample. Obviously in this example the traditional measure calculates only 0's or 1's, whereas the new measure calculates a continuum of scores. However, these two measures are correlated at .89. I don't really understand how just knowing the 0 or 1 can explain such a large percentage (almost 80%) of the variance in the continuous measure.
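
For a 0/1 variable, the squared correlation equals the share of the continuous measure's variance that lies between the two group means. A sketch with synthetic data (the group sizes and value ranges below are invented, not the data from this thread) shows that when the two groups sit far apart relative to their internal spread, the binary indicator accounts for most of the variance even though it ignores everything within each group:

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(0)

# Hypothetical sample: 700 cases scored 0 and 300 scored 1 by the old measure.
old = [0] * 700 + [1] * 300
# Assumed new-measure values: low for the 0 group, high for the 1 group.
new = ([random.uniform(0.0, 0.3) for _ in range(700)]
       + [random.uniform(0.6, 1.0) for _ in range(300)])

r = pearson(old, new)

# Between-group share of the total variance of the new measure.
grand_mean = sum(new) / len(new)
mean0 = sum(new[:700]) / 700
mean1 = sum(new[700:]) / 300
between = 700 * (mean0 - grand_mean) ** 2 + 300 * (mean1 - grand_mean) ** 2
total = sum((v - grand_mean) ** 2 for v in new)
between_share = between / total

print(f"r = {r:.3f}, r^2 = {r * r:.3f}, between-group share = {between_share:.3f}")
```

The last two printed numbers agree: a high correlation here says the groups are well separated, not that the values within a group are interchangeable.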

Remember my purpose is to show the new measure offers significantly more than the traditional measure. How can I argue that a correlation of the old measure with the new measure does not capture the significance of the new measure? Is there something else I should report instead of a correlation?

Thanks again for any help.

4. The correlation arises because the two measures are related (the obvious answer). It seems the traditional measure (0 and 1) is a derived variable of the new measure,
like
traditional measure = if new measure > 0.5 then 1 else 0;
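
A runnable version of that rule, as a sketch: it assumes (hypothetically) that the new measure is spread evenly over 0 to 1 and uses the 0.5 cutoff from the line above. The variable names are illustrative.

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


n = 10_000
# Assumed new measure: an evenly spaced grid of values strictly between 0 and 1.
new_measure = [(i + 0.5) / n for i in range(n)]
# Traditional measure derived by thresholding the new one at 0.5.
traditional = [1 if v > 0.5 else 0 for v in new_measure]

r = pearson(traditional, new_measure)
print(f"correlation between derived 0/1 variable and its source: {r:.3f}")
```

Even in this simple case the dichotomized variable correlates about .87 with the variable it was derived from, which is in the same range as the .89 reported above.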

And
Originally Posted by american_rob

Remember my purpose is to show the new measure offers significantly more than the traditional measure. How can I argue that a correlation of the old measure with the new measure does not capture the significance of the new measure? Is there something else I should report instead of a correlation?
You can confidently say that the new measure offers significantly more than the traditional measure, because it gives more information:
you can compare almost all of the obs/subjects with one another (with the traditional measure there is no comparison within the 0s or within the 1s).
And that is very important.
Suppose you wanted to pick the top 10% of obs by the measure (I am assuming the percentage of 1s is more than 20%). Using the traditional measure this is not possible; only the new measure helps in this situation.
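
The top-10% point can be sketched as follows. The data are synthetic and the 0.5 cutoff used to build the 0/1 measure is an assumption for illustration only.

```python
import random

random.seed(1)

n = 1000
# Hypothetical continuous scores and a 0/1 measure derived from them.
new = [random.random() for _ in range(n)]
old = [1 if v > 0.5 else 0 for v in new]

k = n // 10  # size of the top 10%

# The continuous measure gives a well-defined top-k set.
top_by_new = sorted(range(n), key=lambda i: new[i], reverse=True)[:k]

# The 0/1 measure only says which cases scored 1; they are all tied.
tied_at_one = sum(old)

print(f"top {k} by new measure: cutoff score {min(new[i] for i in top_by_new):.3f}")
print(f"cases tied at 1 on old measure: {tied_at_one} (cannot be narrowed to {k})")
```

Every case selected by the new measure also scores 1 on the old one, but the old measure offers no way to choose which of its tied cases belong in the top 10%.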

Regards
Vinu CT aka Vinux

5. ## Old vs. New

Vinux - thanks for the reply. Someone else I spoke to in person recommended something similar and suggested that if I dropped the 0's the correlation might be much lower.
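
Whether dropping the zeros lowers the correlation depends on the data, but the effect is easy to see in a synthetic example. The sketch below invents a sample (none of these numbers come from the thread) in which 60% of cases score 0 on the old measure and low on the new one, while the remaining cases track each other with some noise.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(2)
old, new = [], []

# Assumed zero group: old measure is exactly 0, new measure is small.
for _ in range(600):
    old.append(0.0)
    new.append(random.uniform(0.0, 0.2))

# Assumed non-zero group: new measure equals old measure plus noise.
for _ in range(400):
    o = random.uniform(0.5, 1.0)
    old.append(o)
    new.append(o + random.uniform(-0.2, 0.2))

r_all = pearson(old, new)

# Recompute using only the cases where the old measure is non-zero.
kept = [(o, v) for o, v in zip(old, new) if o > 0]
r_nonzero = pearson([o for o, _ in kept], [v for _, v in kept])

print(f"all cases: r = {r_all:.2f}; zeros dropped: r = {r_nonzero:.2f}")
```

In this made-up sample most of the overall correlation comes from the gap between the zero cluster and everything else, so the correlation among the non-zero cases is noticeably lower.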

You are correct that the two measures are related. Actually, the new measure is calculated by taking into account a complete taxonomy at all levels simultaneously, with a variant of IDF weighting based on information theory, whereas the old measure effectively considered only a single level of the same taxonomy in calculating the cosine similarity of two entities.

Because of this fact I would expect (and even desire) some correlation, but a .92 correlation seems to suggest the new measure may not be especially worthwhile. Would standard deviations or some other descriptive statistics help support my argument that the new measure offers something significant beyond the original? How can I argue that correlations are inappropriate for considering the value of one over the other if a reviewer asks for correlations?
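
One descriptive summary that speaks to this is how much the new measure varies among cases that the old measure scores identically. A sketch, assuming a hypothetical 0/1 old measure and synthetic data (nothing here is from the actual study):

```python
import random
import statistics

random.seed(3)

# Hypothetical sample: continuous new measure, 0/1 old measure.
new = [random.random() for _ in range(1000)]
old = [1 if v > 0.5 else 0 for v in new]

# How many different values does each measure actually take?
distinct_old = len(set(old))
distinct_new = len(set(new))

# Standard deviation of the new measure within each level of the old one.
sd_within = {
    level: statistics.pstdev([v for v, o in zip(new, old) if o == level])
    for level in sorted(set(old))
}

print(f"distinct values: old = {distinct_old}, new = {distinct_new}")
for level, sd in sd_within.items():
    print(f"SD of new measure where old == {level}: {sd:.3f}")
```

A non-trivial within-level standard deviation is variation the old measure cannot see at all, regardless of how high the overall correlation is.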

Thanks again for any help.

-Rob
