# Puzzled by Correlation

#### idealliu

##### New Member
This is a real problem although I'm using a made-up statistic (# of apples) for this.

Say I am interested in how many apples people eat in a year. This varies by many factors. First I look at the country difference (the US vs. Canada). From historical data of the average number of apples consumed every year, the correlation between these two countries is, say 0.7.

Then I want to look more closely, this time also by age. Splitting each country into older (>=65) and younger (<65) groups, I found that the correlation between the two age groups is 0.8 in the US and 0.9 in Canada.

Now I compare the younger group in the US to the older group in Canada. Without running through the historical data again, is there a way to calculate the correlation between these two sub-groups based on the above correlation coefficients?

I need this because there could be other variables I need to factor in, so a comprehensive correlation matrix can become very large, and there is not enough data to support certain variables. If I can use a formula to derive the correlations as I go, it'll be much more efficient.

Thank you.

#### Jake

##### Cookie Scientist
I think first you need to clarify what you mean by "the correlation between these two countries is 0.7." This doesn't seem to make sense. If we have just 2 values -- number of apples per year for the US, and number of apples per year for Canada -- we can't compute a correlation. What would such a correlation even mean in this case? Maybe you mean something like a correlation across time (so that when apple consumption is high in the US, it also tends to be high in Canada)? But again, note that we require more than 2 observations to compute this correlation: specifically, we would require repeated observations of apple consumption over time for both countries. Basically, please clarify.

#### idealliu

##### New Member
Hi, Jake.

Your guess is right. The correlation is meant to be the correlation coefficient calculated from two sets of numbers: the historical annual apple consumption in each country.

Apples eaten per person

| Year | US | Canada |
|------|-----|--------|
| 1950 | 150 | 151 |
| 1951 | 152 | 149 |
| 1952 | 153 | 152 |

....
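For reference, the coefficient described here can be computed directly from two yearly series. The following is a minimal sketch in Python that uses only the three rows shown above. The remaining years are elided, so the result is not the 0.7 from the original question. The function name `pearson` is an illustrative choice, not something from the thread.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need two series of equal length with at least 3 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of the deviations from the mean.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

# Only the three years quoted in the post (1950-1952).
us = [150, 152, 153]
canada = [151, 149, 152]

r = pearson(us, canada)
print(f"correlation over the three quoted years: {r:.3f}")
```

With so few points the estimate is very unstable, which is why a long run of yearly observations is needed for both countries.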

#### noetsi

##### Fortran must die
Personally, it seems to me that you want to know how some predictor, say country, predicts apple sales in this made-up example. You are really looking at how apple sales vary with country (and age, etc.). In regression and other methods this is referred to as explained variation in the dependent variable, which is assumed to be related to the predictor (although normally you cannot prove it).

Regression allows you to find out how much of the variation (essentially the correlation as you mean it) in the predicted variable you can "explain" with a predictor variable while controlling for other factors. So you can tell how much impact nation of origin has on apple sales, controlling for age, gender, etc. Essentially, you do this by ignoring variation in apple sales that more than one variable can explain (where the variation between predictors overlaps). Only the variation in apple sales that can be associated with, say, nation of origin and is not explained by any other variable is considered in stating the impact of nation on apple sales.

That is a much-simplified explanation of what regression does. Regression or ANOVA (or a related method) seems more useful for what you are doing than simple pairwise correlation studies.

#### Jake

##### Cookie Scientist
Okay, I think I understand the situation. I think that there is not enough information to solve for the correlation that you mentioned.

Certainly if taken exactly as stated there is not enough information. If however we assume that we have some additional basic information -- specifically, that we know the variances of the 4 sub-groups, and the proportion of Canadians that are young, and the proportion of Americans that are old -- then the only further information that would be needed is about the other correlations among the 4 sub-groups. With 4 groups there are 6 intercorrelations, i.e., 6 unknowns in the system. In the statement of your problem you assume that we know 3 of these correlations. And I think we can add another constraint on the system by requiring the determinant of the correlation matrix to be nonnegative. So that is 4 equations and 6 unknowns. So it seems that we need more information. I wonder if others agree with this informal analysis.
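To illustrate why a nonnegative-determinant constraint narrows the answer without fixing it, here is a minimal sketch for the simpler three-variable case, which is an assumption made for illustration and not the four-group system above. For three hypothetical series A, B and C, if corr(A, B) and corr(B, C) are known, requiring the 3x3 correlation matrix to have a nonnegative determinant confines corr(A, C) to an interval. The function name `third_correlation_bounds` is made up for this example, and 0.8 and 0.9 are used only as sample inputs. In the thread's problem those two correlations do not share a group, so this bound does not apply to it directly.

```python
import math

def third_correlation_bounds(r_ab, r_bc):
    """Range of corr(A, C) consistent with the given corr(A, B) and corr(B, C).

    Follows from requiring det of the 3x3 correlation matrix to be >= 0:
    1 + 2*r_ab*r_bc*r_ac - r_ab**2 - r_bc**2 - r_ac**2 >= 0.
    """
    for r in (r_ab, r_bc):
        if not -1.0 <= r <= 1.0:
            raise ValueError("correlations must lie in [-1, 1]")
    centre = r_ab * r_bc
    half_width = math.sqrt((1 - r_ab ** 2) * (1 - r_bc ** 2))
    return centre - half_width, centre + half_width

# Sample inputs only; A, B, C are hypothetical series.
low, high = third_correlation_bounds(0.8, 0.9)
print(f"corr(A, C) must lie between {low:.3f} and {high:.3f}")
```

Even with two fairly strong known correlations, the admissible interval is wide (roughly 0.46 to 0.98 here), which is consistent with the conclusion that more information is needed to pin down a single value.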

#### idealliu

##### New Member
Thank you, guys. It seems that this is a bit more complicated than I originally thought.