Relationship between correlations


I am trying to figure out the relationship between correlations of variables where one of the variables is defined as the difference between two other variables.

I have variables x and z, which are positively correlated. I define a new variable, y = x - z, and find Corr(x,y). What I am interested in is the relationship between Corr(x,z) and Corr(x,y) = Corr(x, x-z).

I have worked out that Cov(x,y) = Cov(x,x-z) = Var(x) - Cov(x,z) (I think this is correct but not sure).

I am struggling with the expression for the correlation. So far I have:

Corr(x,y) = Corr(x, x-z) = [Var(x) - Cov(x,z)] / [sqrt(Var(x)) * sqrt(Var(x) + Var(z) - 2Cov(x,z))].
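The covariance identity and the correlation expression can be checked numerically. The following is a minimal sketch, not part of the original question: the simulated data, the seed, and the helper names `cov` and `corr` are all assumptions chosen for illustration. It computes Corr(x, x-z) once from the moments of x and z and once directly from y = x - z.

```python
import math
import random

random.seed(0)  # arbitrary seed, only for reproducibility

# Hypothetical data: z shares a component with x, so Cov(x, z) > 0.
n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]
z = [0.8 * xi + random.gauss(0, 1.5) for xi in x]
y = [xi - zi for xi, zi in zip(x, z)]


def cov(a, b):
    """Sample covariance with the n - 1 denominator."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def corr(a, b):
    """Sample Pearson correlation."""
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))


var_x, var_z, cov_xz = cov(x, x), cov(z, z), cov(x, z)

# Corr(x, x - z) built only from Var(x), Var(z) and Cov(x, z)
from_formula = (var_x - cov_xz) / (
    math.sqrt(var_x) * math.sqrt(var_x + var_z - 2 * cov_xz)
)
# Corr(x, y) computed directly on y = x - z
direct = corr(x, y)

print(from_formula, direct)
```

Because sample covariance is bilinear, the two numbers should agree up to floating-point error for any data set, not just this simulated one.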

Basically, what I am trying to figure out mathematically is this: if I know that Cov(x,z) is positive, does it necessarily follow that Corr(x, x-z) >= Corr(x,z), or something along those lines?

Thanks for your help!


Cookie Scientist
What can you say about corr(x, x-z)? Not much. Depending on the relative variances of x and z, this correlation could be strongly positive, negative, or anything in between. This is easiest to see if we ignore the denominator of the correlation and just focus on the covariance (beginning with your correct right-hand-side expression):

\[
\operatorname{cov}(X, X-Z) = \sigma_X^2 - \operatorname{cov}(X,Z) = \sigma_X^2 - \rho_{XZ}\sigma_X\sigma_Z = \sigma_X\left(\sigma_X - \rho_{XZ}\sigma_Z\right)
\]

So cov(x, x-z) is equal to the product of \(\sigma_X\) and \(\sigma_X-\rho_{XZ}\sigma_Z\). The term \(\sigma_X\) is necessarily positive, and we have assumed that \(\rho_{XZ}\) is positive as well. But the whole term \(\sigma_X-\rho_{XZ}\sigma_Z\) could still be a big positive number (if \(\sigma_X \gg \sigma_Z\)) or a big negative number (if \(\sigma_Z \gg \sigma_X\)), which would make the covariance (and hence the correlation) positive or negative, respectively. Once the denominator is restored, the correlation approaches 1 in the first case and \(-\rho_{XZ}\) in the second.
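To see the sign change concretely, here is a small sketch that evaluates the population value of corr(x, x-z) for a few standard-deviation ratios. The function name `corr_x_diff` and the parameter values are hypothetical, picked only to illustrate the argument; `rho` stands for \(\rho_{XZ}\).

```python
import math


def corr_x_diff(sd_x, sd_z, rho):
    """Population Corr(X, X - Z) given sd_x, sd_z and rho = Corr(X, Z)."""
    # Numerator: Cov(X, X - Z) = sd_x * (sd_x - rho * sd_z)
    cov = sd_x * (sd_x - rho * sd_z)
    # Var(X - Z) = sd_x^2 + sd_z^2 - 2 * rho * sd_x * sd_z
    var_diff = sd_x**2 + sd_z**2 - 2 * rho * sd_x * sd_z
    return cov / (sd_x * math.sqrt(var_diff))


rho = 0.4  # illustrative positive correlation between X and Z
for sd_x, sd_z in [(10.0, 1.0), (1.0, 1.0), (1.0, 10.0)]:
    print(f"sd_x={sd_x}, sd_z={sd_z}: {corr_x_diff(sd_x, sd_z, rho):.3f}")
```

With the same positive `rho`, the result is close to 1 when `sd_x` dominates and turns negative when `sd_z` dominates, while staying above `-rho`.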
Thanks very much for your help. That made sense, but I am confused about the intuition. Do you have any insight about how to think about this?

To make the motivation clearer I can explain the bit of research I'm doing and why I thought this question was relevant. I have survey data on i) people's estimates of the wealth distribution in the US by quintile (so what share of wealth is owned by the top 20%, next top 20%, etc) and ii) people's 'ideal' wealth distribution by quintile (i.e. how wealth would be distributed by quintiles in their 'perfect world').

Using this data I create estimated and ideal Gini coefficients for each individual (the Gini coefficient is just an index for measuring inequality; it takes a value between 0 and 1, with 0 being perfect equality and 1 perfect inequality). Variable x is estgini, variable z is idealgini. The true wealth distribution Gini of the US is 0.703 (when using the quintiles of the distribution, not the actual distribution).

I create a new variable, y = 0.703 - estgini = 'infogap', a measure of the quality of each individual's information about wealth inequality. I create a new variable, q = estgini - idealgini, i.e. the difference between inequality in what the individual estimates to be reality, and their preferred level of inequality. So essentially a measure of desire for redistribution of wealth.

What I am interested in from a theoretical point of view is Corr(y,q), i.e. is there an association between a person's information about wealth inequality and their desire for wealth redistribution. Corr(y,q) is strongly negative, around -0.5. Corr(x,z) is something like 0.4. So the motivation behind my original question was, to what extent is Corr(y,q) 'driven' by Corr(x,z)? Can I decompose Corr(y,q) into a component driven by Corr(x,z) and another component?
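Given these definitions, Corr(y,q) can be tied back to the quantity in the original question. The following is a sketch, assuming all variances are finite and non-zero; the symbols \(\sigma_x\), \(\sigma_z\) (standard deviations of estgini and idealgini) and \(\rho\) (shorthand for Corr(x,z)) are introduced here for illustration. Since y = 0.703 - x is just x with its sign flipped and a constant added, and q = x - z:

```latex
\begin{aligned}
\operatorname{Cov}(y, q) &= \operatorname{Cov}(0.703 - x,\; x - z)
  = -\operatorname{Cov}(x,\; x - z)
  = -\bigl[\operatorname{Var}(x) - \operatorname{Cov}(x, z)\bigr], \\
\operatorname{Var}(y) &= \operatorname{Var}(x), \\
\operatorname{Corr}(y, q) &= -\operatorname{Corr}(x,\; x - z)
  = -\frac{\sigma_x - \rho\,\sigma_z}
          {\sqrt{\sigma_x^2 + \sigma_z^2 - 2\rho\,\sigma_x\sigma_z}}.
\end{aligned}
```

Under this reading, Corr(y,q) of about -0.5 corresponds to Corr(x, x-z) of about 0.5, and its value depends on Corr(x,z) only together with the ratio of the two standard deviations.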

Sorry for the long read. If you managed to get through all that, please let me know if you have any insight into whether I can make any inferences from Corr(y,q), or what a good strategy for analysis would be.