(Li'l) theoretical question about correlations in regression...

spunky

Smelly poop man with doo doo pants.
#1
hey y'all people!??! how're y'all doooin' huh? i hope everyone had the most awesomest holidays!!! heh.

so... who else is feelin' the pinch this semester? oh man it's my last (course-intensive) semester before finishin' up my master's (just waiting on the thesis heh) so i had to cram in 2 seminars in the stats dept and 2 courses in my home dept of education... in any case, best of luck to everyone.

so last friday i got asked a somewhat interesting question that i haven't been able to quite figure out yet... it goes like this:

let's pretend that we have a regression that looks like \(Y_{1} = \beta_{0}+\beta_{1}X+\beta_{2}Z+\beta_{3}W+\epsilon\). now, as usually happens in these cases, these variables have certain correlations, so \(r_{XY}, r_{XZ}, r_{XW}, r_{YZ}, r_{YW}, ...\) and you know, all of those are not zero.

say that i now have a reduced regression model that looks like \(Y_{2} = \beta_{0}+\beta_{1}X+\beta_{2}Z+\epsilon\), so it's the same as the previous one but without one predictor, \(W\).

the question would then be:

what would be the correlation between the omitted predictor \(W\) and the predicted scores \(\widehat{Y_{2}}\) from the second, reduced model?

i am having a little bit of a hard time because there are a few too many correlations and i think the algebra's gonna get somewhat complicated if i try to sort it out by re-expressing \(\widehat{Y_{2}}\) in terms of its correlation with \(X\) and \(Z\) ...

oh god, i'm really hoping someone knows maybe a smart matrix algebra trick or some relationship (maybe through the reduced model's \(R^{2}\)) to simplify things before i kind of tackle this in full force...
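just to make the setup concrete, here's a quick made-up example of what i mean (all the numbers, coefficients and variable names below are arbitrary, of course):

Code:
set.seed(123)
n <- 500
X <- rnorm(n)
Z <- 0.5*X + rnorm(n)
W <- 0.3*X - 0.4*Z + rnorm(n)
Y <- 1 + 2*X - Z + 0.5*W + rnorm(n)   # world of the "full" model

reduced <- lm(Y ~ X + Z)    # reduced model, W left out
Yhat2   <- fitted(reduced)  # predicted scores from the reduced model

cor(W, Yhat2)               # <-- this is the correlation i'm asking about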

thanks to everyone!
 

BGM

TS Contributor
#2
It seems that the problem is not related to the first regression model. Am I missing something?

Also, do you mean something like \( \hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y} \), assuming you have \( n \) data points? Or are you thinking of some other regression, like total least squares?

I have tried, but have not been successful in simplifying things under these assumptions.
 

spunky

Smelly poop man with doo doo pants.
#3
hello BGM... thank you very much for taking the time to look at this...

let me re-express the question in terms of correlation matrices and see if it makes more sense.

under the first model, where you have \(Y\) as dependent and \(X\), \(Z\), and \(W\) as predictors, you would have a 4 X 4 correlation matrix among all of those, right? now, if you do a regression predicting \(Y\) from \(X\) and \(Z\) (without the \(W\) predictor) and consider \(\widehat{Y}\) as a new variable, you could write up a new 3 X 3 correlation matrix (among \(Y\), \(W\), and \(\widehat{Y}\)) that would have:

the correlation between Y and W (because you already know that from the original 4 X 4 correlation matrix from which we started)

the correlation between Y and \(\widehat{Y}\) (because you can get that from the square root of the \(R^{2}\), which you obtained from the previous regression predicting Y from X and Z)

and then i'd just be missing the correlation between \(\widehat{Y}\) and the predictor \(W\) that was not included in the reduced regression...

... i know there are bounds for that correlation but that's as far as i've got. and, just as you said, at some point my algebra became so extensive and confusing that i'm reaching out for help, hoping someone knows some smart trick or has an insight into what to do...
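and just to tie it back to my made-up example from post #1, the 3 X 3 matrix i'm describing would be:

Code:
# re-using X, Z, W, Y, reduced and Yhat2 from the example in post #1
R2 <- summary(reduced)$r.squared
sqrt(R2)                  # same as cor(Y, Yhat2), the second entry i listed
cor(Y, Yhat2)

cor(cbind(Y, W, Yhat2))   # the (W, Yhat2) entry is the one i can't pin down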
 

Dason

Ambassador to the humans
#4
Y only matters to get the coefficients for the linear model. But Yhat is just a linear combination of X and Z once we have the parameters.

Code:
n <- 10000
X <- rnorm(n, 10, 3)
Z <- X + runif(n, 2, 10)
W <- .2*X - .7*Z + rnorm(n)

a <- .2
b <- 30

# Y doesn't even matter!  yhat is just a linear combination
# of X and Z.  Note I omit an intercept but it doesn't matter for correlation.
yhat <- a*X + b*Z


# cwx stands for the correlation between W and X, etc.; sx, sz are standard deviations
cwx <- cor(W,X)
cwz <- cor(W,Z)
cxz <- cor(X,Z)
sx <- sd(X)
sz <- sd(Z)

# All in terms of correlations and variances of X, Z, W
(a*cwx*sx + b*cwz*sz)/sqrt(a^2*sx^2 + b^2*sz^2 + 2*a*b*cxz*sx*sz)

# Or without the simplified names...
(a*cor(W,X)*sd(X) + b*cor(W,Z)*sd(Z))/sqrt(a^2*var(X) + b^2*var(Z) + 2*a*b*cor(X,Z)*sd(X)*sd(Z))

# and it matches...
cor(W, yhat)
Note that we would need to replace 'a' and 'b' with the coefficients from the linear model Y ~ b0 + b1*x + b2*z...

Note that all I did was replace Yhat with a*X + b*Z and then used definitions and rules about manipulating covariances.

Also now that I think about it... it would be a lot easier to do this with matrix manipulations. But I'm too lazy to work that out right now.
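Roughly, though, it would boil down to something like this (a quick sketch re-using the objects from the code block above):

Code:
# same answer via covariance matrices, re-using X, Z, W, a, b, yhat from above
coefs <- c(a, b)                                        # slopes on X and Z
XZ    <- cbind(X, Z)
num   <- t(coefs) %*% cov(XZ, W)                        # Cov(W, yhat) = coefs' Cov((X,Z), W)
den   <- sd(W) * sqrt(t(coefs) %*% cov(XZ) %*% coefs)   # sd(W) * sd(yhat)
num/den                                                 # matches cor(W, yhat)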
 

spunky

Smelly poop man with doo doo pants.
#5
well... that is kind of what i wanted to ask you... so where did you start with the replacement and the covariance rules? like i see it works and i believe you... but how did you get to that final part where the product of correlations and regression coefficients just gets you that correlation... i've spent a good chunk of yesterday trying to tackle this problem and...oh wow!
 

Dason

Ambassador to the humans
#6
I just used a few facts.

1) Cor(X, Y) = Cov(X, Y)/(sd(X)sd(Y)) implies that Cov(X,Y) = sd(X)sd(Y)Cor(X, Y)

2) Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z)

3) Cov(a*X, b*Y) = abCov(X, Y)

Then I wrote Cor(W, Yhat) = Cor(W, a*X + b*Z) = Cov(W, a*X + b*Z)/(sd(W)*sd(a*X + b*Z)). I used (2) and (3) to break up the numerator and then (1) to get it into terms of correlations and standard deviations. I used variance rules to expand out the denominator.
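Spelling it out (with a and b standing for the slopes on X and Z, as in my code above):

\( \operatorname{Cor}(W, \hat{Y}) = \frac{\operatorname{Cov}(W, aX + bZ)}{sd(W)\, sd(aX + bZ)} = \frac{a\operatorname{Cov}(W, X) + b\operatorname{Cov}(W, Z)}{sd(W)\sqrt{a^{2}Var(X) + b^{2}Var(Z) + 2ab\operatorname{Cov}(X, Z)}} \)

\( = \frac{a\, r_{WX}\, sd(X) + b\, r_{WZ}\, sd(Z)}{\sqrt{a^{2}Var(X) + b^{2}Var(Z) + 2ab\, r_{XZ}\, sd(X) sd(Z)}} \)

which is exactly the expression in the code.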
 

Dragan

Super Moderator
#7
spunky said:
well... that is kind of what i wanted to ask you... so where did you start with the replacement and the covariance rules? like i see it works and i believe you... but how did you get to that final part where the product of correlations and regression coefficients just gets you that correlation... i've spent a good chunk of yesterday trying to tackle this problem and...oh wow!

Spunky: Look at Equation (4.7) on page 90 in my book. I think you can get your answer by using that equation and multiplying the result by the value of R for the reduced model.
 

spunky

Smelly poop man with doo doo pants.
#8
i feel so... so... SOOOO ****... never in my life (until now) had i ever considered that cor(x,y)*sd(x)*sd(y) = cov(x,y)..... oh god, i really, really feel the need for a smack-my-forehead emoticon.... lol

@Dragan i think i'll need to go grab your book from the library yet one more time .... at this point i think i'm gonna just add it to my amazon wishlist, hehe...
 

BGM

TS Contributor
#9
Previously I thought that spunky's question meant

\( \hat{Y}_2 = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\beta}_2 Z \)

where \( (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2) \) are some sort of least-squares estimators, which are themselves functions of the data. In that case it seems very complicated.

But if spunky just means \( \hat{Y}_2 = \beta_0 + \beta_1 X + \beta_2 Z \), where \( (\beta_0, \beta_1, \beta_2) \) are the true parameters (constants), then it is much simpler, and just using the identities Dason suggested will be enough :)
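(That said, for the sample correlation itself the distinction doesn't bite: plugging the fitted slopes into Dason's expression reproduces \( \mathrm{cor}(W, \hat{Y}_2) \) exactly, because the fitted values are exactly that linear combination of \( X, Z \). A quick made-up check:)

Code:
set.seed(1)
n <- 1000
X <- rnorm(n)
Z <- 0.6*X + rnorm(n)
W <- 0.4*X - 0.5*Z + rnorm(n)
Y <- 1 + X + Z + 0.5*W + rnorm(n)

fit <- lm(Y ~ X + Z)
b1  <- unname(coef(fit)["X"])
b2  <- unname(coef(fit)["Z"])

# Dason's formula with the estimated slopes plugged in
(b1*cor(W, X)*sd(X) + b2*cor(W, Z)*sd(Z)) /
  sqrt(b1^2*var(X) + b2^2*var(Z) + 2*b1*b2*cor(X, Z)*sd(X)*sd(Z))

# agrees with the direct computation
cor(W, fitted(fit))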
 

Dragan

Super Moderator
#10
BGM said:
Previously I thought that spunky's question meant

\( \hat{Y}_2 = \hat{\beta}_0 + \hat{\beta}_1 X + \hat{\beta}_2 Z \)

where \( (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2) \) are some sort of least-squares estimators, which are themselves functions of the data. In that case it seems very complicated.

But if spunky just means \( \hat{Y}_2 = \beta_0 + \beta_1 X + \beta_2 Z \), where \( (\beta_0, \beta_1, \beta_2) \) are the true parameters (constants), then it is much simpler, and just using the identities Dason suggested will be enough :)

Actually, there's a much easier method for computing the answer to Spunky's question than what Dason suggested, BGM. :)