Hypothesis testing when R² = 1

yosi

New Member
#1
Dear all,

Since the R² value of my regression model is exactly one, I can conduct neither a t-test nor an F-test. Is this a problem with regard to hypothesis testing? I am asking since I cannot "prove" via the above-mentioned tests that the coefficients are statistically significant...

Many thanks for taking the time to post an answer!


Regards
Yosi
 

Karabiner

TS Contributor
#2
What does your model look like (how many and what kind of variables are in the model), and how large is your sample size?

With kind regards

K.
 

hlsmith

Omega Contributor
#3
I have never seen this before, but I can imagine it happening under the right circumstances.

Is one of your predictors just a proxy for, or another form of, the dependent variable?
 

yosi

New Member
#4
My model looks like this:

Y=b1*x1+b2*x2+b3*x3

The values of Y are defined as the sum of the values of the predictor variables, so it's no surprise that R² equals 1. With the help of the regression analysis I tried to identify the predictor variable with the strongest influence on Y.

The sample size is n=15000.
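
Purely as an illustration, a minimal sketch with simulated stand-in data (the uniform draws and seed are assumptions): when Y is defined as the exact sum of the predictors, OLS recovers coefficients of 1 and an R² of exactly 1, with zero residuals.

Code:
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the real data: Y is the exact sum of the predictors.
n = 15000
X = rng.uniform(0, 10, size=(n, 3))
y = X.sum(axis=1)

# OLS without an intercept, matching the posted model Y = b1*x1 + b2*x2 + b3*x3.
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [1. 1. 1.]

# R^2 = 1 - SSE/SST; the residuals are (numerically) zero, so R^2 is 1.
resid = y - X @ beta
print(1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean())))  # ~1.0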
 

hlsmith

Omega Contributor
#5
Wouldn't the variable contributing most to the sum be the strongest predictor? Can you provide some context, so we can understand why you are doing this?

Could you calculate some type of average contribution based on each predictor's proportion of the sum, then test whether the proportions differ?
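
A sketch of this calculation in Python, using the three example observations yosi gives below in #6 (purely illustrative): compute each predictor's share of Y per observation, then average the shares across observations.

Code:
import numpy as np

# Example observations from post #6: columns are x1, x2, x3; Y is their sum.
X = np.array([[5.0, 3.0, 2.0],
              [2.0, 2.0, 8.0],
              [5.0, 1.0, 9.0]])
y = X.sum(axis=1)

# Each predictor's proportional contribution to Y, per observation.
shares = X / y[:, None]

# Average proportional contribution of each predictor.
print(shares.mean(axis=0))  # approximately [0.33, 0.18, 0.49]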
 

yosi

New Member
#6
To give an example (3 out of 15000 observations):

Y; x1; x2; x3
10; 5; 3; 2
12; 2; 2; 8
15; 5; 1; 9

The question is: which predictor is the strongest/weakest when all observations are taken into account? I think a regression analysis is a suitable tool for addressing this question. Please correct me if I am wrong.

Additional remark: Of course it would be necessary to compare the standardized coefficients since the unstandardized coefficients are all equal to one.
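
Incidentally, with all unstandardized coefficients equal to 1, the standardized coefficient of predictor xj reduces to sd(xj)/sd(Y), so comparing standardized coefficients here amounts to comparing the predictors' standard deviations. A minimal sketch using the three example rows above (purely illustrative with n = 3):

Code:
import numpy as np

# The three example observations from this post: columns are x1, x2, x3.
X = np.array([[5.0, 3.0, 2.0],
              [2.0, 2.0, 8.0],
              [5.0, 1.0, 9.0]])
y = X.sum(axis=1)  # Y = 10, 12, 15

# With b_j = 1 for all j, the standardized coefficient is sd(x_j) / sd(y).
print(X.std(axis=0, ddof=1) / y.std(ddof=1))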
 

rogojel

TS Contributor
#7
hi,
how about dividing the three variables by Y to get the percentage contributions and doing an ANOVA, or an appropriate non-parametric test? This does not look like a regression problem to me.

regards
rogojel
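
A sketch of that test (simulated data as a stand-in; note the three shares within an observation sum to 1, so they are not independent samples, and a related-samples test such as Friedman's is arguably more defensible than a plain one-way ANOVA):

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated stand-in: Y is the row-wise sum of the three predictors.
X = rng.uniform(0, 10, size=(15000, 3))
y = X.sum(axis=1)
shares = X / y[:, None]  # percentage contributions per observation

# One-way ANOVA on the percentage contributions...
print(stats.f_oneway(shares[:, 0], shares[:, 1], shares[:, 2]))

# ...or a non-parametric test for related samples.
print(stats.friedmanchisquare(shares[:, 0], shares[:, 1], shares[:, 2]))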
 

hlsmith

Omega Contributor
#8
By strongest influence you just mean the greatest contribution, correct? As stated before, why can't you just determine the average proportional contribution (percentage) of each predictor?
 

noetsi

Fortran must die
#9
I would imagine Y is a perfect linear combination of your X in this case, which is why your model won't run. I am not sure why you would be interested in testing a model where Y is a combination of the X by definition. You know the X drive Y, and that nothing but these specific X influences it - you have defined it that way. So there is no point in testing this.

You could probably simulate changing the value of one X while holding the other X constant, and then find out from the simulation which variable has the greatest influence.
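
A sketch of that simulation (simulated data as a stand-in): perturb one predictor by one unit while holding the others fixed and recompute Y. Because Y is defined as the plain sum, the change in Y is exactly one unit no matter which predictor is perturbed, i.e. per unit change every X has the same influence.

Code:
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(15000, 3))

def y_of(X):
    # Y is defined as the plain sum of the predictors.
    return X.sum(axis=1)

# Perturb each predictor in turn by one unit, holding the others constant.
for j in range(3):
    X_pert = X.copy()
    X_pert[:, j] += 1.0
    print(j, np.mean(y_of(X_pert) - y_of(X)))  # exactly 1.0 for every j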
 
#10
I would imagine Y is a perfect linear combination of your X in this case, which is why your model won't run. I am not sure why you would be interested in testing a model where Y is a combination of the X by definition. You know the X drive Y, and that nothing but these specific X influences it - you have defined it that way. So there is no point in testing this.

You could probably simulate changing the value of one X while holding the other X constant, and then find out from the simulation which variable has the greatest influence.
noetsi is absolutely right. Regression is used only when the function relating Y to the X's is not known. In your case, you know the exact contribution of each predictor X. To compare the contributions of predictors X1 and X2 you can compare Corr(Y,X1)^2 and Corr(Y,X2)^2. This approach compares the amount of variation in Y explained by X1 to that explained by X2.
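
A sketch of that comparison (simulated data as a stand-in; np.corrcoef returns the Pearson correlation matrix):

Code:
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(15000, 3))
y = X.sum(axis=1)

# Squared Pearson correlation of Y with each predictor separately.
for j in range(3):
    r = np.corrcoef(y, X[:, j])[0, 1]
    print(j, r ** 2)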
 

noetsi

Fortran must die
#11
noetsi is absolutely right. Regression is used only when the function relating Y to the X's is not known. In your case, you know the exact contribution of each predictor X. To compare the contributions of predictors X1 and X2 you can compare Corr(Y,X1)^2 and Corr(Y,X2)^2. This approach compares the amount of variation in Y explained by X1 to that explained by X2.
That is a simpler, and better, approach than the one I suggested. One problem is that the multiple regression probably won't run, and unless the X are unrelated to each other, running bivariate correlations might distort the true impact of any individual predictor.
 

hlsmith

Omega Contributor
#12
I think noetsi may be trying to say that breaking it apart this way may not address possible collinearity between predictors. It may not seem like it, but two of the covariates may move together. E.g., say when A is bigger, so is B, while C is smaller. We don't know the context of your problem, so we can only point out plausible theoretical issues.
 

yosi

New Member
#14
Thanks for this lively discussion; I really appreciate your contributions. I understand that there are alternative - maybe even better - ways to identify the contribution of each predictor. Nevertheless, I still think that a regression analysis is not necessarily the wrong tool.

"I would imagine Y is a perfect linear combination of your X in this case which is why your model won't run."

I know that this is most unusual, but why is it a (statistical) problem? R² = 1 is an accepted value.

"Regression is used only when the function relating Y to X's is not known."

Only then? Why? I am interested in the amount of contribution of the X's - considering all observations. Why is a comparison of the standardized coefficients a wrong approach?
 

rogojel

TS Contributor
#15
hi,
just my five cents:
your equation is precisely Y = x1 + x2 + x3, with all coefficients equal to 1 and no random error term. Least squares is just not the right mathematical model, imho. From the equation's point of view, all terms have the same contribution.

Now, in reality it could happen that x1 is generally higher than x3, for instance. This would be a simple ANOVA-type question and has nothing to do with the fact that the three x's are bound together in such an equation.

regards
rogojel
 

noetsi

Fortran must die
#16
I believe (it has been several years since I read this) that when Y is a perfect linear combination of the X there is perfect (not merely high) collinearity. It is impossible to generate a unique solution when this occurs, so the software cannot produce an answer.
 

Dason

Ambassador to the humans
#17
I believe (it has been several years since I read this) that when Y is a perfect linear combination of the X there is perfect (not merely high) collinearity. It is impossible to generate a unique solution when this occurs, so the software cannot produce an answer.
Incorrect. You can fit the model but you can't do inference.
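
To see concretely why inference breaks, a sketch with simulated data: the residual sum of squares is zero, so the usual estimate of the error variance is zero, the standard errors vanish, and the t statistics require division by zero (in floating point you will see absurdly large values rather than a clean answer).

Code:
import numpy as np

rng = np.random.default_rng(0)
n, p = 15000, 3
X = rng.uniform(0, 10, size=(n, p))
y = X.sum(axis=1)  # Y is the exact sum of the predictors

beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Estimated error variance SSE / (n - p): zero in exact arithmetic.
sigma2 = (resid @ resid) / (n - p)
print(sigma2)  # numerically ~0

# Standard errors and t statistics collapse when sigma2 = 0.
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
print(beta / se)  # division by (near) zero: astronomically large t values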
 

yosi

New Member
#18
hi,
just my five cents:
your equation is precisely Y = x1 + x2 + x3, with all coefficients equal to 1 and no random error term. Least squares is just not the right mathematical model, imho. From the equation's point of view, all terms have the same contribution.

Now, in reality it could happen that x1 is generally higher than x3, for instance. This would be a simple ANOVA-type question and has nothing to do with the fact that the three x's are bound together in such an equation.

regards
rogojel
That makes sense! Thank you!
 

noetsi

Fortran must die
#19
This is what John Fox says about perfect collinearity (you can find similar comments in other books).

"Thus...[if perfect collinearity exists] than the denominators of b1 and b2 in Equation 3.2 is zero and these coefficients [the slopes] are undefined. More properly, there is an infinity of pairs of values of b1 and b2 that satisfy the normal equation." p 10 in Regression Diagnostics

SAS will not even run when this occurs, and SPSS gives an error message.
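
For contrast with this thread's setup, a sketch of perfect collinearity among the predictors themselves (hypothetical construction: x3 is exactly x1 + x2). X'X is then rank-deficient, so the normal equations have no unique solution, which is exactly the situation Fox describes:

Code:
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 1000)
x2 = rng.uniform(0, 10, 1000)
x3 = x1 + x2  # an exact linear combination of x1 and x2
X = np.column_stack([x1, x2, x3])

# X'X is rank-deficient: infinitely many coefficient vectors fit equally well.
print(np.linalg.matrix_rank(X.T @ X))  # 2, not 3

# The condition number is effectively infinite; inverting X'X is meaningless.
print(np.linalg.cond(X.T @ X))  # astronomically large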
 

Dason

Ambassador to the humans
#20
Note that in this case all of the predictors are linearly independent of one another, so we don't fall into that particular situation: there is no perfect collinearity among the predictors themselves. The exact linear relationship here involves Y, not just the X's.