Either or. Even if the "true" R^2 is 0 (which is the case for X ~ Unif(-a, a), Y = X^2) there can be dependence.
I don't have emotions and sometimes that makes me very sad.
But that isn't true. The R^2 is by definition not zero in that case. It can be estimated to be zero, but in that case the model used to predict Y is seriously flawed.
I agree on this: if R^2 is estimated to be 0, then we cannot conclude that there is no dependence. But if the true unknown R^2 is 0, then there is no dependence. I would say that R^2 is 1 in the example you refer to, because there is a perfect relationship between the variables described. But if you try to fit a linear model, then the observed R^2 will be far from 1.
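To make the X ~ Unif(-a, a), Y = X^2 example concrete, here is a minimal simulation sketch. The value a = 2, the sample size, and the helper name `r_squared` are assumptions chosen for illustration, not anything from the thread. It suggests that the answer depends on which model the R^2 refers to: the straight-line fit of Y on X gives an R^2 near 0, while a fit of Y on X^2 gives an R^2 of 1.

```python
import random

def r_squared(xs, ys):
    """R^2 of the least-squares line of ys on xs (squared Pearson correlation)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

random.seed(0)
a = 2.0  # assumed half-width of the uniform range
xs = [random.uniform(-a, a) for _ in range(100_000)]
ys = [x ** 2 for x in xs]  # Y is a deterministic function of X

# Straight-line fit of Y on X: essentially no linear association.
r2_linear = r_squared(xs, ys)

# Fit of Y on the transformed predictor X^2: a perfect fit.
r2_quadratic = r_squared([x ** 2 for x in xs], ys)

print(f"R^2, Y on X:   {r2_linear:.5f}")
print(f"R^2, Y on X^2: {r2_quadratic:.5f}")
```

So both posts can be read as correct about different quantities: one is talking about the R^2 of the linear model, the other about the R^2 of the correctly specified model.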
Consider the following distribution:
In this case the best "regression" we could come up with is predicting y = 0 regardless of x. But there is still dependence here. For if we know x = 0 then we KNOW y = 0. If we know x = 1 then y could be either -1 or 1. So the distribution of Y depends on the value of X. So R^2 is 0 but there is still dependence.
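The distribution itself did not survive in the post above, so the following is only a sketch with a hypothetical joint distribution that matches the description: P(X=0, Y=0) = 1/2, P(X=1, Y=-1) = 1/4, P(X=1, Y=1) = 1/4. These probabilities are my assumption. Under them the conditional mean of Y is 0 for every x and the covariance is 0, yet the distribution of Y clearly changes with X.

```python
from fractions import Fraction

# Hypothetical joint pmf (an assumption, chosen to match the description in the post).
pmf = {
    (0, 0): Fraction(1, 2),
    (1, -1): Fraction(1, 4),
    (1, 1): Fraction(1, 4),
}

ex = sum(p * x for (x, _), p in pmf.items())
ey = sum(p * y for (_, y), p in pmf.items())
cov = sum(p * (x - ex) * (y - ey) for (x, y), p in pmf.items())

def cond_mean_y(x0):
    """E[Y | X = x0] under the pmf above."""
    px = sum(p for (x, _), p in pmf.items() if x == x0)
    return sum(p * y for (x, y), p in pmf.items() if x == x0) / px

# Marginal vs. conditional probability that Y = 0.
p_y0 = sum(p for (_, y), p in pmf.items() if y == 0)
p_x0 = sum(p for (x, _), p in pmf.items() if x == 0)
p_y0_given_x0 = pmf[(0, 0)] / p_x0

print("Cov(X, Y)      =", cov)
print("E[Y | X=0]     =", cond_mean_y(0))
print("E[Y | X=1]     =", cond_mean_y(1))
print("P(Y=0)         =", p_y0)
print("P(Y=0 | X=0)   =", p_y0_given_x0)
```

Since P(Y=0 | X=0) differs from P(Y=0), X and Y are dependent even though no regression of Y on X can do better than predicting 0.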
GretaGarbo (07-31-2012)
A friend of mine surprised me when he said:
“You can choose to get as large an R^2 as you want.”
“What?!” I said.
“If there is a linear relationship between x and y and you can design where to put the x-values, then just by stretching out the x-values far enough you will get a large enough R^2 value,” he said.
Another aspect is that if you have an observational study in which the x-values have not varied much (they are often roughly constant in observational studies), then the R^2 will be low. That does not mean that the model is bad. It can be a good description of reality. A good model is one that fits the data, whether R^2 is high or low. Lack-of-fit measures are far more important than R^2.
The residual variance also has an influence on R^2 (by increasing the residual sum of squares). So you can make a two-by-two “table” or graph with high and low variation in the x-values and with high and low residual variation. I think that is more important to think about than the R^2 itself.
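The two-by-two idea can be sketched with a small simulation. Everything specific here is an assumption for illustration: the slope beta = 1, evenly spaced designed x-values on [-w, w] with w = 1 or 10, residual standard deviations of 1 or 5, and the helper names `r_squared` and `simulate_r2`. The same slope gives very different R^2 values depending only on the spread of x and the residual variation.

```python
import random

def r_squared(xs, ys):
    """R^2 of the least-squares line of ys on xs (squared Pearson correlation)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

def simulate_r2(half_width, sigma, beta=1.0, n=20001, seed=0):
    """Simulate y = beta * x + noise on a designed x-grid and return R^2."""
    rng = random.Random(seed)
    # Designed, evenly spaced x-values on [-half_width, half_width].
    xs = [-half_width + 2 * half_width * i / (n - 1) for i in range(n)]
    ys = [beta * x + rng.gauss(0.0, sigma) for x in xs]
    return r_squared(xs, ys)

# Two-by-two grid: (narrow, wide) x-range crossed with (low, high) residual sd.
results = {
    (w, s): simulate_r2(w, s)
    for w in (1.0, 10.0)
    for s in (1.0, 5.0)
}

for (w, s), r2 in sorted(results.items()):
    print(f"x half-width = {w:>4}, residual sd = {s:>3}: R^2 = {r2:.3f}")
```

In every cell the true slope is identical, so a low R^2 in the narrow-range cells says nothing about whether the linear model is a good description.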
I would be primarily concerned with the parameter estimates and whether they are significant, the standard deviation of the residuals, and lack-of-fit measures.
I don’t think that R^2 is understood by 99 percent. I think it is overemphasized and misused.
Besides, I think it gives increased confidence if someone talks about both a model's strengths AND its weaknesses. This is valid for statistical investigations and used car sellers alike.
Greta! Go get one more post and then you'll have a surprise on the TalkStats homepage for you.
The usual approach is to think of regression parameters like “beta” and “sigma” as having population values that can be estimated from a sample.
But does R^2 have a population value? I have never heard of that.
Think of a linear regression model with a nonzero slope (beta).
Imagine a first experiment with the x-values in a narrow range. That will give one R^2 value.
Imagine a second experiment with exactly the same parameters but with the x-values in a wider range. That will give a higher R^2 value for exactly the same parameter values, that is, for the same population beta and sigma values.
No, I don’t think it is meaningful to think of R^2 as a population parameter.
I think of R^2 as a simple descriptive statistic for the data at hand.
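The narrow-range versus wide-range argument can be written out as a rough large-sample approximation. This is a sketch assuming the simple model y = alpha + beta x + error with error variance sigma^2; the symbol s_x^2 for the spread of the chosen x-values is my notation, not from the posts.

```latex
R^2 \;\approx\; \frac{\beta^2 s_x^2}{\beta^2 s_x^2 + \sigma^2},
\qquad
s_x^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 .
```

With beta and sigma held fixed and beta nonzero, the right-hand side goes toward 0 as s_x^2 shrinks and toward 1 as s_x^2 grows, so the value depends on the design of the x-values and not only on the regression parameters.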