Low R-squared, but my variables fit expectations

So I ran regressions comparing three different non-survey estimation techniques against actual survey data. In each regression the dependent variable was the non-survey estimate and the independent variable was the survey data. The null hypotheses were that the intercept term would be zero and the beta (coefficient) would be 1 (both would hold in the case of a "perfect fit" for my models), with the alternative hypotheses being otherwise.
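For readers who want to see the setup concretely, here is a minimal sketch of the two t tests described above (intercept against 0, slope against 1), using only the standard library. The function name `ols_with_t_tests` and the numbers are made up for illustration; they are not the data from this question. A joint F test of both restrictions at once would be an alternative to two separate t tests.

```python
import math

def ols_with_t_tests(x, y):
    """Fit y = a + b*x by least squares and return t statistics for
    H0: a = 0 and H0: b = 1 (the "perfect fit" nulls)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope estimate
    a = mean_y - b * mean_x            # intercept estimate
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = sse / (n - 2)             # residual variance estimate
    se_b = math.sqrt(sigma2 / sxx)
    se_a = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))
    return {
        "intercept": a,
        "slope": b,
        "t_intercept": (a - 0.0) / se_a,   # tests intercept = 0
        "t_slope": (b - 1.0) / se_b,       # tests slope = 1, not slope = 0
        "df": n - 2,
    }

# Hypothetical numbers, purely illustrative.
survey = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
estimate = [1.2, 1.7, 3.4, 3.9, 5.1, 5.8]
result = ols_with_t_tests(survey, estimate)
print(result)
```

Note that the slope's t statistic is measured against 1 rather than the usual default of 0, which is what most regression output reports.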

Two of my models rejected the null hypotheses, which is not really a surprise; we expected the non-survey estimates to be way off. The third model failed to reject the nulls: both the intercept and the beta had fantastic values, with low t values indicating no statistically significant difference from the hypothesized values of 0 and 1 (t values of 1.88 and 1.99, respectively). However, the coefficient of determination, R2, was extremely low, <0.05 (this was the case for all three regressions, but my main concern is with the model that performed well).

My big question is how to explain why the R2 was so low while the coefficient and intercept fit expectations. I ran the data through a few other tests. The other goodness-of-fit measure, I (detailed below), showed poor goodness of fit. Theil's inequality coefficient, U, indicated that a naive shot-in-the-dark could have performed better than the models (U > 2), with the bias proportion (UM) being quite low (0.005) but the variance proportion (US) moderately high (0.404).

where X is the survey variable and X* is the non-survey variable.
Interpretation: values of I closer to 0 indicate better goodness of fit (0 = perfect fit, i.e. Xi = X*i); higher values indicate poorer goodness of fit.

Now, my inclination is that the low R2 results from the high amount of variance, as indicated by the US value. Another possible explanation might be that the observations are all close to zero and thus there may be weighting issues with the small values, but I'm not sure that's as problematic given that the bulk of the observations are all quite small.
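To make the UM and US figures easier to reason about, here is a rough sketch of the standard three-way decomposition of mean squared error into bias, variance, and covariance shares. I am assuming this is the decomposition behind the proportions quoted above; the function name `theil_proportions` and the numbers are hypothetical.

```python
import math

def theil_proportions(actual, predicted):
    """Split MSE into bias (UM), variance (US) and covariance (UC) shares.
    Uses population (divide-by-n) moments so the three shares sum to 1."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    sd_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    cov = sum((a - mean_a) * (p - mean_p)
              for a, p in zip(actual, predicted)) / n
    r = cov / (sd_a * sd_p)            # correlation of actual and predicted
    mse = sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n
    um = (mean_p - mean_a) ** 2 / mse              # bias share
    us = (sd_p - sd_a) ** 2 / mse                  # variance share
    uc = 2.0 * (1.0 - r) * sd_a * sd_p / mse       # covariance share
    return um, us, uc

# Hypothetical numbers, not the data from this question.
survey = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
estimate = [1.2, 1.7, 3.4, 3.9, 5.1, 5.8]
um, us, uc = theil_proportions(survey, estimate)
print(um, us, uc, um + us + uc)
```

If my proportions do follow this split, the remaining covariance share would be roughly 1 - 0.005 - 0.404 = 0.591, i.e. most of the error would be unsystematic rather than due to bias or mismatched spread.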

So does the high amount of variance seem like a plausible explanation, or am I way off? If I'm off, what other explanations seem appropriate?


Have you checked any of the diagnostic plots for unusual structure? You might look for fat tails in your histogram of residuals, which could indicate a bimodal population. In such a case, the estimated coefficient might be what you expect, but it's really the average of two distinct populations.
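As a toy illustration of the two-population scenario, the sketch below pools two made-up groups whose true slopes are 0.5 and 1.5. The pooled fit lands on a slope of 1 and an intercept of 0, yet the residuals fall into two clusters with nothing near zero. The helper `fit_line` and all numbers are hypothetical, chosen so the arithmetic is exact.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Two hypothetical sub-populations observed at the same x values.
x_base = [4.0, 5.0, 6.0, 7.0]
x_pooled = x_base + x_base
y_pooled = [0.5 * v for v in x_base] + [1.5 * v for v in x_base]

intercept, slope = fit_line(x_pooled, y_pooled)
residuals = [yi - (intercept + slope * xi)
             for xi, yi in zip(x_pooled, y_pooled)]

# Crude three-bin "histogram" of the residuals.
below = sum(1 for e in residuals if e <= -1.0)
near_zero = sum(1 for e in residuals if -1.0 < e < 1.0)
above = sum(1 for e in residuals if e >= 1.0)
print(intercept, slope, (below, near_zero, above))
```

A real residual histogram will be noisier than this, but an empty or thin middle with mass on both sides is the kind of structure worth looking for.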

R-squared is simply one minus the ratio of the residual (unexplained) variation to the total variation in the dependent variable. A low R-squared indicates a significant amount of unexplained variance in the model, but that can be caused by a lot of things. For example, the population might be bimodal, or the model might make systematic errors at the beginning, middle, or end of the range of values, among many other possibilities. The diagnostic plots should help you identify what's going on.
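To see how unexplained variance alone can push R-squared down while the coefficients stay on target, here is a small sketch with made-up data: every x value is observed twice, once with +5 and once with -5 added to a true line of slope 1 and intercept 0. The function name `r_squared_demo` and the numbers are illustrative assumptions only.

```python
def r_squared_demo(x, y):
    """Fit y = a + b*x and return (intercept, slope, r_squared),
    with R-squared computed as 1 - SSE/SST."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - sse / sst

# Hypothetical data: true line y = x, symmetric noise of +/-5 at each x.
x_base = [1.0, 2.0, 3.0, 4.0]
x = x_base + x_base
y = [v + 5.0 for v in x_base] + [v - 5.0 for v in x_base]

intercept, slope, r2 = r_squared_demo(x, y)
print(intercept, slope, r2)
```

Here SSE is 200 and SST is 210, so R-squared is 10/210, a little under 0.05, even though the fitted intercept and slope are exactly 0 and 1. Noise that is large relative to the spread of x is enough to produce this pattern without any misspecification.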