"There is a number of these univariate correlations in the data that do not fit the model (out of the thousands, there would be)"-researcher

Basically, my understanding is that the criticism of his research came from someone who analyzed the correlations between a few variables in his data and found that the connection he claimed between X and Y, Z was not what he said it was. His response was that you needed to run the regression with thousands of variables to get the true picture. My initial thought was that he was wrong: if something doesn't show a connection in a limited number of variables, why would you "find" a meaningful connection later?

So I ran a little experiment on a pretty popular, unrelated data set: the baseball hitting statistics from 2000-2008 (team.batting.00to08), contained in the nutshell package for R. This is essentially the Moneyball model.

The full linear regression model, with all the statistics included, looks like this:

Call:
lm(formula = runs ~ singles + doubles + triples + homeruns + walks +
    hitbypitch + sacrificeflies + stolenbases + caughtstealing,
    data = team.batting.00to08)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)    -507.16020   32.34834 -15.678  < 2e-16 ***
singles           0.56705    0.02601  21.801  < 2e-16 ***
doubles           0.69110    0.05922  11.670  < 2e-16 ***
triples           1.15836    0.17309   6.692 1.34e-10 ***
homeruns          1.47439    0.05081  29.015  < 2e-16 ***
walks             0.30118    0.02309  13.041  < 2e-16 ***
hitbypitch        0.37750    0.11006   3.430 0.000702 ***
sacrificeflies    0.87218    0.19179   4.548 8.33e-06 ***
stolenbases       0.04369    0.05951   0.734 0.463487
caughtstealing   -0.01533    0.15550  -0.099 0.921530
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 23.21 on 260 degrees of freedom
Multiple R-squared: 0.9144, Adjusted R-squared: 0.9114
F-statistic: 308.6 on 9 and 260 DF, p-value: < 2.2e-16

I then tried building it up, imagining what would have happened if I had collected only two statistics to compare against runs. Using just singles and doubles, I still got a high degree of significance, but the R-squared was cut to a little over one third of its full-model value.

Call:
lm(formula = runs ~ singles + doubles, data = team.batting.00to08)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -22.88514   70.91583  -0.323    0.747
singles       0.39419    0.06201   6.357 8.86e-10 ***
doubles       1.38342    0.14626   9.458  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 63.13 on 267 degrees of freedom
Multiple R-squared: 0.3497, Adjusted R-squared: 0.3448
F-statistic: 71.8 on 2 and 267 DF, p-value: < 2.2e-16
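The R-squared part of this, at least, is not mysterious: for nested OLS models, adding predictors can only increase in-sample R-squared. A minimal Python/numpy sketch on invented data (not the batting set; the three predictors and their coefficients are made up) shows the climb:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Three invented predictors that each genuinely contribute to the
# response (synthetic data, not the batting statistics).
X = rng.normal(size=(n, 3))
y = X @ np.array([0.6, 0.7, 1.2]) + rng.normal(size=n)

def r_squared(cols):
    """OLS of y on an intercept plus the chosen columns; return R^2."""
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1 - (resid @ resid) / tss

print(r_squared([0]))        # one predictor
print(r_squared([0, 1]))     # two predictors: higher
print(r_squared([0, 1, 2]))  # all three: highest
```

What surprised me is not that R-squared rose on the way to the full model, but what happened to the coefficients along the way.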

Next, I added triples, expecting the same degree of significance to show up in this model as in the full one; instead, triples came out negatively correlated with runs.

Call:
lm(formula = runs ~ singles + doubles + triples, data = team.batting.00to08)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -22.10399   71.11347  -0.311    0.756
singles       0.39602    0.06257   6.329 1.04e-09 ***
doubles       1.38603    0.14691   9.434  < 2e-16 ***
triples      -0.10873    0.44650  -0.244    0.808
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 63.25 on 266 degrees of freedom
Multiple R-squared: 0.3499, Adjusted R-squared: 0.3425
F-statistic: 47.72 on 3 and 266 DF, p-value: < 2.2e-16

Summarizing: model 1 was the full model, model 2 used singles and doubles, and model 3 added triples to singles and doubles. I expected the model to build up gradually to its high R-squared, with the coefficients staying roughly the same, but in model 3 triples show up as not significant and negatively correlated with runs, while in the full model they show up as positively correlated and highly significant. Is there an explanation for this phenomenon? Is it a problem similar to the criticism of the researcher's data based on a few univariate correlations?
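For reference, the same kind of sign flip can be reproduced on purely synthetic data: two predictors that each truly increase the response but are strongly negatively correlated with each other, so that dropping one flips the sign of the other's estimate. This is a minimal numpy sketch with invented variables and coefficients, not the baseball data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two predictors that BOTH truly increase y, but are strongly
# negatively correlated with each other (invented data).
x1 = rng.normal(size=n)
x2 = -x1 + 0.3 * rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x2 + 0.5 * rng.normal(size=n)

# Full model: y ~ 1 + x1 + x2
X_full = np.column_stack([np.ones(n), x1, x2])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Reduced model omitting x2: y ~ 1 + x1
X_red = np.column_stack([np.ones(n), x1])
beta_red, *_ = np.linalg.lstsq(X_red, y, rcond=None)

# x1's coefficient is positive (near its true value 1.0) in the full
# model, but omitting the correlated predictor x2 pushes the
# reduced-model estimate negative.
print("full model x1 coefficient:   ", beta_full[1])
print("reduced model x1 coefficient:", beta_red[1])
```

Whether that is actually what is going on with triples in the batting data is exactly what I am asking.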