Earlier today I was typing up a reply to this thread (LINK), but since I don't have a lot of experience with so-called fixed-effects dummy regression (I just opt for mixed-effects models), I decided to verify some of what I was about to say with a few quick simulations. The simulations, however, have been giving me surprising and puzzling results that I cannot explain.
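(For context, by "mixed-effects models" I mean something along the lines of the minimal lme4 sketch below, fit to a dataset dat of the form generated by the simulation function further down. This is just to show my usual approach; it is not part of the simulations.)

Code:
library(lme4)
# random intercept and random predictor slope for each state; dat here is
# assumed to come from the dummySim() function defined below, e.g.
# dat <- dummySim(n_state=51, obs=6, intMean=30, intSD=3,
#                 slopeMean=0, slopeSD=0.5, errorSD=1)
mixedModel <- lmer(y ~ predictor + (1 + predictor | factor(stateID)), data=dat)
summary(mixedModel)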
I am simulating data from 51 states, each with 6 observations on a DV (if context helps, in the thread above the DV was some measure of crime) and some arbitrary predictor. The function that I wrote to generate the random datasets is:

Code:
dummySim <- function(n_state, obs, intMean, intSD, slopeMean, slopeSD, errorSD){
  # draw a random intercept and a random predictor slope for each state
  states <- cbind(stateID = seq(n_state),
                  stateInt = rnorm(n_state, mean=intMean, sd=intSD),
                  stateSlope = rnorm(n_state, mean=slopeMean, sd=slopeSD))
  # repeat each state's row 'obs' times (kronecker drops the column names,
  # restored below) and add the observation-level predictor
  X <- cbind.data.frame(kronecker(states, rep(1, obs)),
                        predictor = rnorm(n_state*obs))
  # DV = state intercept + state slope * predictor + residual error
  X$y <- X[,2] + X[,3]*X[,4] + rnorm(n_state*obs, sd=errorSD)
  names(X)[1:3] <- colnames(states)
  return(X)
}

So a small example dataset generated by this function would look like:

Code:
> dummySim(n_state=5, obs=4, intMean=30, intSD=3,
+          slopeMean=0.5, slopeSD=0.25, errorSD=2)
   stateID stateInt stateSlope   predictor        y
1        1 26.49421  0.9201386  0.66002234 28.51786
2        1 26.49421  0.9201386  0.95562609 26.20838
3        1 26.49421  0.9201386  1.30289558 27.13236
4        1 26.49421  0.9201386  1.05802826 28.27305
5        2 32.18166  0.4510010 -0.28521592 31.40207
6        2 32.18166  0.4510010 -1.26616689 28.88641
7        2 32.18166  0.4510010  0.78042469 35.71197
8        2 32.18166  0.4510010 -0.91565754 30.21702
9        3 35.40669  0.2236030  0.94620460 38.57545
10       3 35.40669  0.2236030  0.03890289 34.37247
11       3 35.40669  0.2236030  0.14026280 38.67710
12       3 35.40669  0.2236030  0.16238382 36.72443
13       4 34.54017  1.0061474  0.72663607 35.71450
14       4 34.54017  1.0061474 -2.11539631 34.02819
15       4 34.54017  1.0061474 -0.23600889 33.19148
16       4 34.54017  1.0061474 -1.66420728 30.27936
17       5 30.73187  0.2018658  0.32648215 31.97050
18       5 30.73187  0.2018658  0.93705671 31.40152
19       5 30.73187  0.2018658 -0.74421331 32.80946
20       5 30.73187  0.2018658 -2.82270689 28.07296

In my simulations of type 1 error rates, I systematically vary the between-state variance in the DV (i.e., the variance of the state intercepts) and the state*predictor interaction variance (i.e., the variance of the state slopes). Both of these variance components range from 0 to fairly substantial population variance. For each dataset, I fit 3 models:
- A simple OLS regression; no dummy variables, random effects, or anything like that. This is a model that completely ignores the nonindependence in the data.
- A regression with a dummy-coded factor representing the 51 states (so there are 50 dummies). This model equates the states on average crime rates (or whatever the DV is taken to represent), but assumes zero or close to zero variance in the state slopes.
- A model with the state dummies AND all interactions between the dummies and the predictor variable. This model allows states to have different intercepts and different slopes. (As you will see in the code below, technically the state factor in this case is not a "dummy" variable; I use Helmert contrast codes so that the predictor slope is estimated on average across all of the states, rather than at the level of the baseline state. See the short illustration right after this list.)
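To illustrate that contrast-coding point, here is a minimal standalone sketch (a made-up 3-state toy example, separate from the simulations): with the default treatment (dummy) coding, the "x" coefficient is the slope in the baseline state, whereas with Helmert codes it is the unweighted average of the per-state slopes.

Code:
set.seed(1)
toy <- data.frame(stateID = factor(rep(1:3, each=20)), x = rnorm(60))
# true slopes of 1, 2, and 3 in states 1, 2, and 3
toy$y <- rep(1:3, each=20)*toy$x + rnorm(60)

# default treatment (dummy) coding: "x" is the slope in the baseline state
coef(lm(y ~ stateID*x, data=toy))["x"]

# Helmert coding: "x" is the average slope across the 3 states
contrasts(toy$stateID) <- contr.helmert(3)
coef(lm(y ~ stateID*x, data=toy))["x"]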
The results that I expected were:
- When both variance components (the variances of the state intercepts and of the state slopes) were 0, the type 1 error rates for all 3 models would approximate the nominal alpha level of .05.
- As state intercept variance increased, the OLS regression would have an increasingly anticonservative bias (type 1 error rate > nominal alpha level), but the two dummy models would not be affected.
- As state slope variance increased, the OLS regression and the dummy model without the predictor interactions would both show an increasingly anticonservative bias, but the third model would not be affected.
- Possibly there would be a multiplicative effect on the bias when increasing both variance components (I was not really sure), but the dummy model with interactions should still be well behaved.
Now here is the code I used for my simulations and the results I obtained. In the code, the three models from above are named noDum, Dum, and Full, respectively. In the results, the "b" prefix marks the average parameter estimate and the "p" prefix marks the observed type 1 error rate.

Code:
iterate <- function(intSD, slopeSD){
  dat <- dummySim(n_state=51, obs=6, intMean=30, intSD=intSD,
                  slopeMean=0, slopeSD=slopeSD, errorSD=1)
  # model 1: simple OLS regression ignoring states
  noDumModel <- lm(y ~ predictor, data=dat)
  # model 2: state dummies + predictor
  dat$stateID <- factor(dat$stateID)
  dumModel <- lm(y ~ stateID + predictor, data=dat)
  # model 3: Helmert-coded state factor crossed with the predictor
  contrasts(dat$stateID) <- contr.helmert(51)
  fullModel <- lm(y ~ stateID*predictor, data=dat)
  bNoDum <- coef(summary(noDumModel))["predictor","Estimate"]
  bDum   <- coef(summary(dumModel))["predictor","Estimate"]
  bFull  <- coef(summary(fullModel))["predictor","Estimate"]
  pNoDum <- coef(summary(noDumModel))["predictor","Pr(>|t|)"] <= .05
  pDum   <- coef(summary(dumModel))["predictor","Pr(>|t|)"] <= .05
  pFull  <- coef(summary(fullModel))["predictor","Pr(>|t|)"] <= .05
  return(c(bNoDum=bNoDum, bDum=bDum, bFull=bFull,
           pNoDum=pNoDum, pDum=pDum, pFull=pFull))
}

results <- replicate(10000, iterate(intSD=0, slopeSD=0))
round(apply(results, 1, mean), 4)
# bNoDum   bDum  bFull pNoDum   pDum  pFull
# 0.0002 0.0006 0.0009 0.0494 0.0522 0.0517

results <- replicate(10000, iterate(intSD=3, slopeSD=0))
round(apply(results, 1, mean), 4)
# bNoDum   bDum  bFull pNoDum   pDum  pFull
# 0.0017 0.0000 0.0002 0.0496 0.0556 0.0553

results <- replicate(10000, iterate(intSD=6, slopeSD=0))
round(apply(results, 1, mean), 4)
# bNoDum   bDum  bFull pNoDum   pDum  pFull
# 0.0048 0.0009 0.0005 0.0497 0.0497 0.0479

results <- replicate(10000, iterate(intSD=0, slopeSD=0.5))
round(apply(results, 1, mean), 4)
# bNoDum    bDum  bFull pNoDum   pDum  pFull
# 0.0001 -0.0003 0.0007 0.2032 0.1819 0.1425

results <- replicate(10000, iterate(intSD=3, slopeSD=0.5))
round(apply(results, 1, mean), 4)
#  bNoDum   bDum   bFull pNoDum   pDum  pFull
# -0.0002 0.0004 -0.0003 0.0717 0.1777 0.1456

results <- replicate(10000, iterate(intSD=6, slopeSD=0.5))
round(apply(results, 1, mean), 4)
#  bNoDum   bDum  bFull pNoDum   pDum  pFull
# -0.0044 0.0006 0.0004 0.0592 0.1900 0.1433

results <- replicate(10000, iterate(intSD=0, slopeSD=1))
round(apply(results, 1, mean), 4)
# bNoDum   bDum  bFull pNoDum   pDum  pFull
# 0.0023 0.0016 0.0013 0.3530 0.3248 0.3303

results <- replicate(10000, iterate(intSD=3, slopeSD=1))
round(apply(results, 1, mean), 4)
#  bNoDum    bDum   bFull pNoDum   pDum  pFull
# -0.0010 -0.0017 -0.0012 0.1291 0.3242 0.3299

results <- replicate(10000, iterate(intSD=6, slopeSD=1))
round(apply(results, 1, mean), 4)
# bNoDum    bDum   bFull pNoDum   pDum  pFull
# 0.0031 -0.0008 -0.0001 0.0719 0.3352 0.3306

As you can see, this is not at all what I was expecting. First, the OLS regression almost always showed the least anticonservative bias. Second, increasing the state intercept variance while keeping the state slope variance at 0 did not affect the error rates of any of the models. Third, increasing the state slope variance increased the bias in all of the models. Finally, when there was positive variance in the state slopes, increasing the variance in the state intercepts actually decreased the bias of the OLS regression, while leaving the other two models unaffected.
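(In case it matters for reading the numbers above: with 10000 replicates, the Monte Carlo standard error of an estimated rejection rate p is sqrt(p*(1-p)/10000), roughly .002 near the nominal .05 level, so differences like .05 versus .18 are far beyond simulation noise. A quick sketch using the results matrix from the last run above:)

Code:
# Monte Carlo standard errors of the three rejection-rate estimates
# (rows 4-6 of results hold the pNoDum, pDum, pFull indicators)
p <- apply(results[4:6, ], 1, mean)
round(sqrt(p*(1 - p)/ncol(results)), 4)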
I am currently at a loss to explain these simulation results. I am hoping that some of you can help me figure out what is going on here.
In God we trust. All others must bring data.
~W. Edwards Deming