Minimum number of cases for the reference level

noetsi

Fortran must die
#1
Before I ask this: I have the true population rather than a sample, so I am not sure this issue even applies :p

Is there a minimum number of cases the reference level must have for the regression analysis to be valid? I have a population of about 11,000 cases. For substantive reasons, the reference level that makes the most sense turns out to have only about a hundred cases. I would prefer to keep it as the reference level for analytical reasons, but having so few cases in it makes me nervous. There are about 19 variables in the model.

I have not seen anything for linear regression that says how many cases you need, or whether the ratio of cases in the reference level to cases in one of the related dummies matters.
 

Jake

Cookie Scientist
#2
No. I mean, you need at least 1 in the reference group (and more than 1 in at least one of the other groups). Any more than that is bonus.
 

hlsmith

Omega Contributor
#3
Side note: many things run better with balanced designs, but balance is not mandatory. I believe the SE may get a little larger in some cases (say, for odds ratios; in case/control studies there are optimal tradeoffs for power), and some things take longer to converge (e.g., mixed models). But this may be one of those situations where the effects are stable enough and confidence is what gets compromised.


If you have reservations, just keep that in mind during your interpretations. If you are only interested in that particular variable, you could always balance the covariates (e.g., all 19) via matching or propensity scores.
 

Jake

Cookie Scientist
#4
For a fixed total sample size, parameter estimates are most precise when the data are balanced across all categorical factors. So, for example, it is more efficient to have n=10 in each of two groups than to have n=5 in one group and n=15 in the other. But if the comparison is between, say, having n=10 in both groups vs. having n=10 in one group and n=50 in the other group (in other words, if we are not talking about a fixed total sample size), then the latter will be more efficient, all other things equal, owing to the larger sample size. But the basic point is that there is nothing inherently bad about having data that are highly unbalanced across the categorical predictors.
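To put numbers on the two comparisons above, here is a minimal sketch. It assumes the simplest case: a single two-group comparison of means with a common standard deviation (`sigma`, set to 1 purely for illustration), where the standard error of the difference is sigma * sqrt(1/n1 + 1/n2). The function name `se_of_difference` is made up for this example, and a model with 19 predictors will not reduce to this exactly.

```python
import math

def se_of_difference(n1, n2, sigma=1.0):
    """Standard error of the difference between two group means,
    assuming both groups share the same standard deviation sigma."""
    return sigma * math.sqrt(1.0 / n1 + 1.0 / n2)

# Fixed total of 20 cases: balanced vs. unbalanced split.
balanced_20 = se_of_difference(10, 10)
unbalanced_20 = se_of_difference(5, 15)

# Total not fixed: keep n=10 in one group, grow the other to n=50.
unbalanced_60 = se_of_difference(10, 50)

for label, se in [("10 vs 10", balanced_20),
                  ("5 vs 15", unbalanced_20),
                  ("10 vs 50", unbalanced_60)]:
    print(f"{label}: SE of difference = {se:.3f}")
```

With the total fixed at 20, the balanced split gives the smaller standard error. Once the total is allowed to grow, the unbalanced 10 vs 50 design beats both, because extra cases never hurt precision.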
 

noetsi

Fortran must die
#5
That is good to know, because the non-experimental designs I work with have huge variances between groups. On the other hand, I commonly have at least hundreds if not thousands of cases. An issue I had not thought about concerns power. I always thought of power as depending on the total number of cases in the design (in this example that would be over 11,000). But is power instead tied to the number of cases in one of the subgroups (here, the reference level of one dummy variable)? I assume this would only affect the power for that dummy variable, not the overall model or other variables in the model.

But obviously I am not certain :p
 

Jake

Cookie Scientist
#6
Power is a joint function of the total sample size and the degree of balance across the predictor categories (because the latter affects the degree of multicollinearity -- the more unbalanced the groups, the more collinear are the predictors).
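A rough way to see how total sample size and balance combine is the sketch below. It is a simplification, not the full 19-variable model: it treats the comparison as a two-sided z-test between two group means with known common `sigma`, and the effect size (`effect=0.2`), `alpha`, and the function name `approx_power` are all assumptions chosen for illustration. The group sizes 100 and 10,900 mirror the roughly 11,000 cases and roughly 100 reference cases mentioned earlier in the thread.

```python
import math
from statistics import NormalDist

def approx_power(n1, n2, effect=0.2, sigma=1.0, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference
    between two group means (known, common sigma)."""
    z = NormalDist()  # standard normal distribution
    se = sigma * math.sqrt(1.0 / n1 + 1.0 / n2)
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = effect / se
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Same total (11,000), different balance.
power_balanced = approx_power(5500, 5500)
power_unbalanced = approx_power(100, 10900)

# Same small group (100), much smaller total.
power_small_balanced = approx_power(100, 100)

print(f"5500 vs 5500:  {power_balanced:.3f}")
print(f"100 vs 10900:  {power_unbalanced:.3f}")
print(f"100 vs 100:    {power_small_balanced:.3f}")
```

Under these assumptions the small group dominates the standard error, so power for that one contrast is far lower than the total of 11,000 would suggest, yet still higher than with only 100 cases in each group.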
 

hlsmith

Omega Contributor
#7
Jake said: "Power is a joint function of the total sample size and the degree of balance across the predictor categories (because the latter affects the degree of multicollinearity -- the more unbalanced the groups, the more collinear are the predictors)."

Can you provide a hypothetical example of this? In my mind it seems more like a sparsity of data in subgroups. With continuous variables you have variables held at their mean, but in categorical scenarios you have multiple variables set at their reference group, plus the potential sucking up of degrees of freedom.


Though your description almost seems more like confounding. I guess it could be like collinearity if some subgroups have so many dimensions that no one in the group has the outcome, which would be a breach of positivity, so multiple variables seem linked to the outcome simply by being in the same subgrouping.
 

noetsi

Fortran must die
#8
I know that the number of subgroups is important to power calculations and that you need a certain sample size per group [I know because I have seen the calculations; I don't know why this is true]. But I don't really understand why a balanced or unbalanced design would matter for power.