Sample size of Categorical Variables


I'm running a binary logistic model that includes categorical variables.

One of my main covariates has 8 categories, and some of the within-category sample sizes are small (e.g. 25 people) relative to the total sample of 8,537 people. Worse, when you restrict to those with the outcome of interest, only 1 of those 25 people remains.

Does the small sample of a category within a categorical variable create any problems? If so what is a good cut-off point for groups within categorical variables?

Any suggestions would be much appreciated. Thank you


Mean Joe

TS Contributor
Does the small sample of a category within a categorical variable create any problems? If so what is a good cut-off point for groups within categorical variables?
Make sure the category with only 1 person is not the reference category, or the odds ratios are likely to come out as +/- infinity.
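The mechanics of this are easy to see from the cross-tabulation: if a category's cell for one outcome is empty, the odds ratio against the reference involves a division by zero. A minimal sketch with hypothetical counts (not the poster's data):

```python
# Hypothetical (events, non_events) counts per category of the covariate.
# "cat_h" has only 1 person, and that person had the event.
counts = {
    "reference": (900, 7000),
    "cat_h":     (1, 0),
}

def odds_ratio(category, reference="reference"):
    """Odds ratio of `category` vs the reference, straight from the 2x2 table."""
    e1, n1 = counts[category]
    e0, n0 = counts[reference]
    try:
        return (e1 / n1) / (e0 / n0)
    except ZeroDivisionError:
        return float("inf")  # empty cell: the odds ratio is not estimable

print(odds_ratio("reference"))  # 1.0 by construction
print(odds_ratio("cat_h"))      # inf: the sparse category has no non-events
```

A maximum-likelihood logit fit runs into the same wall (quasi-complete separation): the coefficient for such a category drifts toward infinity, which is why software reports huge or missing standard errors.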
Not sure what you mean by a "cut-off point for groups"; are you considering combining categories? Because that is one thing you could do. eg if you have 8 categories of happiness, you could combine extremely sad/very sad/sad. I'd suggest combining categories in a way that makes sense, rather than to achieve the objective of getting more N.

eg combining extremely sad/very sad/sad may be good because there is not much difference between those 3 categories, and they are clearly separate from the other categories. Whereas combining slightly sad/meh/slightly happy because individually they only have N=13, 1, 10 but together they'd have "enough" is not as good, because you'd be combining a degree of "sad" with a degree of "happy".
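Recoding along these lines is a one-liner; the sketch below uses hypothetical happiness labels (not from any real dataset) and collapses only the substantively similar "sad" levels:

```python
# Hypothetical happiness responses; labels are illustrative only.
responses = ["extremely sad", "very sad", "sad", "meh", "slightly happy", "happy"]

# Collapse the three similar "sad" levels into one, rather than merging
# across the sad/happy divide just to boost N.
collapse = {"extremely sad": "sad", "very sad": "sad"}
recoded = [collapse.get(r, r) for r in responses]

print(recoded)  # ['sad', 'sad', 'sad', 'meh', 'slightly happy', 'happy']
```

In pandas the same thing is `Series.replace(collapse)`; either way, the grouping decision should come from the meaning of the levels, not the cell counts.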
Thanks for your reply. Yes, the largest group (n=8905) is the reference group.

By 'cut-off point' I meant a limit to how small the groups can be. I have not found any information on this yet, so I'm hoping it is not a big issue, although I can imagine it may produce large confidence intervals for the estimate.

Yes, there is a slightly different approach I can take which reduces the groups to 4. I'm looking at a binary variable across three time sweeps.


Fortran must die
I have never seen anything on the required sample size of one category. If a very high percentage of a dummy variable (say 90 percent) is in one category, it will make your slopes smaller than they should be. This is essentially a lack of variation in a variable, but it is a particularly severe problem in a dummy variable. The key, though, is the percent in each group, not so much the n.
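The percent-in-each-group point is easy to check directly from a tabulation; a minimal sketch with made-up counts (not the poster's data):

```python
# Hypothetical dummy-variable counts; the concern is the share in each group,
# not the raw n.
counts = {"in_category": 7700, "not_in_category": 800}
total = sum(counts.values())
shares = {k: round(100 * v / total, 1) for k, v in counts.items()}

print(shares)  # {'in_category': 90.6, 'not_in_category': 9.4}
```

A 90/10 split like this leaves little variation in the dummy, whatever the overall sample size.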
Yes, the reference category has a large percentage (82.7%) in one case, and this is for a categorical variable (5 categories) rather than a dummy variable. The funny thing is that I am getting statistically significant results for the smaller categories, exactly in line with my hypothesis.
I have not come across any tests I can do to check that the proportions across categories are adequate. However, there is some missing data, so I'm looking to impute the values, which may make the proportion in the reference category smaller.


Less is more. Stay pure. Stay poor.
You should look up "overfitting" and "overparameterization". It sounds like you have a fairly large sample size, but reviewing these topics may help you understand the repercussions and concerns. A general rule of thumb used by some people for binary (dependent) regression is 10-20 observations in the smaller of the two binary outcome groups per independent variable. This would include all candidate and interaction variables tested. A nice short subsection on this is available in Frank E. Harrell, Regression Modeling Strategies. I agree that I have not heard of any subgroup sample size restrictions, but I regularly wonder about it, because how representative can a few individuals be of an entire subgroup?
Thank you so much! I have got the book out from the library and will have a read.

I've found this paper on overfitting, which explains how it may create 'over-optimistic' results: "What you see may not be what you get: A brief, nontechnical introduction to overfitting in regression-type models"

I've come across 'number of events per variable' (EPV). From my understanding, this means the number of events of interest in your outcome variable divided by the number of explanatory variables. They recommend 10-20 EPV. But yes, I haven't found much on actual sample sizes within the explanatory variables. And I guess each non-reference category of a categorical variable would count as a separate variable, so it's better to collapse categories where you can?!
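The EPV check described above is simple arithmetic; the sketch below uses illustrative numbers (not the poster's actual event count or model):

```python
def events_per_variable(n_events, n_nonevents, n_parameters):
    """EPV = size of the SMALLER outcome group / candidate model parameters.
    Each non-reference level of a categorical covariate counts as one parameter."""
    return min(n_events, n_nonevents) / n_parameters

# e.g. a hypothetical 850 events out of 8,537, with one 8-level categorical
# (7 dummies) plus 3 other covariates = 10 parameters
epv = events_per_variable(850, 8537 - 850, 7 + 3)

print(epv)  # 85.0 -> comfortably above the 10-20 guideline
```

Note how the 8-level covariate alone contributes 7 of the 10 parameters, which is exactly why collapsing categories helps the EPV budget.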

Thanks again for your help.