[Stata - Regression with categorical variables correlated] Multicollinearity

#1
Hello!

I'm currently working on my thesis and I have a problem. I'm doing a regression where many variables are categorical ones (profession, neighbourhood, religion, etc).

In my regression, I've already omitted one category for each categorical variable (which is needed to avoid collinearity), but when I run the regression, Stata still omits other variables because of collinearity (maybe with other variables)...

The problem is that normally you interpret the regression coefficient of one category in comparison with the reference category (deliberately omitted). But now, if several categories of my variables are omitted, I no longer know how to interpret the coefficients.

I've been looking for days on the internet but found nothing about that.

Thanks for your help!!
 
#2
The reference group becomes the joint group of all categories that are omitted. Suppose you have a variable 'education' with 3 categories: low, middle and high. If you omit 'low' (your base category) and the software also drops 'middle' to avoid collinearity, then the estimated coefficient for 'high' would be the effect of being highly educated compared to not being highly educated (i.e. the groups 'low' and 'middle' together).
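
To make the "joint reference group" idea concrete, here is a minimal sketch in plain Python (not Stata). The data are made up, and it assumes the simplest case: the only regressor left is the 'high' dummy, with both 'low' and 'middle' omitted. In that case the OLS coefficient on 'high' is just the mean outcome of the high group minus the mean of everyone else pooled together.

```python
from statistics import mean

# Hypothetical toy data: (education level, outcome) pairs, invented for illustration.
data = [
    ("low", 10.0), ("low", 12.0),
    ("middle", 14.0), ("middle", 16.0), ("middle", 18.0),
    ("high", 20.0), ("high", 24.0),
]

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

y = [value for _, value in data]

# Only the 'high' dummy enters the model; 'low' and 'middle' are both omitted.
high = [1.0 if level == "high" else 0.0 for level, _ in data]
coef_high = ols_slope(high, y)

# Compare with the raw gap between 'high' and the pooled 'low' + 'middle' group.
mean_high = mean(v for lvl, v in data if lvl == "high")
mean_rest = mean(v for lvl, v in data if lvl != "high")
gap = mean_high - mean_rest

print(coef_high, gap)  # the two numbers agree up to floating-point rounding
```

With other covariates in the model the coefficient is an adjusted difference rather than a raw gap in means, but the comparison group is still the pooled omitted categories.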

You may also want to try to avoid the situation where the software automatically drops variables. This usually results from a low number of observations in that category, which can be avoided by joining some categories together before running the regression.

Hope this helps!
 
#3
Thanks for your answer!!

I get your point, but it's a bit more tricky for my categories since they are very qualitative.

For instance, suppose my dependent variable is funeral expenses and one of my explanatory variables is profession: farmer, merchant, civil servant and retired. What if I have to omit both civil servant and retired? If the regression coefficient is negative, can I say that farmers spend less on funerals than civil servants and retired people?

Besides, maybe only a few professions are significant, so if I have to omit those, the other professions might not be significant and then no conclusion could be drawn...

Thank you in advance :)
 
#5
I would recode the variables to join two categories together and call the new group 'other'. But think about how best to do this. Start by looking at the frequencies in each group: say there are very few civil servants in your sample, I would group those observations together with one of the other groups. If you want to test whether farmers spend less than retired, then I would join the civil servants together with the merchants (= new group 'other') and make the farmers (or the retired, that is equivalent) the omitted base category.
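
The steps above can be sketched roughly as follows, again in plain Python rather than Stata and with invented numbers. The sample, the `recode` mapping and the helper `group_mean` are all hypothetical. The sketch assumes profession is the only regressor, so each dummy coefficient equals that group's mean minus the base group's mean.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy sample: (profession, funeral expenses), invented for illustration.
sample = [
    ("farmer", 100.0), ("farmer", 120.0), ("farmer", 110.0),
    ("merchant", 150.0), ("merchant", 170.0),
    ("civil servant", 200.0),
    ("retired", 130.0), ("retired", 140.0), ("retired", 150.0),
]

# Step 1: look at the frequencies in each group to spot sparse categories.
counts = Counter(profession for profession, _ in sample)
print(counts)

# Step 2: join the sparse group (civil servants) with merchants into 'other'.
recode = {"civil servant": "other", "merchant": "other"}
recoded = [(recode.get(profession, profession), value) for profession, value in sample]

def group_mean(rows, group):
    """Mean outcome of one group."""
    return mean(value for profession, value in rows if profession == group)

# Step 3: make farmers the omitted base category. With dummies only, each
# coefficient is the group's mean minus the base group's mean.
base = "farmer"
base_mean = group_mean(recoded, base)
groups = sorted({profession for profession, _ in recoded})
effects = {g: group_mean(recoded, g) - base_mean for g in groups if g != base}

print(effects)  # 'retired' is now compared directly with 'farmer'
```

In Stata itself, the same steps would typically be done with `tabulate` to inspect the frequencies and `recode` to merge the groups before running the regression.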