Dummy variable question

noetsi

Fortran must die
#1
I have 10 dummy variables that show what job you close with. There are far more than 10 types of jobs you could close in; we were only interested in these 10. That is, there is a categorical variable with 23 levels, and we are including only 10 of the possible levels as dummies (1 = you are in this job, 0 = you are not). I am not sure whether I have to leave one of these levels out, as you would when coding, say, a variable with 3 levels (you would code only 2 of the levels). Here we are already leaving out many levels (I am not sure if that is acceptable).

Another question is about interpretation of the slopes. When you code dummy variables in this fashion, is it any different from coding every level of a categorical variable and leaving one level out (other than that, obviously, no single level is being left out as a reference level)?
 

obh

Active Member
#2
Hi Noetsi :)

I assume the data will include all 23 categories.
If so, you are grouping them into 11 categories: 1, 2, ..., 9, 10 and "other". So you need 10 dummy variables, and when all of them are 0 it means "other".
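
As a rough illustration of that grouping, here is a minimal Python sketch. The job labels ("job1" to "job23") and the helper name `dummy_row` are made up for the example and are not from the actual data; the point is only that the 13 uncoded levels collapse into an all-zero row that acts as the reference.

```python
# Hypothetical labels: "job1".."job10" are the 10 levels of interest;
# any other label stands in for one of the 13 remaining levels.
KEPT_LEVELS = [f"job{i}" for i in range(1, 11)]

def dummy_row(job):
    """Return 10 indicators; all zeros means the job falls in 'other'."""
    return [1 if job == level else 0 for level in KEPT_LEVELS]

sample = ["job1", "job7", "job15", "job10", "job23"]
rows = [dummy_row(job) for job in sample]

for job, row in zip(sample, rows):
    label = "other (reference)" if sum(row) == 0 else "kept level"
    print(job, row, label)
```

Under this coding, and assuming the model has an intercept, each of the 10 slopes would be read as the difference between that job and the pooled "other" group, holding any other predictors fixed.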
 

noetsi

Fortran must die
#3
No, the data does not include 13 of the 23 levels. They chose the 10 most important levels only. My question is, in large part, do the other 13 categories need to be included in the data as a dummy variable? Or can I just leave them out and have no reference level?
 

noetsi

Fortran must die
#5
Yes. They did not even have those levels in the data, let alone the regression. The federal Department of Labor does essentially the same thing with its regression models. They show the percent of the total work force in certain jobs in a series of dummy variables, and the variables they have in the model only add up to about 77 percent of the total work force.
 

obh

Active Member
#6
If you do research over 3 cities, A, B and C, how many dummy variables will you use? 2. But aren't there another 2000 cities in the country?
So in your case, why not use 9 dummy variables?
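
A small sketch of why one level has to go when the data contain only the coded levels. It uses the 3-city example with made-up observations, and `design_row` is a hypothetical helper: if every level present in the data gets its own dummy, the dummies in each row add up to the intercept column, so the model cannot separate them.

```python
LEVELS = ["A", "B", "C"]  # the 3 cities; stand-ins for the 10 jobs

def design_row(city, levels):
    """Intercept followed by one indicator per listed level."""
    return [1] + [1 if city == lvl else 0 for lvl in levels]

data = ["A", "B", "C", "A", "C"]  # made-up observations

full = [design_row(c, LEVELS) for c in data]          # all 3 dummies
reduced = [design_row(c, LEVELS[:-1]) for c in data]  # drop "C" as reference

# With every level coded, the dummies in each row sum to the intercept (1),
# so the intercept is an exact linear combination of the dummy columns.
full_collinear = all(sum(row[1:]) == row[0] for row in full)

# With "C" dropped, rows for city C are all zeros, so that dependency is gone.
reduced_collinear = all(sum(row[1:]) == row[0] for row in reduced)

print(full_collinear, reduced_collinear)
```

The same reasoning would apply to 10 jobs: with an intercept and data restricted to those 10 jobs, 9 dummies are enough, and the omitted job becomes the reference level.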

Independently, why not use all the data, with "other" as I suggested before? In this case you will have more data for the regression, but you won't have more than 10 dummy variables.
 

noetsi

Fortran must die
#7
The major reason is that it would take too much time to change the code that pulls in the data and then run the regression :) As long as the parameters of the 10 variables I pull in are not biased, I don't really care at all about the other 13.
 

obh

Active Member
#8
Hi Noetsi,

Since when do we let time interfere with our statistics ;)

I think it should not be a problem; per my common sense, it is as if you did the research over only the 10 jobs.
(Unless you didn't "cut" the jobs well, and the same person may choose job A (from the 10) or job U from the other jobs, depending on his character ...)

So practically I don't see any problem.
 

noetsi

Fortran must die
#9
Since I do it for a living (or ultimately get frustrated doing something multiple times) :)

The example you gave makes sense.