Dummy variables

noetsi

Fortran must die
#1
The federal government has created dummy variables (or at least that is my agency's interpretation) that allow someone to be in more than one level (more than one of the dummy variables for a category). I have no authority to change this. I know that violates the regression assumptions, but I do not know how it will affect the regression effect sizes.
 

hlsmith

Less is more. Stay pure. Stay poor.
#2
Can you give an example? So could someone have a gender male and female? Just making sure given your use of the word level.
 

noetsi

Fortran must die
#3
No, it has to do primarily with education, which has 6 levels. If you have a master's degree, for example, you can be coded 1 for graduate degree, 1 for BA, and so on down the line, so you show up in all six dummy variables for education. That is, you can be coded 1 in every level of education.
 

hlsmith

Less is more. Stay pure. Stay poor.
#4
But can't you correct that, so if they have a graduate degree you just assign them 0's for the other lower level categories?
 

noetsi

Fortran must die
#5
I did that in my analysis, hlsmith (exactly that). But the federal government does not: you can be in more than one level in their dummy variables, and ultimately we have to run their model. So my question is what difference it makes to the analysis. Dason suggested, contrary to what I thought, that it does not matter that you can be in more than one level. But does it have an impact on the analysis?
 

hlsmith

Less is more. Stay pure. Stay poor.
#6
Yeah, I am with you that it seems wrong. What if you were comparing people's income based on education? Say you had 15k people whose highest degree is undergrad and 5k with grad degrees; are you saying the gov would compare 20k undergrad vs 5k grad salaries? That would obviously be wrong. What is the actual model they use?
 

noetsi

Fortran must die
#7
Here is an example:

Say someone has a master's degree. They look at graduate degrees (one dummy), BA (a second), AA degrees (a third), some college with no degree (a fourth), and high school (a fifth; I am making this simpler than their system).

Say you got an MA, a BA, earned an AA and graduated from high school. It would code you 1 in every one of these dummies.

I changed my code so that if you get an MA it won't show you in BA and below (and with a BA it won't show you below that level, and so on). So it essentially analyzes the value of the highest degree, I think. But I am not sure what their model is actually estimating.
 

hlsmith

Less is more. Stay pure. Stay poor.
#8
I pretty much follow what you are writing. I would run things both ways to show there is a difference and they are asking two different questions.
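A minimal sketch of running it both ways, on simulated data rather than the federal data. It assumes the coding is strictly nested as described above (anyone coded 1 at a level is also coded 1 at every lower level); the level ordering, the group means, and the `ols` helper are made up for illustration. Under that assumption the two sets of dummies are linear transformations of each other, so the fitted values are identical and only the meaning of the coefficients changes: with highest-degree-only coding each coefficient is the gap from the reference group, while with cumulative coding each coefficient is the increment over the next lower level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical highest level attained: 0 = reference (below high school),
# 1 = high school, 2 = some college, 3 = AA, 4 = BA, 5 = graduate degree.
highest = rng.integers(0, 6, size=2000)
assumed_means = np.array([20.0, 28.0, 31.0, 35.0, 48.0, 60.0])  # made-up outcome means
y = assumed_means[highest] + rng.normal(0.0, 5.0, size=highest.size)

# Highest-degree-only coding: exactly one dummy is 1 (none for the reference).
exclusive = np.column_stack([(highest == k) for k in range(1, 6)]).astype(float)

# Cumulative coding: 1 for the highest level and for every level below it.
cumulative = np.column_stack([(highest >= k) for k in range(1, 6)]).astype(float)


def ols(X, y):
    """Least-squares fit with an intercept; returns coefficients and fitted values."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta, X1 @ beta


b_exc, fit_exc = ols(exclusive, y)
b_cum, fit_cum = ols(cumulative, y)

print("exclusive coefficients :", np.round(b_exc[1:], 2))
print("cumulative coefficients:", np.round(b_cum[1:], 2))
print("running sum of cumulative:", np.round(np.cumsum(b_cum[1:]), 2))
print("same fitted values:", np.allclose(fit_exc, fit_cum))
```

If the real data are not strictly nested (for example, a BA holder coded 0 on the AA dummy), this equivalence need not hold, and the two models can fit differently.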
 

noetsi

Fortran must die
#9
Using the Linear Probability Model to Estimate Impacts on Binary Outcomes in Randomized Controlled Trials
https://opa.hhs.gov/sites/default/files/2020-07/lpm-tabrief.pdf

Is it true that the criticisms of linear probability models don't apply with binary predictors? All the predictors I care about in my present model are dummy variables. I don't think non-linearity actually exists with dummy variables and that is one of the major critiques of linear probability models (heteroskedasticity is still an issue, but not a concern to me since I have the population. I don't care about p values at all). Some predictors in the model are interval, but I am not analyzing them yet.
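A small sketch of the single-dummy case, using simulated data with assumed probabilities (0.05 and 0.15 are made up). With one binary predictor the linear probability model is saturated: the intercept is the observed proportion in the 0 group and the slope is the difference in proportions, so the fitted values are just the two group proportions and cannot leave [0, 1].

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

d = rng.integers(0, 2, size=n).astype(float)   # hypothetical dummy predictor
p_true = np.where(d == 1, 0.15, 0.05)          # assumed true probabilities
y = (rng.random(n) < p_true).astype(float)     # binary outcome

# Linear probability model: ordinary least squares of y on an intercept and d.
X = np.column_stack([np.ones(n), d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

p0 = y[d == 0].mean()   # observed proportion when d = 0
p1 = y[d == 1].mean()   # observed proportion when d = 1

print("intercept vs p0:", beta[0], p0)
print("slope vs p1 - p0:", beta[1], p1 - p0)
print("fitted range:", fitted.min(), fitted.max())
```

This guarantee is for a saturated model. With several dummies entered additively (no interactions), the fitted value for a particular combination of dummies can still fall below 0 or above 1, so it is worth checking the range of fitted values in the actual model.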
 

noetsi

Fortran must die
#10
Paul Allison's take on this, in part:

• Heteroscedasticity is easily fixed with robust standard errors.
• Non-normality is a trivial problem with moderate to large size samples.
• The most intractable problem has been non-linearity, manifest by predicted probabilities greater than one or less than zero

This third problem is, I think, essentially about non-linearity, which I don't think applies to dummy variables.

It is a concern for me because nearly all my effect sizes for the LPM are below .2, which is often where concerns are raised about them.
 

noetsi

Fortran must die
#11
I found this argument, which takes a different view:
"For many it may come as a surprise to find that the variable sex, with categories ‘male’ and ‘female’ is not a nominal variable. The simple reason is that it contains only two categories and this makes it formally an interval/ratio variable."
https://arxiv.org/ftp/arxiv/papers/1511/1511.05728.pdf

I always thought of dummy variables as ordinal. This matters because I assumed, wrongly, that while interval predictors can be non-linear, ordinal predictors are inherently linear (except they are not, I now realize).

"Dummy variables meet the assumption of linearity by definition, because they create two data points, and two points define a straight line. There is no such thing as a non-linear relationship for a single variable with only two values."

https://www.researchgate.net/post/Check-linearity-between-the-dependent-and-dummy-coded-variables

So an ordinal variable with more than two levels could be non-linear....

But they make a good point that you generate a mean difference with a dummy variable, and ordinal variables cannot have that. Of course, that assumes what you predict is interval, I think (or you're making the assumptions of linear probability models).
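To illustrate the point that a predictor with more than two ordered levels can be non-linear, here is a toy sketch with made-up numbers. It compares entering a three-level variable as a single numeric score (0, 1, 2) against entering it as two dummies. The dummy version reproduces each level's mean exactly; the single-score version forces equal steps between levels and misses them when the real steps are unequal.

```python
import numpy as np

# Made-up data: three ordered levels, four observations each.
score = np.repeat([0, 1, 2], 4).astype(float)
y = np.array([10, 11, 9, 10,      # level 0
              12, 13, 11, 12,     # level 1
              30, 29, 31, 30],    # level 2
             dtype=float)

cell_means = np.array([y[score == k].mean() for k in (0, 1, 2)])

# (a) Treat the variable as one numeric score: a single straight line.
X_lin = np.column_stack([np.ones_like(score), score])
b_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
fit_lin = np.array([b_lin[0] + b_lin[1] * k for k in (0, 1, 2)])

# (b) Treat it as two dummies with level 0 as the reference.
X_dum = np.column_stack([np.ones_like(score), score == 1, score == 2]).astype(float)
b_dum, *_ = np.linalg.lstsq(X_dum, y, rcond=None)
fit_dum = np.array([b_dum[0], b_dum[0] + b_dum[1], b_dum[0] + b_dum[2]])

print("level means      :", cell_means)
print("numeric-score fit:", np.round(fit_lin, 2))
print("dummy-coded fit  :", np.round(fit_dum, 2))
```

With only two levels the two approaches coincide, which is the sense in which a single dummy is linear by definition.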
 

noetsi

Fortran must die
#12
How do you interpret dummy variables when this occurs for one or more variables (that is, they cannot reasonably take on a value of 0)? I know this will not, of course, apply to dummies, but it could apply to other predictors.