# Linear probability model

#### noetsi

##### Fortran must die
I have a linear probability model where the effect sizes are almost always below .2. I had read that this could be a sign that the model was not working correctly. Does anyone know? As always, I have the entire population.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
@noetsi I'm not versed in LPMs, but I have no idea why that would be true. Doesn't an effect of .2 just translate to the non-reference group having a 20 percentage point higher probability of the outcome? What would linearity have to do with this, and aren't all your predictors categorical?
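To make that reading concrete, here is a minimal sketch with made-up data (the group sizes and outcomes are assumptions, not anything from the actual model). With a single binary predictor, the OLS slope on a 0/1 outcome is just the difference between the two group proportions:

```python
# Hypothetical data: 10 people in the reference group (x = 0) and
# 10 in the comparison group (x = 1); y is a binary outcome.
x = [0] * 10 + [1] * 10
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0,   # reference group: 3 of 10 have the outcome
     1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # comparison group: 5 of 10 have the outcome

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares with one predictor: slope = cov(x, y) / var(x).
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
var_x = sum((xi - mean_x) ** 2 for xi in x)
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x

# The intercept is the reference group's proportion; the slope is the
# difference in proportions between the two groups.
print(round(intercept, 3), round(slope, 3))
```

Here the intercept comes out at 0.3 and the slope at 0.2, i.e. the comparison group is 20 percentage points more likely to have the outcome.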

How big of an effect do you want and expect? 0.2 seems high to me!

#### noetsi

##### Fortran must die
All I know is that once upon a time someone told me values below .2 and above .8 were problems. I thought they were talking about effects. I don't work with LPMs (I have not until now, anyway; I have to learn). I know very little about them.

I thought values below .2 were really small. Shows my inexperience. All the variables I care about are categorical, so they have a linear relationship with the DV. Not sure if that alleviates the problem. Historically I ran logistic regression, but the federal government decided on LPM.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
0.2 and 0.8 come into play when using OLS for percentages; I'm thinking this is an analogous warning rule.
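One common reading of that rule of thumb (an assumption on my part about what the original advice meant) is that the logistic curve is close to a straight line while the probabilities stay between roughly 0.2 and 0.8, and bends away from it outside that range. A rough sketch comparing the logistic curve to its tangent line at p = 0.5:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def tangent_line(z):
    # Straight line tangent to the logistic curve at p = 0.5 (z = 0);
    # the slope of the logistic curve there is 1/4.
    return 0.5 + z / 4.0

for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    z = logit(p)
    gap = tangent_line(z) - logistic(z)
    print(f"p = {p:.2f}  straight line = {tangent_line(z):.3f}  gap = {gap:.3f}")

gap_at_80 = tangent_line(logit(0.8)) - 0.8
gap_at_95 = tangent_line(logit(0.95)) - 0.95
```

At p = 0.8 the straight line is off by less than 0.05, while at p = 0.95 it is already predicting a "probability" above 1. The rule is about predicted probabilities, not about the size of the coefficients.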

#### noetsi

##### Fortran must die
We want to know how well our geographic units are performing, so we decided to go with fixed effect regression. The question here is: do we have to eliminate one of our geographic units and make it a reference level, or can we run one dummy for every level?
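A small sketch of why this question comes up at all, using three hypothetical units A, B and C (made-up labels). With an intercept plus a dummy for every level, the dummy columns add up to the intercept column, so the design matrix loses a rank and the coefficients are not uniquely determined:

```python
def rank(matrix, tol=1e-9):
    """Rank of a small matrix by Gaussian elimination (fine for a demo)."""
    rows = [list(map(float, r)) for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0
    for c in range(n_cols):
        # Pick the row at or below r with the largest entry in column c.
        pivot = max(range(r, n_rows), key=lambda i: abs(rows[i][c]))
        if abs(rows[pivot][c]) < tol:
            continue  # no usable pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, n_rows):
            factor = rows[i][c] / rows[r][c]
            for j in range(c, n_cols):
                rows[i][j] -= factor * rows[r][j]
        r += 1
        if r == n_rows:
            break
    return r

units = ["A", "A", "B", "B", "C", "C"]   # hypothetical geographic units
levels = ["A", "B", "C"]

# One dummy column per level.
all_dummies = [[1.0 if u == lvl else 0.0 for lvl in levels] for u in units]

# Intercept plus every dummy: 4 columns.
with_intercept_all = [[1.0] + row for row in all_dummies]

# Intercept plus dummies for B and C only (A is the reference): 3 columns.
with_intercept_drop_one = [[1.0] + row[1:] for row in all_dummies]

print(rank(with_intercept_all), len(with_intercept_all[0]))
print(rank(with_intercept_drop_one), len(with_intercept_drop_one[0]))
print(rank(all_dummies), len(all_dummies[0]))
```

The first design has four columns but only rank 3, so something has to be dropped or constrained. Dropping one dummy (a reference level) or dropping the intercept and keeping every dummy both give a full-rank design.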

#### Dason

You're using SAS or R, right? Either should take care of that for you...

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Yeah, I would imagine you have to select one for the base case. I would imagine you would want to do that yourself, such as picking some panhandle scrap area for the comparison. I took a geospatial workshop a few years ago and I felt like you could create some type of geo term and then insert it into a regression model. I knew I should have applied the concept and taken notes so I would remember what was up with that.

#### noetsi

##### Fortran must die
SAS does not automatically remove these types of units (it does not realize they are part of a larger categorical variable even though each dummy is individually put in the class statement). It does not know to do so. In fact the model ran with every unit in it when I made that mistake.

If I understand hlsmith I do have to remove one, but it really does not matter much which one. I was trying to figure out a way I could create an artificial unit that was the average of the others to compare to, but given that there are actual cases associated with these I do not know how. Or if that would even be valid. I have never seen someone average that way to create an artificial level for comparison of a categorical variable.
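One standard way to compare against an average rather than against a real unit is effect (sum-to-zero) coding, where each coefficient is a deviation from the unweighted mean of the unit means. The sketch below is only an illustration under assumptions: the units, outcome values and helper functions `solve` and `ols` are made up, and the model contains nothing but the unit factor.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda i: abs(m[i][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for i in range(n):
            if i != c:
                factor = m[i][c] / m[c][c]
                m[i] = [vi - factor * vc for vi, vc in zip(m[i], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical outcome values for three units of unequal size.
data = {"A": [10, 14], "B": [17, 19, 21], "C": [28, 30, 32, 34, 36]}

# Effect coding: the last unit (C) is coded -1 on every column.
codes = {"A": [1, 0], "B": [0, 1], "C": [-1, -1]}

X = [[1] + codes[u] for u, ys in data.items() for _ in ys]
y = [v for ys in data.values() for v in ys]

intercept, dev_a, dev_b = ols(X, y)
dev_c = -(dev_a + dev_b)   # the omitted unit's deviation is implied

print(round(intercept, 3), round(dev_a, 3), round(dev_b, 3), round(dev_c, 3))
```

The unit means here are 12, 19 and 32, so the intercept is their unweighted average (21) and the deviations are -9, -2 and +11. No real unit has to serve as the baseline, and every unit gets its own number.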

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Well, don't put them into the model as dummies - put them in as a categorical so you have control. Which one is the reference group is important. If you pick the area that is, say, average in the outcome, you will likely fail to find that the others are different from it. So if I have groups for how much a person smokes, and I used the middle group as the reference while examining cancer rates, I may not find much. But if I use the non-smokers or rare smokers, I will be able to discern the dose response or a difference.

#### noetsi

##### Fortran must die
I am not sure what you mean by categorical. I put them in PROC GENMOD as classification variables.

I have no real theory to choose here (this has literally never been done that I can find in the literature). Should I avoid a unit that has only a very few cases? I chose one at random, but it turned out to have only 8 cases compared to hundreds for other units.
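A rough way to see what a tiny reference unit does: every coefficient is a difference from the reference, so the uncertainty in the reference unit's own mean leaks into all of them. The sketch below uses the textbook standard error of a difference between two group means with a common standard deviation; the value of `sd` and the comparison unit size of 300 are assumptions for illustration.

```python
import math

def se_of_difference(sd, n_unit, n_ref):
    # Standard error of a difference between two group means,
    # assuming a common outcome standard deviation `sd`.
    return sd * math.sqrt(1.0 / n_unit + 1.0 / n_ref)

sd = 0.5  # hypothetical: SD of a binary outcome with a 50% rate

# Same comparison unit (300 cases), two different reference units.
se_small_ref = se_of_difference(sd, n_unit=300, n_ref=8)
se_large_ref = se_of_difference(sd, n_unit=300, n_ref=300)

print(round(se_small_ref, 3), round(se_large_ref, 3))
```

With an 8-case reference unit the standard error is more than four times what it would be with a 300-case reference, and that applies to every unit's coefficient at once.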

#### noetsi

##### Fortran must die
This made me feel less stupid:

When you put an indicator variable in a regression model, there are two things you must always keep in mind about interpreting the coefficients associated with the indicator variable:
1. The coefficient on an indicator variable is an estimate of the average DIFFERENCE in the dependent variable for the group identified by the indicator variable (after taking into account other variables in the regression) and
2. the REFERENCE GROUP, which is the set of observations for which the indicator variable is always zero.
If you always remember that the coefficient on an indicator variable is an estimate of a DIFFERENCE with respect to a REFERENCE GROUP (also sometimes referred to as the “omitted category”), you’re 90% of the way to understanding indicator variables.
I recognize this may feel obvious, but trust me: I’ve literally reviewed papers from tenured faculty at major Universities that get this wrong. This is something people get confused about constantly, so I promise it’s worth this treatment.

lol

#### fed2

##### Active Member
I'd have to agree with hlsmith: if I had these dummy vars, I would put them in as a categorical so you have control, i.e. recode them as a single class variable and put it in the CLASS statement. Sounds like a nightmare trying to code it yourself; I would not trust myself to do this!

#### noetsi

##### Fortran must die
I am not sure what it means to be a categorical variable as hlsmith means it. I need to show the impact of each unit, so one variable is really not an option if I understand what that means. I want to show the relative performance of the units. In honesty, I am not used to fixed effect regression. I have a dummy for each unit with an effect size (sometimes the variable being predicted is binary and sometimes interval). In either case, is it valid to list their effect sizes from most positive to most negative, showing which unit is generating higher income and a higher percentage chance of being employed?
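On the ordering question, a minimal sketch under assumptions: the unit names and employment rates are made up, and the model contains only the unit dummies, so each coefficient is simply that unit's mean minus the reference unit's mean. Changing the reference then shifts every coefficient by the same constant and leaves the ordering alone.

```python
# Hypothetical employment rates by unit.
unit_means = {"North": 0.62, "South": 0.48, "East": 0.55, "West": 0.41}

def coefficients(reference):
    # With only the unit factor in the model, each dummy coefficient is the
    # unit's mean outcome minus the reference unit's mean outcome.
    return {u: m - unit_means[reference] for u, m in unit_means.items()}

def ranking(coefs):
    # Units ordered from most positive to most negative coefficient.
    return sorted(coefs, key=coefs.get, reverse=True)

rank_vs_west = ranking(coefficients("West"))
rank_vs_east = ranking(coefficients("East"))

print(rank_vs_west)
print(rank_vs_east)
```

Both rankings come out the same, with the reference unit sitting in the list at a coefficient of 0. The sizes of the coefficients depend on the reference; the order does not.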

#### fed2

##### Active Member
Not sure either, I just sort of skimmed a bit and saw the words 'dummy' and 'categorical'...

#### hlsmith

##### Less is more. Stay pure. Stay poor.
In SAS you just put the variable with all of the groupings into the CLASS statement and the model, and select which category in the group will be the reference. As @fed2 noted, it seems like you are making a truckload of dummies, one for each category, and then just holding one out as the reference. I know you use SAS, and that seems tedious.

#### noetsi

##### Fortran must die
I am not creating the variables; they are built by the SQL code that others developed.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Gotcha. That makes sense.

#### noetsi

##### Fortran must die
It is the nature of what I do in a state agency. The variables, and the code for pulling them, commonly exist already, or I can get help from people who are extremely good at SQL. Also, most of what I do is SQL anyhow; only a tiny portion of my job involves statistics.