Linear Probability Models

noetsi

Fortran must die
#1
There is a dispute over whether OLS is valid when the dependent variable takes on 2 levels [in which case one can use a linear probability model] or whether one must use logistic regression. This seems central to the dispute to me [violations of normality and heteroscedasticity are inherent in the LPM, but I have whole populations so I am not as concerned about that].

"“These considerations suggest arule of thumb. If the probabilities that you’re modeling are extreme—close to 0or 1—then you probably have to use logistic regression. But if the probabilities are more moderate—say between .20 and .80, or a little beyond—then the linear and logistic models fit about equally well, and the linear model should be favored for its ease of interpretation.”

My question is: how do you know if most of the probabilities are between those values? Ideally I would like the answer in terms of how to find this in SAS, but I will take what I can get. :)
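
In case it helps frame the question, here is roughly how I imagine checking it in SAS -- just a sketch, where work.mydata, outcome, and x1-x3 are made-up placeholder names, not my actual data: fit the logistic model, save each case's predicted probability, and tabulate how many fall between .20 and .80.

proc logistic data=work.mydata;
    model outcome(event='1') = x1 x2 x3;    /* model P(outcome = 1) */
    output out=work.preds predicted=phat;   /* save each case's predicted probability */
run;

proc format;                                /* bins for the predicted probabilities */
    value probfmt low-<0.20  = 'below .20'
                  0.20-0.80  = '.20 to .80'
                  0.80<-high = 'above .80';
run;

proc freq data=work.preds;                  /* share of cases in each bin */
    tables phat;
    format phat probfmt.;
run;

If most of the cases land in the middle bin, the rule of thumb quoted above would seem to apply.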
 

noetsi

Fortran must die
#4
Being new to this I am confused. For each case there is a probability of it taking on a value of 1 and a probability of it taking on a value of 0 (the dv has two levels). When they say that the probabilities are largely between .2 and .8, are they talking about the probability of a 1, of a 0, or some combination? About 2/3 of the dv has a value of 0.
 

noetsi

Fortran must die
#6
I am not sure this is what is meant, but if I understand what they mean by most values being between a probability of .2 and .8, our data set does not seem to meet that requirement.

[attached image: 1561424327905.png]
 

noetsi

Fortran must die
#10
Thanks Dason. The issue will be that the federal government has chosen this model based on economic advice and has used it for 20 years. I have to show them very formally that it's wrong, if it is.
 
#13
Probability and odds have a nonlinear relationship.

http://www.talkstats.com/threads/nonlinear-odds-to-probs-conversion.73716/

Distortion (error, misunderstanding) is created when you force a linear relationship between odds and their implied probabilities, especially at the extremes ... which could be a significant factor in the 'longshot bias'.

[attached chart: 1562173447825.png]
Note: The absolute error -- the difference between the linear and nonlinear conversions -- appears to shrink as the underdog's odds increase, but the relative error goes through the roof.
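
For reference, the exact conversions are odds = p/(1 - p) and p = odds/(1 + odds). A quick sketch (the grid of probabilities below is arbitrary, purely for illustration) makes the nonlinearity easy to see -- the odds barely move at small p but blow up as p approaches 1:

data odds_vs_prob;
    do p = 0.05 to 0.95 by 0.10;
        odds   = p / (1 - p);        /* probability -> odds */
        p_back = odds / (1 + odds);  /* odds -> probability (the inverse map) */
        output;
    end;
run;

proc print data=odds_vs_prob noobs;
run;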
 

noetsi

Fortran must die
#14
One thing I find puzzling. I ran the data set addressed above (in part): about 30 predictors and a dependent variable with two levels, 0 and 1. I generated an odds ratio for each predictor and then ranked the predictors (most but not all of these were dummy variables) from highest to lowest odds ratio (for odds ratios below 1 I took the inverse of the odds ratio before ranking; otherwise an odds ratio of 1 would appear to show greater impact than one of .1, which is obviously incorrect).

Separately, I ran the same variables through linear regression (a linear probability model) and ranked them based on the absolute value of their slopes. I expected the relative rankings to be pretty close between the two approaches, but in fact there are some significant differences.

My guess is this is caused by the linear model using OLS while the logistic regression uses maximum likelihood. I cannot think of any other reason the rankings would be different.
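
For concreteness, this is roughly the comparison I ran, as a sketch -- work.mydata, outcome, and x1-x3 are placeholder names standing in for the real data set and its roughly 30 predictors:

proc logistic data=work.mydata;
    model outcome(event='1') = x1 x2 x3;
    ods output OddsRatios = work.or_est;            /* odds ratio estimates */
run;

proc reg data=work.mydata;
    model outcome = x1 x2 x3;                       /* linear probability model via OLS */
    ods output ParameterEstimates = work.lpm_est;   /* OLS slopes */
run;
quit;

/* Put each set on a comparable scale before ranking:
   invert odds ratios below 1, take absolute values of the slopes */
data work.or_rank;
    set work.or_est;
    size = max(OddsRatioEst, 1 / OddsRatioEst);
run;
proc sort data=work.or_rank; by descending size; run;

data work.lpm_rank;
    set work.lpm_est;
    if upcase(Variable) ne 'INTERCEPT';
    size = abs(Estimate);
run;
proc sort data=work.lpm_rank; by descending size; run;

The inversion of odds ratios below 1 is the same trick described above, so the two rankings are at least on a comparable footing before being compared.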