# How can this be true [linear versus logistic regression]

#### noetsi

##### Fortran must die
This comes from a government study that was used to generate something of great importance, funding and related factors for a major government program. It contradicts pretty much everything I have read in the last decade (and learned in class).

For simplicity and speed and because of the large number of models estimated, the models were estimated using linear probability models, even when the dependent variable was binary. Logit and probit estimation techniques are generally recommended for estimating equations with zero-one dependent variables. However, the authors of the methodology reported that using logit or probit made it more difficult to interpret the results and created some complexities in calculating adjustments.
Odds ratios and slopes from logistic models are harder to interpret, but fitting a linear model to a binary DV is simply wrong, or so I have always read.

For example, they stated that because logit and probit are non-linear models, the adjustment factor could not be calculated using sample means but rather required calculating probabilities for all observations using the full set of data.
I don't understand what this means. As far as I can tell, they estimated slopes for the variables and then applied those slopes to other data on the X variables to set the requirements agencies must meet. That is, in the first part they estimated the slopes, and in the second part they combined those slopes with current data on the IVs to estimate what the agency's goals [the DV] should be.
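
One possible reading of the quoted passage, sketched with made-up numbers: in a linear model the prediction at the sample mean of X equals the mean of the individual predictions, so an adjustment can be computed from sample means alone. With a logit that shortcut fails because the link is non-linear, so the probability has to be computed for every observation and then averaged. The coefficients and covariate values below are hypothetical, not taken from the study.

```python
import math

def logistic(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical covariate values and coefficients (illustration only).
x_values = [-3.0, -1.0, 0.0, 2.0, 5.0]
x_mean = sum(x_values) / len(x_values)

lpm_intercept, lpm_slope = 0.4, 0.05      # assumed linear probability model
logit_intercept, logit_slope = -1.0, 0.8  # assumed logit model

# Linear probability model: prediction at the mean vs mean of predictions.
lpm_at_mean = lpm_intercept + lpm_slope * x_mean
lpm_mean_of_preds = sum(lpm_intercept + lpm_slope * x for x in x_values) / len(x_values)

# Logit: the same two quantities, which no longer coincide.
logit_at_mean = logistic(logit_intercept + logit_slope * x_mean)
logit_mean_of_preds = sum(
    logistic(logit_intercept + logit_slope * x) for x in x_values
) / len(x_values)

print("LPM:  ", lpm_at_mean, lpm_mean_of_preds)
print("Logit:", logit_at_mean, logit_mean_of_preds)
```

The two LPM numbers agree, while the two logit numbers do not, which is presumably the "complexity in calculating adjustments" the authors refer to.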

Further, the argument was made that econometricians had shown that the drawbacks of using linear probability models, compared with logit and probit techniques, were minimal.
That is news to me. I have read the exact opposite.

#### noetsi

##### Fortran must die
They go on to say this (I think it concerns estimating the slopes used in the second stage, although I am not certain).

In order to test the sensitivity of the estimates to this simplification, both techniques for entered employment and retention performance measures for the WIA Adult program were estimated. The coefficients estimates were found to be quite similar if not virtually identical in most cases.
So why do we do logistic regression if, according to the US government, it makes no difference compared with linear regression for a binary DV?
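
A rough way to check a claim like this on simulated data: fit both models and compare the LPM slope with the logit's average marginal effect (AME), since both are on the probability scale. Raw logit coefficients are on the log-odds scale and are not directly comparable. It is an assumption that this is the comparison the study made, and the data-generating values below are invented for illustration. The logit is fitted with a hand-written Newton-Raphson loop to keep the sketch dependency-free.

```python
import math
import random

random.seed(0)

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Simulated data (hypothetical): a true logit with mid-range probabilities.
n = 5000
x = [random.uniform(-1.0, 1.0) for _ in range(n)]
y = [1 if random.random() < logistic(0.3 + 0.8 * xi) else 0 for xi in x]

# Linear probability model: OLS slope of y on x.
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
lpm_slope = s_xy / s_xx

# Logit fitted by Newton-Raphson on the two parameters (b0, b1).
b0, b1 = 0.0, 0.0
for _ in range(25):
    p = [logistic(b0 + b1 * xi) for xi in x]
    # Gradient of the log-likelihood.
    g0 = sum(yi - pi for yi, pi in zip(y, p))
    g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
    # Information matrix (negative Hessian), 2x2 and symmetric.
    w = [pi * (1.0 - pi) for pi in p]
    h00 = sum(w)
    h01 = sum(wi * xi for wi, xi in zip(w, x))
    h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = h00 * h11 - h01 * h01
    # Newton step: solve the 2x2 system (information matrix) * delta = gradient.
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

# Average marginal effect of x on P(y = 1) under the fitted logit.
p = [logistic(b0 + b1 * xi) for xi in x]
logit_ame = b1 * sum(pi * (1.0 - pi) for pi in p) / n

print("LPM slope:      ", lpm_slope)
print("Logit coef (b1):", b1)
print("Logit AME:      ", logit_ame)
```

In this setup, where the probabilities stay well away from 0 and 1, the LPM slope and the logit AME come out close to each other while the raw logit coefficient is several times larger. The agreement would be expected to worsen as probabilities approach the extremes.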

#### CowboyBear

##### Super Moderator
Try to think it through yourself instead of worrying about what authorities say. So to start:

When you have a binary DV, which assumptions of the linear OLS model are breached?
What properties of the OLS estimator are those assumptions required for?

#### rogojel

##### TS Contributor
hi,
from a practical POV, isn't the argument that in the middle range (probabilities relatively far from 0 or 1) OLS will perform well, the problem being that it can predict senseless values at the extremes?

regards
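
A tiny sketch of the "senseless values at the extremes" point, using ten invented observations: a straight line fitted by OLS to a 0/1 outcome gives a sensible fitted probability in the middle of the X range but goes below 0 and above 1 at the ends.

```python
# Hypothetical data: the outcome switches from mostly 0 to mostly 1 as x grows.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Ordinary least squares for a single predictor.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

def predict(x_new):
    """Fitted 'probability' from the linear probability model."""
    return intercept + slope * x_new

for x_new in (0, 4.5, 9):
    print(x_new, predict(x_new))
```

The fitted value at the centre of the data is a valid probability, while the fitted values at x = 0 and x = 9 fall outside the [0, 1] range.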

#### noetsi

##### Fortran must die
Try to think it through yourself instead of worrying about what authorities say. So to start:

When you have a binary DV, which assumptions of the linear OLS model are breached?
What properties of the OLS estimator are those assumptions required for?
I don't know all the violations, but two I remember. First the data will be always heteroscedastic. Second, nonsensical slopes can be found.

Since I don't consider myself particularly good at statistics, what experts say matters to me. And more to the point, this is not just a theoretical matter. It involves the setting of goals that my agency, and most DOL and DOE organizations, will have to meet, or there will be major consequences. So if the metrics were set wrong, presumably by real statisticians, that is sort of important.

#### hlsmith

##### Omega Contributor
They obviously trend in the same way. Deviating from logistic seems sketchy to me, in that you run the risk of model misspecification. They probably made that statement so everyone would be on the same "scale", so to speak, and to make it easy for those who are not familiar with logistic. Seems lazy, and if their staff can't run both, then maybe they aren't the right people. They just need to come up with boilerplate language on how to interpret both for the stats-illiterate people who use the results.

I bet it revolves around the difficulties of conveying results to politicians and them using the results.

#### noetsi

##### Fortran must die
The analysis is highly complex; these are clearly expert econometricians.

It appears that econometricians, some of them anyhow, have decided that since results from logit and OLS [linear probability models when predicting binary variables] are often very similar, it's OK to use OLS. Part of this apparently depends on the range of results you are estimating: the closer the results are to the extremes, the less well the linear probability model does. But in many cases you are not estimating extreme values, so that is not an issue. Second, they argue that the inherent heteroscedasticity can be eliminated with White SE [not sure that is true, but they believe it]. Finally, they argue that while linear probability models are sometimes wrong, so are logistic models [wrong, that is, in predicting binary variables even without nonsensical results, though this may also be about misspecification].
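
On the White SE point, a hedged sketch: for a 0/1 outcome the error variance is p(1 - p), which changes with X, so the classical OLS standard error rests on a false assumption. White (HC0) standard errors do not remove the heteroscedasticity itself; they estimate the slope's standard error without assuming a common variance. The simulation below uses invented parameters and a single predictor so the formulas can be written out by hand.

```python
import math
import random

random.seed(1)

# Simulated data (hypothetical): the true probability is linear in x,
# so the linear probability model is correctly specified here.
n = 4000
x = [random.random() for _ in range(n)]
y = [1 if random.random() < 0.1 + 0.8 * xi else 0 for xi in x]

# OLS fit with one predictor.
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
intercept = y_bar - slope * x_bar
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Classical SE: assumes one common error variance for all observations.
sigma2 = sum(e * e for e in resid) / (n - 2)
classical_se = math.sqrt(sigma2 / s_xx)

# White (HC0) SE: each squared residual stands in for its own error variance.
robust_se = math.sqrt(sum(((xi - x_bar) * e) ** 2 for xi, e in zip(x, resid))) / s_xx

print("slope:       ", slope)
print("classical SE:", classical_se)
print("White SE:    ", robust_se)
```

In this particular setup the variance is largest in the middle of the X range and smallest at the ends, so the robust SE comes out somewhat smaller than the classical one. In other designs the difference can go the other way.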

#### hlsmith

##### Omega Contributor
noetsi,

Do you have a link to the source of what you are referencing so we can better put it into context?

#### noetsi

##### Fortran must die
It is a PDF that was sent to me, so I have no link. This is the pertinent comment by the authors.

For simplicity and speed and because of the large number of models estimated, the models were estimated using linear probability models, even when the dependent variable was binary. Logit and probit estimation techniques are generally recommended for estimating equations with zero-one dependent variables. However, the authors of the methodology reported that using logit or probit made it more difficult to interpret the results and created some complexities in calculating adjustments. For example, they stated that because logit and probit are non-linear models, the adjustment factor could not be calculated using sample means but rather required calculating probabilities for all observations using the full set of data. Further, the argument was made that econometricians had shown that the drawbacks of using linear probability models, compared with logit and probit techniques, were minimal. In order to test the sensitivity of the estimates to this simplification, both techniques for entered employment and retention performance measures for the WIA Adult program were estimated. The coefficients estimates were found to be quite similar if not virtually identical in most cases.
I do have a link to the econometrics book that, according to the authors, establishes that linear probability models are satisfactory substitutes for logistic regression.

If you can use linear probability models for binary variables, why ever run logistic regression? Slopes in logistic regression are very difficult to interpret, you get no true R square, and many tests that exist for linear models [including diagnostics] do not exist for logistic regression.

#### hlsmith

##### Omega Contributor
Don't forget that binary outcomes can also be put on the risk scale and used for relative risks and risk differences. These allow you to calculate relative risk reduction, absolute risk reduction, number needed to harm, and number needed to treat (e.g., how many people you have to intervene on to get one more outcome of interest compared to the other group).
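
For concreteness, a minimal sketch of these risk-scale quantities computed from a 2x2 table. The counts are invented for illustration, with the event taken to be an adverse outcome.

```python
# Hypothetical 2x2 table: adverse events out of group totals.
events_treated, n_treated = 20, 100
events_control, n_control = 30, 100

risk_treated = events_treated / n_treated
risk_control = events_control / n_control

# Ratio and difference measures on the risk scale.
relative_risk = risk_treated / risk_control
risk_difference = risk_treated - risk_control

# Reductions attributable to treatment, and the number needed to treat.
relative_risk_reduction = 1.0 - relative_risk
absolute_risk_reduction = risk_control - risk_treated
number_needed_to_treat = 1.0 / absolute_risk_reduction

print("RR: ", relative_risk)
print("RD: ", risk_difference)
print("RRR:", relative_risk_reduction)
print("ARR:", absolute_risk_reduction)
print("NNT:", number_needed_to_treat)
```

With these counts, roughly ten people would need to be treated to prevent one additional adverse event.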

#### noetsi

##### Fortran must die
The problem with that is that I have not found a way to calculate relative risk in SAS, and I tried really hard to do so several years ago. Do you know a way to generate relative risk in SAS?

#### hlsmith

##### Omega Contributor
Yes, if I remember I will send links tomorrow. It likely uses the GLM procedure.

#### CowboyBear

##### Super Moderator
I don't know all the violations, but two I remember. First the data will be always heteroscedastic. Second, nonsensical slopes can be found.
What are the consequences of violation of the assumption of homoscedasticity? What other assumptions are there? Is it true that odds ratios are always harder to interpret than linear slopes? How might the usefulness of logistic vs linear regression differ depending on whether the goal is explanation or prediction?

Since I don't consider myself particularly good at statistics, what experts say matters to me
Basically I'm trying to get you to think things through critically yourself - you're perfectly capable of this. Simply asking what the experts conclude works only when they're all in agreement (i.e., never!). But we can critically evaluate the arguments being put forward by experts and think about when they are and aren't valid. That critical authority-questioning attitude is an essential part of a scientific mindset (regardless of where you're trying to do science).

#### noetsi

##### Fortran must die
In linear regression the consequence of heteroscedasticity is that the SEs are wrong and thus the statistical tests are unreliable. In my opinion, particularly for practitioners, a linear slope, where a change in X is associated with a change in Y, is much easier to interpret than the statement that the odds of something occurring increase (which is what odds ratios get at). I have never discussed odds ratios with a non-statistician who understood them. Admittedly that could be my fault, but in this case I doubt it. I only deal with prediction (or the relative importance of a given variable, which IMHO is much harder in logistic regression than in linear regression, particularly because there is little agreement on how to do it in logistic regression). One example of why linear regression is simpler/better: while there is a well-agreed-on measure of how much variance a linear model explains (R square or adjusted R square), there are something like 32 different pseudo R squares in logistic regression, none of which are easy to explain and none of which seem to have broad support in the statistical community.

I agree that statisticians never agree (not for long, anyway; even when there is agreement, it changes over time). This is the nature of academics (speaking as someone who spent much of his adult life in universities).
As a non-academic these days I want to do simple things: predict some result, or correctly show what has the greatest impact. Unfortunately, as you suggest, not only is there no agreement on most matters, there are surprisingly few tests of which method works best. They occur occasionally, for example in the M competitions for time series in the early 80's, but I have almost never run across cases where someone tested which method works better on simulated data. Sometimes, but not very often.

Critical thought, particularly in math based areas, hurts my head. That is why I am not in academics.... (well that and the fact that they hire about 7 of my specialty nationwide each year).

#### hlsmith

##### Omega Contributor
How might the usefulness of logistic vs linear regression differ depending on whether the goal is explanation or prediction?

I like this, though I will add that, using logistic regression regularly, I also translate results into probabilities very easily. You discuss R^2 regularly, but is that really the best measure when you have a binary outcome? In logistic regression, variance explained is not as intuitive as accuracy, i.e., what percent of the outcomes you are predicting correctly. I think R^2 gets stuck in the linear regression person's mind, but the c-statistic is the end-all measure of interest.
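
As a small illustration of the two measures mentioned here, the sketch below computes accuracy at a 0.5 cutoff and the c-statistic (the share of event/non-event pairs in which the event has the higher predicted probability, ties counted as half). The outcomes and predicted probabilities are invented, and `c_statistic` is a helper written for this example, not a library function.

```python
def c_statistic(y_true, scores):
    """Concordance: P(score of a random event > score of a random non-event)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    concordant = 0.0
    for s_pos in pos:
        for s_neg in neg:
            if s_pos > s_neg:
                concordant += 1.0
            elif s_pos == s_neg:
                concordant += 0.5  # ties count as half
    return concordant / (len(pos) * len(neg))

# Hypothetical outcomes and predicted probabilities from some fitted model.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.60, 0.70, 0.90]

# Accuracy depends on a cutoff (0.5 here); the c-statistic does not.
predicted = [1 if s >= 0.5 else 0 for s in scores]
accuracy = sum(int(p == y) for p, y in zip(predicted, y_true)) / len(y_true)
c_stat = c_statistic(y_true, scores)

print("accuracy:   ", accuracy)
print("c-statistic:", c_stat)
```

The c-statistic equals the area under the ROC curve, and it can be computed from the fitted probabilities of either a logit or a linear probability model.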

#### CowboyBear

##### Super Moderator
Critical thought, particularly in math based areas, hurts my head. That is why I am not in academics....
I know you're half kidding, but critical thinking is crucial for everyone who wants to find out stuff about the world, not just those in academia!!! :O