IV Regression with binary outcome


Phineas Packard
I have a binary variable y that I want to model using instrumental variable regression. Both X and the IV are continuous. ivpack in R does not seem to provide a way to account for a binary outcome. Is there a way to handle this in R?
Since ordinary least squares (OLS) is given by:

b = (X'X)^(-1) X'y

And since the instrumental variable estimator (per Wikipedia) is:

b_iv = (Z'X)^(-1) Z'y

(Where y is the dependent variable, X the matrix of explanatory variables, and Z the matrix of instrumental variables.)
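To make the two formulas concrete, here is a minimal sketch in Python/NumPy (the same matrix algebra ports directly to R's `solve()` and `%*%`). The data are simulated and purely hypothetical: an unobserved confounder `u` drives both `x` and `y`, the instrument `z` affects `y` only through `x`, and the true slope is set to 2. The outcome is continuous here because these two formulas are the linear estimators.

```python
import numpy as np

# Simulated (hypothetical) data: u is an unobserved confounder,
# z is an instrument that affects y only through x.
rng = np.random.default_rng(0)
n = 20000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = z + u + 0.5 * rng.normal(size=n)
y = 1.0 + 2.0 * x + u + rng.normal(size=n)   # true slope on x is 2

# Design matrices with an intercept column.
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

# OLS: b = (X'X)^(-1) X'y
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# IV (just-identified): b_iv = (Z'X)^(-1) Z'y
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

print("OLS slope:", b_ols[1])   # pulled away from 2 by the confounder
print("IV slope: ", b_iv[1])    # close to 2 in this simulation
```

Using `np.linalg.solve` instead of forming the inverse explicitly is just a numerical-stability choice; the estimator is the same.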

And since in a generalized linear model (GLM), in particular a binary logit or probit model, the maximum likelihood estimate is obtained by iteratively reweighted least squares (IRLS), where each update is:

beta = (X'WX)^(-1) X'Wz

Here W is a weight matrix (as in weighted least squares) and z is a pseudo-dependent variable (from a Taylor series approximation of the link function). (Look at this)

Because of this, I would guess that it is reasonable to use the following estimator, i.e. to simply insert the instrumental variables into the IRLS update:

beta_iv = (Z'WX)^(-1) Z'Wz

(The current beta_iv gives the current weight matrix W, and plugging that into the formula gives an updated value of beta_iv.)
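The iteration proposed above can be sketched as follows, again in Python/NumPy as an illustration rather than a tested package. The function name `iv_irls_logit` and the simulated data are hypothetical, a logit link is assumed, and Z must have as many columns as X (the just-identified case). Note that substituting the working response into the update gives beta_new = beta + (Z'WX)^(-1) Z'(y - mu), so a fixed point solves Z'(y - mu) = 0. Whether that moment condition identifies the structural coefficients of a logit model with an endogenous regressor is a separate question that this sketch does not settle.

```python
import numpy as np

def iv_irls_logit(y, X, Z, max_iter=100, tol=1e-10):
    """Iterate beta = (Z'WX)^(-1) Z'W z_work with a logit link.

    Hypothetical helper for illustration. Z must have the same number of
    columns as X. With Z = X this is ordinary IRLS for logistic regression.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                      # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))     # fitted probabilities
        w = mu * (1.0 - mu)                 # diagonal of W
        z_work = eta + (y - mu) / w         # pseudo-dependent variable
        ZtW = Z.T * w                       # Z'W without forming an n x n matrix
        beta_new = np.linalg.solve(ZtW @ X, ZtW @ z_work)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Simulated (hypothetical) data with an unobserved confounder u.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = z + 0.6 * u + 0.5 * rng.normal(size=n)
latent = -0.3 + 0.8 * x + 0.8 * u + rng.logistic(size=n)
y = (latent > 0).astype(float)

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

beta_iv = iv_irls_logit(y, X, Z)    # instrumented update
beta_glm = iv_irls_logit(y, X, X)   # plain logistic regression for comparison
print("instrumented IRLS:", beta_iv)
print("ordinary IRLS:    ", beta_glm)
```

Comparing the two printed vectors shows how much the instrumented update moves the coefficients relative to the ordinary logit fit on the same data.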

I thought that one could just plug this into R and use matrix multiplication to get the result.

But Lazar asked for software in R.

When I looked a little more I saw this, and under "Instrumental variables" I saw the text "Binary responses: An IV probit model via GLS estimation is available in ivprobit", which points to the package ivprobit.


Can't make spagetti
Hey, we had a little bit of a conversation about probit/logit/LPM models a few days ago and the issue of endogeneity came up. Would you mind sharing with us a little bit more of what you're doing and why you're taking the LPM + instrumental variables approach to it?


Cookie Scientist
I assume he's not using LPM but rather logistic regression, otherwise there'd be no problem and he'd just use ivpack, right?


Less is more. Stay pure. Stay poor.
Spunky, you can have endogeneity: W -> X -> Y, and you can have the same thing but with a confounder of X and Y added to that model. Controlling for W in the model can then help get past the confounder to estimate effects, since the confounder may be unobserved and so cannot be controlled for directly. This kind of plays into the Markov process idea that you don't care about W because it is upstream, but if you have confounding, W becomes your best friend.

A related topic, Mendelian Randomization, is the coolest thing. This is when you have the same scenario, but you have a gene that randomizes you into a group and it is the upstream factor. E.g., lactose intolerance, alcohol dehydrogenase (poor processing of alcohol).