Graphically check linearity in logit regression

#1
Hi all,

I am running a logit regression where the outcome variable is binary. I am interested in checking, graphically, whether the relationship between a given predictor and my dependent variable is (more or less) linear.

I understand that the logit of the outcome ought to be linear, not the dependent variable per se. However, I do not know how to graph the logit of Y against a given X (say, the variable "age").

Also! I am looking for another way to do this without having to use the command "lowess." Any help will be so much appreciated. Thanks!

Best :D
 

noetsi

Fortran must die
#2
Why not run a Box-Tidwell test instead? You don't have to rely on graphical methods for this.

http://www.ats.ucla.edu/stat/stata/webbooks/logistic/chapter3/statalog3.htm

If you suspect non-linearity, one solution is to create an interaction term of each predictor times the natural log of that predictor. Then use these interaction terms (and the original predictors) to predict the original DV. If any of the terms is found to be significant, it suggests that term may be non-linear in predicting the DV.
 
#3
Thanks for your reply. It was precisely because I was reading UCLA's website and was not satisfied that I came to this forum, hehe. As far as I understand, the Stata command boxtid tells you the optimum power to which you could transform your problematic predictor, right? I used it, but it did not make me very happy, as I wanted to see graphically how the variable "behaves."

My "problematic" variable is "age", and I would like to know at which ages the trend changes, since I suspect that it first increases the probability of the outcome, then decreases it, and then increases it again! So I thought a graphical way would be best. But I have no idea how I could plot the logit of Y in Stata.

What about predicting the probability of Y with "predict" and then a scatter plot of that prediction and the problematic variable?

Thanks!
 

Dason

Ambassador to the humans
#4
Are you doing a logit regression or a logistic regression?

If you're doing logistic regression note that technically it's not the logit of the observations that is assumed to be linear. It's the logit of the expected value of the observations that is supposed to be linear.
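
Written out, the distinction above looks like this. This is only a sketch: the symbols p(x), \beta_0 and \beta_1 are notation introduced here for illustration and do not come from the thread.

```latex
\[
p(x) = \mathrm{E}[Y \mid X = x] = \Pr(Y = 1 \mid X = x), \qquad
\operatorname{logit}\bigl(p(x)\bigr) = \ln\frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x .
\]
```

Because each observed Y is either 0 or 1, logit(Y) itself is not finite, which is why the raw outcome cannot simply be plotted on the logit scale.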
 

noetsi

Fortran must die
#5
As far as I understand, the Stata command boxtid tells you the optimum power to which you could transform your problematic predictor, right? I used it, but it did not make me very happy, as I wanted to see graphically how the variable "behaves."
I don't think it addresses statistical power, although I could be wrong. I think it tells you, or rather suggests to you, whether you have misspecified a functional form [as with a non-linear term being specified as linear] or left out a term. Of course, the part about leaving out a term is somewhat suspect. You will likely always leave out some term in any model, which is why the terms to be specified should be tied to theory, not to a test like this.

If you think that the relationship is non-linear [as you suggest above], you should run a Box-Tidwell test before you turn to graphing. Interpreting a graph is always iffy; it is very subjective. It is better, IMHO, to have a formal statistical test such as Box-Tidwell as an alternative to interpreting a graph [if a reasonable test exists].

Unfortunately I use SAS, not Stata, but looking through various comments about Stata, it clearly has a lot of graphing capacity for logistic regression. I would look in the online Stata documentation for this.
 
#7
If you suspect non-linearity, one solution is to create an interaction term of each predictor times the natural log of that predictor. Then use these interaction terms (and the original predictors) to predict the original DV. If any of the terms is found to be significant, it suggests that term may be non-linear in predicting the DV.
Please let me restate this to make sure I understand. You suggest (correct me if I am wrong) the following: logit(Y) = a + b1*x1 + b2*x2 + b3*(x2*ln(x2)) + e, where "x2" is the seemingly problematic predictor. If either "b2" or "b3" is statistically significant (both or just one of them), then the predictor "x2" may be non-linear in predicting Y, correct?
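
A minimal Stata sketch of the model described above. The dataset (nlsw88, shipped with Stata) and the variables union, grade and ttl_exp are stand-ins chosen here for illustration, and exp_lnexp is a hypothetical name for the x*ln(x) term.

```stata
* Sketch of a Box-Tidwell-style check on one predictor (ttl_exp plays the role of x2).
sysuse nlsw88, clear

* x*ln(x) is only defined for x > 0, so restrict to positive values
gen double exp_lnexp = ttl_exp * ln(ttl_exp) if ttl_exp > 0

* Original predictors plus the added x*ln(x) term
logit union grade ttl_exp exp_lnexp

* Wald test of the added term (b3 in the formula above)
test exp_lnexp
```

In the usual description of this check, it is the coefficient on the x*ln(x) term (b3) that is tested; a significant b2 on its own only says that x2 matters, not that its effect is non-linear.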

Are you doing a logit regression or a logistic regression? If you're doing logistic regression note that technically it's not the logit of the observations that is assumed to be linear. It's the logit of the expected value of the observations that is supposed to be linear.
Thanks a lot for the comment.

A regular on this forum once suggested adding the term x*log(x) to the model; if it is significant, then there is a breach of linearity. I have not seen a source confirming this approach.
Thanks for the link, noetsi seems to be the one who suggested it :)

Cheers everybody, this has been very helpful so far.
 
#9
I see... Anyway, I know for sure my model has some problems, since "linktest" and the RESET test suggest so, but I do not know exactly which predictors are the problematic ones (I have a good idea following theory, though).

Nevertheless, I would like to see this graphically, for I think it is easier this way to decide which transformation is best. So getting back to my original question: how could I do this? I was thinking of plotting the log of odds against the predictors, but I need your confirmation. I was thinking of the following Stata code:

Code:
logit depvar invar1 invar2 invar3 // Logit regression
predict probs // Postestimation command to save predicted probabilities of the dependent variable
gen logit = probs / (1-probs) // Generating new variable logit
plot logit invar2 // Graphing the relationship
My question is whether this procedure seems correct to you, guys. Thanks a lot!!
Best,
P.
 

maartenbuis

TS Contributor
#10
It will show a linear relationship, because that code just reproduces the linearity assumption rather than show the data. So that is not going to do what you want.

You may want to look at partial residuals. In Stata you would do this like so:

Code:
glm foreign price mpg, link(logit) family(binomial)
predict double w_resid, working
gen double p_resid_price = _b[price]*price + w_resid
scatter p_resid_price price
Notice that we need to estimate the logit model using glm rather than logit.
 
#11
Thanks, maartenbuis. I tried your code and the resulting graph can be seen in the attachment. If that graph is "the one", what transformation would you suggest? Because I was expecting something rather different (more like a non-monotonic curve, increasing first, decreasing afterwards, and slightly increasing at the end, haha).

Regarding my code... I am not very sure if I have understood your statement. I will wait for your next reply for this :p

In the meantime, what about this one:

Instead of doing "predict pr" and then "gen logit = pr / (1-pr)", I was thinking of:

Code:
gen logit = log(pr / (1-pr))
Theoretically speaking, I am not sure if that "log" ought to be there in order to get the logit of the outcome variable.
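
A small Stata sketch of the difference between the two versions. The dataset (nlsw88) and the variables union and ttl_exp are assumptions made here for illustration, as are the names odds, lgt and xbhat.

```stata
* Odds versus log odds, computed from predicted probabilities.
sysuse nlsw88, clear
logit union ttl_exp

predict double pr, pr              // predicted probabilities
gen double odds = pr / (1 - pr)    // without the log: these are the odds
gen double lgt  = ln(pr / (1 - pr)) // with the log: the logit (log odds)

predict double xbhat, xb           // linear prediction from the model

* The logit of the predicted probability equals the linear prediction
assert abs(lgt - xbhat) < 1e-6 if !missing(lgt, xbhat)

* So this plot is a straight line by construction
twoway line lgt ttl_exp, sort
```

With the log in place, the generated variable is just the model's own linear prediction, so plotting it against the predictor cannot reveal a departure from linearity.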

Thanks a lot!
 
#12
For some reason my previous message does not appear (perhaps it needs to be moderated because I included an attachment?). Anyways...

Thanks again, maartenbuis, for your help. I tried out your code and the resulting graph will appear in my moderated comment (whenever that happens, hehe). In the meantime I wanted to comment on a couple of things that were pointed out to me earlier:

1) The logit I stated above is not correct. The logit should be:

Code:
ln(p/(1-p))
2) To graph the relationship I am after, one should plot the predicted linear results (in Stata, with the command "predict namevar, xb") together with the observed pattern. However, I do not know how to get this "observed pattern" after the logit command in Stata. So, for example, "plot xb xvar" will not work.

Best,
P.
 

maartenbuis

TS Contributor
#13
You can obtain the predicted logits directly by typing:

predict logit, xb

Anyhow you cannot observe the probabilities directly, but you can bin your explanatory variable and compute proportions. Here is one way of doing so in Stata.


Code:
sysuse nlsw88, clear

logit union ttl_exp
predict logithat , xb

gen binned_ttl_exp = floor(ttl_exp/2)*2
bys binned_ttl_exp : egen pr = mean(union)
gen logitobs = logit(pr)

twoway line    logithat ttl_exp, sort || ///
       scatter logitobs binned_ttl_exp
 

noetsi

Fortran must die
#14
The source I have is not entirely clear, but I assume that if either the interaction term or its associated term (that is, b1X1 and b1logx1) is significant, you can assume non-linearity. Another poster stated on this forum that you could just as easily use the square of X1 as the log of X1 to test this, and that may be true. I have yet to find a really good treatment of Box-Tidwell.
 
#15
Regarding "how to find non-linearity issues", I think a Wald test after "chopping down" your continuous variable is an easier way to see this (in Stata, with the command testparm, testing the null hypothesis that the coefficients are simultaneously equal to zero).
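
One possible reading of the "chopping" idea, sketched in Stata. The dataset (nlsw88), the variables union and ttl_exp, the choice of five quantile groups and the name exp_grp are all assumptions made here for illustration. Keeping the linear term in the model alongside the group indicators is also a choice made in this sketch, so that testparm asks whether the groups explain anything beyond a straight line on the logit scale.

```stata
* Chop a continuous predictor into groups and test the group indicators jointly.
sysuse nlsw88, clear

* Cut ttl_exp into 5 quantile groups (the number 5 is arbitrary)
xtile exp_grp = ttl_exp, nquantiles(5)

* Linear term plus indicators for the groups
logit union c.ttl_exp i.exp_grp

* Joint Wald test: H0 is that all group coefficients are zero
testparm i.exp_grp
```

A rejection here would suggest that the straight-line term alone does not capture how the log odds change across the groups.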

And regarding the Box-Tidwell test, the other day I found a paper (quite old, though) that you may find interesting. It's called "Transformations of the Explanatory Variables in the Logistic Regression Model for Binary Data" by Richard Kay and Sarah Little. They assert that there is an issue with the Box-Tidwell approach. I copy/paste what they conclude:

". . . the family of power transformations is not wide enough to incorporate transformations that could be required in quite common settings, for example log(1-x) which may be required if X given y has a beta distribution."

Perhaps a bit out of context if one does not read the whole paper.

Thanks so much guys. Will report more issues in this thread if they arise :D

Best,
P.
 

noetsi

Fortran must die
#16
A Wald test for logistic regression is a test of significance for a parameter (similar to a t test in linear regression), and that is what I was really pointing out. If a parameter or its interaction term is significant in the Wald test, it suggests non-linearity.

Good luck :p
 

m.o

New Member
#18
Why not run a Box-Tidwell test instead? You don't have to rely on graphical methods for this.

http://www.ats.ucla.edu/stat/stata/webbooks/logistic/chapter3/statalog3.htm

If you suspect non-linearity, one solution is to create an interaction term of each predictor times the natural log of that predictor. Then use these interaction terms (and the original predictors) to predict the original DV. If any of the terms is found to be significant, it suggests that term may be non-linear in predicting the DV.

Hi,

I am using SAS for my logistic regression analysis and don't know the code for the boxtid test. I already did the link test and it shows that I have a specification error in my model, but I also do not know which RHS variable has the issue or which variables I should interact. Any help?

Thanks,