# Thread: Logistic regression diagnostic questions

1. ## Logistic regression diagnostic questions

I have a set of 42 variables whose relative importance I am trying to determine. I have already done that, but now I am trying to determine whether non-linearity and multicollinearity (MC) are issues, and how to fix them if they are. I have about 450 cases.

1) Non-linearity. I have a test for this discussed in Tabachnick and Fidell (they did not create it, they simply discuss it), which involves creating a set of new interaction variables, one for each of the existing variables. If a new variable is significant, then the original variable it was created from is non-linear.

The problem is that this would create a list of about 84 variables to run, and I am not sure my data contain enough cases for that. I am also not sure whether I can test for non-linearity in a set of variables I think is important by running subsets of them (say, run 11 of the original variables, then 11 more, and so on). Is non-linearity (which is between the DV and an IV) influenced by the specific set of IVs in the model or not? If it does not matter which IVs are in the model when you test for this, I will just run subsets of the IVs.

2) In terms of MC, I can run a tolerance test. My question is: once you determine that a specific variable fails this test, does it mean that those that did not fail are OK, or could one of the variables that fails the test be involved in MC with a variable that does not? I guess what I am actually asking is: for a tolerance or VIF test, are the only variables you have to worry about the ones that fail it, and how do you know specifically which other variables a failing variable is collinear with?

2. ## Re: Logistic regression diagnostic questions

42 variables and 450 observations. If you want all variables in the model then I have only one thing to say "God bless noetsi."

1) How are you doing the test of nonlinearity for logistic regression? Are you doing it on logodds?
2) What is the tolerance test?

The multicollinearity literature is mostly about linear regression, and people normally carry the linear regression MC analysis over to logistic regression. The usual cut-off for VIF is somewhere between 2.5 and 10 (the choice is a judgement call based on the nature of the variables in the model). One way to identify which variables are correlated with each other is an eigenvector analysis (the COLLIN option in PROC REG).
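To make the VIF idea concrete, here is a minimal sketch in Python (standard library only, not SAS) that computes each predictor's VIF by regressing it on the other predictors. The data are simulated and the variable names `x1`, `x2`, `x3` are made up for illustration: `x2` is built as a noisy copy of `x1`, while `x3` is unrelated to both.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(y, columns):
    """R^2 from an OLS regression of y on the given columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vif(columns):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the others."""
    result = []
    for j, col in enumerate(columns):
        others = columns[:j] + columns[j + 1:]
        result.append(1.0 / (1.0 - r_squared(col, others)))
    return result

# Simulated predictors (hypothetical data, 450 cases as in the thread).
random.seed(0)
n = 450
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [v + random.gauss(0, 0.2) for v in x1]   # nearly a copy of x1
x3 = [random.gauss(0, 1) for _ in range(n)]   # unrelated to x1 and x2

vifs = vif([x1, x2, x3])
print([round(v, 2) for v in vifs])
```

In this toy setup both members of the collinear pair get a large VIF while the unrelated predictor stays near 1. The dependent variable never enters the calculation, and the VIF by itself does not say which other predictors a flagged variable is tied to.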

3. ## The Following User Says Thank You to vinux For This Useful Post:

noetsi (03-18-2013)

4. ## Re: Logistic regression diagnostic questions

Always asking good questions. Can you provide a full citation for the Tabachnick and Fidell paper? Can you also better describe why non-linearity is important in logistic regression? Also, depending on the content of the Tabachnick and Fidell paper, I may have a follow-up email about conducting it in SAS.

Yes, you do have quite a few variables. Are all of these continuous, or are some categorical? If so, are you taking the approach of running them through PROC REG (SAS) to examine MC?

Hmm, Vinux, I am not sure that I usually use the COLLIN option. I will need to investigate. I know that many times I use "TOL" and "VIF".

5. ## The Following User Says Thank You to hlsmith For This Useful Post:

noetsi (03-18-2013)

6. ## Re: Logistic regression diagnostic questions

I realized after talking this over on the chat that I had not explained very well what I was asking.

So before I address the questions above, I wanted to clarify. For the linearity question, this is what I am asking. Say I had a dependent variable Death (you die or you don't) and I was predicting it with a predictor variable Age, and I find the relationship is linear. Now I add a new predictor variable called Health. Can Age's relationship to Death become non-linear as a result? Essentially, I am trying to find out whether linearity involves only the relationship between the DV and a specific predictor, or whether that relationship is influenced by the other predictors in the model.

Now to the questions above.

Tolerance is just the inverse of VIF. I used VIF in the end. All my variables were below 5 (10 seems to be the most common recommendation), but not all were below 2.5. I found suggestions that you need to look at model strength (reflected by R squared) when choosing which VIF cut-off to use, but since pseudo R squared values in logistic regression are not directly comparable to linear R squared (they are commonly smaller), I did not try that.
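For reference, both quantities come from the R squared obtained by regressing predictor j on all the other predictors (these are the standard textbook definitions, not something specific to this thread):

[math] \mathrm{Tol}_j = 1 - R_j^2, \qquad \mathrm{VIF}_j = \frac{1}{1 - R_j^2} = \frac{1}{\mathrm{Tol}_j} [/math]

So a VIF cut-off of 2.5 corresponds to [math] R_j^2 = 0.6 [/math], a cut-off of 5 to [math] R_j^2 = 0.8 [/math] (tolerance 0.2), and a cut-off of 10 to [math] R_j^2 = 0.9 [/math] (tolerance 0.1).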

How do you use eigenvectors (which I already generated in EFA) to address this?

It's not in a Tabachnick and Fidell paper. It's in their statistics text "Using Multivariate Statistics", 5th ed. They call it the Box-Tidwell approach (apparently Hosmer and Lemeshow created it). It's on page 474 of their book. It looks like they are creating an interaction term between the predictor and its natural log. If the test of the interaction term is significant, it suggests non-linearity. Note that, as I wrote this, I realized I don't think this will work for my variables: they are looking at interval predictors, while mine are dummy variables.

I don't think you can create a natural log variable for a variable taking on two levels, can you?

7. ## Re: Logistic regression diagnostic questions

Edit: Is it true that linearity is not assumed in logistic regression for dummy predictor variables? Arggg, I never knew that, if so. It never came up in any class or reference I read....

Might as well ask this here rather than do a new thread. I wanted to use Box-Tidwell which detects non-linearity in regression (including logistic regression). However it generates the natural log of a predictor variable to test. All my variables are dummy variables coded 0 and 1. That raises two questions:

My variables are coded 0 and 1, and there is of course no natural log of 0. Can I simply add 1 to each of the two levels (so I end up with 1 and 2) without creating problems for the logistic regression? That is, will running the logistic regression with the values 1 and 2 generate the same results as using 0 and 1? (I think so....)

Can you generate a natural log variable for a non-continuous variable? I am not sure this method works for categorical variables.
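On the recoding question, a quick check is possible because a logistic model with a single two-level predictor is saturated: its fitted probabilities are simply the observed event proportions in each group, so the maximum likelihood estimates have a closed form. The sketch below (Python rather than SAS, standard library only; the counts and the helper name `fit_binary_predictor` are made up for illustration) compares the (0, 1) and (1, 2) codings.

```python
import math

def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1.0 - p))

def fit_binary_predictor(events0, n0, events1, n1, code0, code1):
    """Closed-form logistic MLE for one two-level predictor.

    With a single dichotomous predictor the model is saturated, so the fitted
    probabilities equal the observed proportions in each group.
    """
    p0, p1 = events0 / n0, events1 / n1
    slope = (logit(p1) - logit(p0)) / (code1 - code0)
    intercept = logit(p0) - slope * code0
    return intercept, slope

def predicted(intercept, slope, code):
    """Fitted event probability at a given coded value."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * code)))

# Hypothetical counts: 30 events out of 200 in one group, 90 out of 250 in the other.
b0_01, b1_01 = fit_binary_predictor(30, 200, 90, 250, 0, 1)
b0_12, b1_12 = fit_binary_predictor(30, 200, 90, 250, 1, 2)

print("coding (0, 1): intercept", round(b0_01, 4), "slope", round(b1_01, 4))
print("coding (1, 2): intercept", round(b0_12, 4), "slope", round(b1_12, 4))

p_upper_01 = predicted(b0_01, b1_01, 1)
p_upper_12 = predicted(b0_12, b1_12, 2)
```

Under these assumptions the slope (the log odds ratio) and the fitted probabilities are the same for both codings; only the intercept shifts, by exactly one slope.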

8. ## Re: Logistic regression diagnostic questions

It doesn't make sense to talk about a dichotomous variable being non-linear so you don't really even need to test categorical variables...

9. ## The Following User Says Thank You to Dason For This Useful Post:

noetsi (03-18-2013)

10. ## Re: Logistic regression diagnostic questions

Is this any different if the dependent variable is continuous? That is, if you have a dummy variable as a predictor with a continuous dependent variable, would you need to address linearity?

I don't think so given your explanation, but then I have been wrong before...

11. ## Re: Logistic regression diagnostic questions

Nope - doesn't matter. We're still only talking about modeling two points so the relationship can be modeled by a line no matter what...

12. ## Re: Logistic regression diagnostic questions

Well, you could recode your (0,1) variable to (1,e) and then take the natural log, getting a variable coded (ln(1), ln(e)) = (0,1). This argument establishes that the method cannot in general be used to make the test you want.

What about another recoding of values? A function is a rule that transforms one number into one and only one number. So, ignoring the case where you transform the coding into two equal values, e.g. (1,1), taking logs after recoding does not achieve anything that recoding could not achieve by itself. And my hunch is that it changes the estimated coefficient in an immaterial way. Imagine a regression with one dummy variable as a predictor: what happens to the coefficient as the coding is changed from (0,1) to (0, 0.5)? The difference between the two levels is halved, so the coefficient must double to model the true effect, which doesn't change just because you decide to change units.

I learned to test for non-linearity by using squared terms, which suggests to me that there is no magic in the log transform; what is essential is that you have a non-linear function of your variable. But then we're back to the functional argument: any function of a binary variable will be just another binary variable.

I know this is no proof and I cannot cite any article.
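The functional argument can at least be checked numerically. The sketch below (Python, standard library only; the helper name `affine_through`, the coding (1, 2), and the effect size 0.8 are made up for illustration) shows that a transform of a two-level variable coincides with a straight line a + b*x at both levels, so a Box-Tidwell or squared term adds nothing beyond the dummy itself, and that halving the distance between the codes doubles the slope.

```python
import math

def affine_through(levels, transform):
    """Return (a, b) such that transform(x) == a + b * x at both levels of a two-level variable."""
    lo, hi = levels
    b = (transform(hi) - transform(lo)) / (hi - lo)
    a = transform(lo) - b * lo
    return a, b

levels = (1.0, 2.0)   # a 0/1 dummy shifted by 1 so that the log exists
transforms = {
    "x * ln(x)": lambda x: x * math.log(x),   # Box-Tidwell style term
    "x ** 2": lambda x: x ** 2,               # squared term
}

max_error = 0.0
for name, g in transforms.items():
    a, b = affine_through(levels, g)
    for x in levels:
        max_error = max(max_error, abs(g(x) - (a + b * x)))
    print(f"{name} equals {a:.4f} + {b:.4f} * x at both levels")

# The group difference is fixed by the data; the slope is that difference
# divided by the distance between the two codes.
effect = 0.8                         # hypothetical difference between the two groups
slope_01 = effect / (1.0 - 0.0)      # coding (0, 1)
slope_half = effect / (0.5 - 0.0)    # coding (0, 0.5)
print(slope_01, slope_half)
```

Because each transformed column is an exact linear function of the dummy, putting both in a model with an intercept gives perfect collinearity, which is another way of seeing why the test has nothing to detect for a dichotomous predictor.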

13. ## Re: Logistic regression diagnostic questions

yeah okay you guys went at it while I was writing

14. ## The Following User Says Thank You to JesperHP For This Useful Post:

noetsi (03-18-2013)

15. ## Re: Logistic regression diagnostic questions

Your comments still help in general with testing for linearity, even if they do not apply here. I assumed there must be some magical reason to use a log, not that any non-linear term would work. Actually, squaring seems a lot easier.

16. ## Re: Logistic regression diagnostic questions

noetsi, curious how this is looking in code for SAS. Once you have figured out what you are exactly doing, I would greatly appreciate it if you could post the code and a brief explanation.

17. ## Re: Logistic regression diagnostic questions

P.S., These strings of logistic topics could be the sister paper for the collaborative TS manuscript about linear regression assumptions!

18. ## Re: Logistic regression diagnostic questions

The code for the MC is fairly straightforward:

/* Linear regression run only to obtain tolerance and VIF for the predictors */
PROC REG DATA=SASUSER.FIELDLOG;
MODEL DDV = DARQ1 DARQ2 DARQ3 DARQ4 DQ1 DQ10 DQ11 DQ12 DQ13 DQ14 DQ15 DQ16 DQ17 DQ18 DQ19 DQ2 DQ20 DQ21 DQ22 DQ23 DQ24 DQ25 DQ26 DQ27 DQ28 DQ29 DQ3 DQ30 DQ31 DQ32 DQ33 DQ34 DQ35 DQ36 DQ37 DQ38 DQ39 DQ4 DQ40 DQ41 DQ5 DQ6 DQ7 DQ8 DQ9
/ SELECTION=NONE
TOL VIF
;
RUN;
QUIT;

Note that while the project I am working on involves logistic regression, this does not matter for MC, which involves only the IVs. So you can run your variables through linear regression to get the VIF or tolerance (you just ignore all the other results from the linear regression).

For the test of linearity it is very, very simple. For each IV you want to test, you create a new variable equal to X*log(X) (where X is the IV) and add it to the model. For example, in the following I am testing whether DQ1 is non-linear; testvar is DQ1 times the log of DQ1.

/* testvar = DQ1*LOG(DQ1) must already exist in the input data set */
PROC LOGISTIC DATA=WORK.Qlogistics;
MODEL DDV (Event = '1')=DQ1 testvar /
SELECTION=NONE
;
RUN;

I don't have any data in which one variable is known to be linear in the logit and another is known not to be, so I can't test this, but the source I read shows this is the way it's done.

http://support.sas.com/kb/30/333.html
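Lacking real data with a known answer, simulated data can stand in. The following is a minimal sketch in Python (standard library only, not SAS) under stated assumptions: one continuous, strictly positive predictor, a hand-written Newton-Raphson logistic fit, and two made-up true models, one whose logit is linear in x and one whose logit is linear in ln(x). The function names and simulation settings are illustrative only. In each case the model contains x and x*ln(x), and the Wald z statistic of the x*ln(x) term is reported.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_logistic(X, y, max_iter=50, tol=1e-8):
    """Newton-Raphson logistic fit; returns coefficients and their Wald z statistics."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        grad = [0.0] * p
        info = [[0.0] * p for _ in range(p)]
        for row, yi in zip(X, y):
            eta = sum(b * x for b, x in zip(beta, row))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * row[j]
                for k in range(p):
                    info[j][k] += w * row[j] * row[k]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    # Wald z = estimate / standard error; variances are the diagonal of info^-1.
    z = []
    for j in range(p):
        unit = [1.0 if k == j else 0.0 for k in range(p)]
        z.append(beta[j] / math.sqrt(solve(info, unit)[j]))
    return beta, z

def simulate(n, true_logit, rng):
    """Draw a positive predictor and a 0/1 outcome from the given true logit."""
    xs = [rng.uniform(0.5, 5.0) for _ in range(n)]
    ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-true_logit(x))) else 0 for x in xs]
    return xs, ys

def box_tidwell_z(xs, ys):
    """Wald z of the x*ln(x) term in a model with intercept, x and x*ln(x)."""
    X = [[1.0, x, x * math.log(x)] for x in xs]
    _, z = fit_logistic(X, ys)
    return z[2]

rng = random.Random(1)
z_curved = box_tidwell_z(*simulate(4000, lambda x: -1.5 + 3.0 * math.log(x), rng))
z_linear = box_tidwell_z(*simulate(4000, lambda x: -2.0 + 0.8 * x, rng))
print("logit linear in ln(x): z =", round(z_curved, 2))
print("logit linear in x:     z =", round(z_linear, 2))
```

A large absolute z for the x*ln(x) term is what the Box-Tidwell approach reads as evidence of non-linearity in the logit. The term needs a strictly positive predictor, so this sketch does not apply to 0/1 dummies.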

19. ## The Following User Says Thank You to noetsi For This Useful Post:

hlsmith (03-18-2013)

20. ## Re: Logistic regression diagnostic questions

Originally Posted by noetsi
Your comments still help in general with testing for linearity, even if they do not apply here. I assumed there must be some magical reason to use a log, not that any non-linear term would work. Actually, squaring seems a lot easier.

Another, easier test for non-linearity in multiple regression is to run the regression, for example:

[math] y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon [/math]

save the predicted values [math] \hat y [/math], and then run

[math] y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 \hat y^2 + \beta_4 \hat y^3 + \epsilon [/math]

and test the significance of the transforms of the predicted values. This test does not allow you to locate where the non-linearity is coming from, though.
You don't necessarily need [math] \hat y^3 [/math]; how many polynomial terms you include is a matter of >>feeling<<, I guess.
Also, I don't know how this works with logistic regression, but why shouldn't it work?
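The two-step procedure described above is commonly known as Ramsey's RESET test. Here is a minimal sketch of it for ordinary linear regression in Python (standard library only); the simulated data, coefficient values, and function names are assumptions made for illustration. It fits the restricted model, adds the squared and cubed fitted values, and computes the F statistic for the two added terms.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(X, y):
    """Fit OLS through the normal equations; return fitted values and residual sum of squares."""
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return fitted, ssr

def reset_f(x1, x2, y):
    """F statistic for adding yhat^2 and yhat^3 to y ~ 1 + x1 + x2."""
    n = len(y)
    base = [[1.0, u, v] for u, v in zip(x1, x2)]
    yhat, ssr_restricted = ols(base, y)
    augmented = [row + [f ** 2, f ** 3] for row, f in zip(base, yhat)]
    _, ssr_full = ols(augmented, y)
    added_terms, full_params = 2, 5
    return ((ssr_restricted - ssr_full) / added_terms) / (ssr_full / (n - full_params))

# Simulated data: one outcome that is truly linear, one with an omitted x1^2 term.
rng = random.Random(2)
n = 300
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
noise = [rng.gauss(0, 0.5) for _ in range(n)]
y_linear = [1.0 + u + 0.2 * v + e for u, v, e in zip(x1, x2, noise)]
y_curved = [yl + 0.5 * u ** 2 for yl, u in zip(y_linear, x1)]

f_linear = reset_f(x1, x2, y_linear)
f_curved = reset_f(x1, x2, y_curved)
print("F, truly linear model: ", round(f_linear, 2))
print("F, omitted x1^2 term:  ", round(f_curved, 2))
```

The statistic is compared with an F distribution on 2 and n - 5 degrees of freedom. A large value flags misspecification somewhere in the model but, as noted, not which predictor is responsible.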

21. ## The Following User Says Thank You to JesperHP For This Useful Post:

noetsi (03-18-2013)