Logistic Regression - some expert advice!

Hi guys

I've recently worked on a project involving Logistic Regression. Although I've managed to 'complete' the project, I'm unsure whether what I have done so far is correct and/or sufficient. It is a case of regular binary logistic regression: one dichotomous response variable and several (approx. 30) independent variables.

I basically used PROC LOGISTIC to get the output and removed all variables for which the p-value was > 0.05.

However, I was wondering if someone could provide a list of the steps that need to be performed in Logistic Regression from start to finish (no need to detail them much, as I will try to find out more about them on my own). E.g. do I need to perform diagnostics of some sort before running the logistic regression? I read about correlation matrices, frequency tables, etc., but I don't know if and why I need all that, and perhaps much more.

Also, is there a good dedicated clear source (a book or online source) that explains every little detail about using SAS for logistic regression including the code and how to read the output?

I've consulted multiple resources (many books) but none has been "complete" or detailed enough in providing the information I need.

All help is appreciated.


No cake for spunky
The best book I have found for SAS and logistic regression (none is perfect) is Paul Allison's Logistic Regression Using SAS (2nd ed).

These are the steps I think are necessary (many are going to disagree, which is the norm in stats). :p

1) Create your research hypothesis and determine your variables. Decide if you will test for interaction. Inspect the data for missing values and decide if you will substitute for them or just delete them.
2) Run the model (some use stepwise to do this, but I agree with the logic of those who feel this is a flawed approach).
3) Look at the various results that interest you. The Hosmer-Lemeshow test should be among them (although Allison does not like it) simply because there are few good alternatives and it is commonly used.
4) Run your diagnostics. Normality and equal error variance are not assumed, so don't test for them. Look at linearity in the logit (for interval predictors). You have to know from your design if the observations are independent - there are no diagnostic tests for this. Use linear regression to test for multicollinearity (it depends only on the predictors, so the form of the DV does not matter). Look for outliers and (if you are ambitious) leverage, which SAS has diagnostics for.
5) If you have interaction you need to decide how to deal with it. Allison has a significant section on this.
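
To make the multicollinearity check in step 4 a bit more concrete, here is a rough sketch of the arithmetic in Python (not SAS, just for illustration). It regresses one predictor on another and turns the R-squared into a variance inflation factor (VIF). The data, the variable names and the two helper functions are all made up, and with more than two predictors you would regress each predictor on all of the others instead.

```python
# Hypothetical data: two predictors that move together (made up for illustration).
age = [23, 31, 38, 45, 52, 60, 67, 74]
income = [21, 30, 41, 44, 55, 58, 70, 73]


def r_squared(y, x):
    """R-squared from a simple linear regression of y on x."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy ** 2) / (sxx * syy)


def vif(y, x):
    """Variance inflation factor: 1 / (1 - R^2). Tolerance is 1 / VIF."""
    return 1.0 / (1.0 - r_squared(y, x))


age_r2 = r_squared(age, income)
age_vif = vif(age, income)
print(f"R^2 = {age_r2:.3f}, VIF = {age_vif:.1f}, tolerance = {1 / age_vif:.3f}")
```

The binary DV never appears in this calculation, which is why a linear regression procedure can be used for the check even though the final model is logistic.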

30 variables is a lot unless you have a huge data set. You might use theory or factor analysis to limit this some. The problem with throwing out variables whose p-values are greater than .05 is that 1) you lose power, I believe, by including them in the first place and 2) you may generate family-wise error by using the same data set to run multiple tests. It is far better to remove them before you run the model than to use this method.

Incidentally I am not an expert - just some suggestions. I spent months going through books and SAS documentation to get the code I use (and asking the true regression experts here such as Dason and Jake - although they use R not SAS).


noetsi, thanks again for your post. I think I have the first ed. of the book and it did have some good examples, but I don't remember it covering how to use simple diagnostics in the early stages of analysis, before you even decide whether logistic regression is good for your data set or not.

I decided to use logistic just because my dependent variable is binary. Most other variables are continuous or categorical. There are 14000 observations and 30 variables, so I'm hoping it's a good dataset.

I read somewhere about people creating a correlation matrix but I didn't understand:

1.) Why?
2.) Would that be for all variables at the same time? Like a correlation matrix of all 30 variables?

I used PROC LOGISTIC (I think you helped me with that one in my other post) and then basically took stuff out based on p-values. Isn't this the reverse of stepwise regression? Instead of adding variables step by step, I took them out step by step and kept re-running the logistic regression to see any changes in the output. Nothing really changed, which was a concern for me. The area under the ROC curve was 99% from the very beginning, and I reduced the model down to approx. 15 variables and still saw no change.

Also, when you say "decide if you will test for interaction", which interaction are you talking about - amongst the variables, i.e. multicollinearity?


No cake for spunky
The 2nd edition is much better than the old book as the original book's SAS code is badly out of date.

There is no discussion of diagnostics to decide whether to use logistic regression or not, because the situation is usually pretty cut and dried. If your dependent variable is categorical and you want to do regression (which many do), then logistic regression is the way to go. If you want a discussion of logistic regression, including why to do it and when not to, you might look at Tabachnick and Fidell's "Using Multivariate Statistics" (5th ed.), particularly the first few pages of chapter 10, where they compare it to other methods.

I think that the correlation matrix might refer to using it in exploratory factor analysis, either to generate factors to run in the regression or to look at which variables contribute the most to the variation the variables have in common. If so, then you would place all the variables in the matrix.

What you describe is not, I believe, the reverse of stepwise, which (as I understand it) lets the software decide which variables to include in the model based on a statistical criterion, such as how much each one improves the fit. The problem with stepwise is that it is heavily sample dependent, and what you include in one sample might be very different in another. Also, if the predictors are highly correlated, what gets left in and out in stepwise is problematic [and to some extent illogical]. Backward stepwise includes all the variables and then throws them out one at a time; forward adds them one at a time until there are none left that meet the criterion.

I am not certain what the practical and theoretical differences are between using stepwise and removing variables that are not significant and rerunning the model, but the sense I get is that the former (stepwise) is much more criticized than the latter. You do have the problem I mentioned with family-wise error if you run model after model with the same data. It will look like you ran the model a single time, when in fact you have potentially run it multiple times to get your final results, and thus the true type I error will be much higher than your nominal type I error (the alpha level you use to reject or fail to reject the null - usually .05).
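
To give a feel for how quickly family-wise error grows, here is a small Python sketch (again just arithmetic, not SAS). It uses the textbook formula 1 - (1 - alpha)^k, which assumes the k tests are independent. That assumption will not hold exactly when you keep refitting models on the same data, so treat the numbers as a rough guide only. The function name and the values of k are made up for illustration.

```python
ALPHA = 0.05  # nominal type I error for a single test


def familywise_error(alpha, k):
    """Chance of at least one false positive across k independent tests."""
    return 1.0 - (1.0 - alpha) ** k


# Hypothetical numbers of tests, e.g. one per model re-run or per variable.
for k in (1, 5, 15, 30):
    print(f"{k:>2} tests: family-wise error = {familywise_error(ALPHA, k):.3f}")
```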

If you had 30 variables and got the same results (in terms of fitting the data or however you judged model adequacy) as with 15, then this suggests the 15 that dropped out did not provide much value. Try rerunning the 15 you decided to drop against the DV and see if the overall model test or the individual chi-square tests of the variables are significant.

You should test for multicollinearity, but that is not the same thing as interaction. Say you have gender and age as predictors. Regression without an interaction term assumes that the impact of age on the DV does not vary across levels of gender. If this is not true, you have interaction. You test for this by creating a new variable AGE*GENDER and seeing if it is statistically significant. If it is, you have interaction, which makes the analysis much more complex - but you should test for it if you have reason to believe that certain variables will interact. Commonly you do this because you have a theory of interaction, or the literature suggests one.
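
Here is a small Python sketch of what the AGE*GENDER term does to the model (illustration only, not SAS). The coefficients are invented, not estimated from any data, and gender is assumed to be coded 0/1. The point is that once the product term is in the model, the effect of age on the log-odds is no longer the same for both genders.

```python
import math

# Hypothetical coefficients, made up for illustration (not estimated from data).
B0 = -4.0
B_AGE = 0.05
B_GENDER = 0.5
B_AGE_X_GENDER = 0.03


def log_odds(age, gender):
    """Linear predictor with an AGE*GENDER product term (gender coded 0/1)."""
    age_x_gender = age * gender  # the new interaction variable
    return B0 + B_AGE * age + B_GENDER * gender + B_AGE_X_GENDER * age_x_gender


def probability(age, gender):
    """Convert the log-odds to a predicted probability."""
    return 1.0 / (1.0 + math.exp(-log_odds(age, gender)))


def age_slope(gender):
    """Change in log-odds for one extra year of age at a given gender."""
    return log_odds(41, gender) - log_odds(40, gender)


slope_g0 = age_slope(0)
slope_g1 = age_slope(1)
print(f"age effect when gender=0: {slope_g0:.2f}")
print(f"age effect when gender=1: {slope_g1:.2f}")
print(f"P(event | age=40, gender=1) = {probability(40, 1):.3f}")
```

If the coefficient on the product term were zero, the two age effects would be identical, which is what the model without interaction assumes.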