Logistic Regression: To dummy code or not to? Which backward elimination to use?

#1
I have 10 binary dependent variables (disease prevalence: yes/no, 1/0). I want to run a logistic regression on each of them with the following categorical independent variables:

- Year (8 levels)
- Gender (2 levels)
- Age (11 levels)
- Season (2 levels)
- Location (10 levels)

After hours of Google research I am still confused by certain things. So here are my questions:

(Notes: I use SPSS 20; for the questions let's use 'Year' as the independent variable)

1) Analysis Choice:
- Just to be sure: binary logistic regression is the analysis of choice, right? (Because all the variables are categorical)

2) Dummy Coding:
- When performing LogReg, do I have to make dummy variables or not? I tried it with and without dummy codes and the results are completely different.

3) Covariate Types/Options:
- In SPSS - having desperately tried all options - I noticed there is a difference between marking and not marking variables/covariates as 'categorical'. What exactly is the difference and when should I use which option?

4) Contrast/Reference Category:
- In SPSS you can set the Contrast and Reference Category. I assume that Contrast should remain on 'indicator' (alternatives are: simple, difference, helmert, repeated, polynomial and deviation) since I've never read anything about changing that. Right?
- Reference category means which category you compare it with, right? But that is awfully confusing for me in this setting. Suppose I use 10 dummy variables to express year, what does it mean that I compare '2004=0' with '2004=1' and '2005=0' with '2005=1', etc.? And what if I do not use binary dummy variables but use the single 'year' variable with its 10 levels instead?
- To compare all the subsequent years with the first/lowest one (ie: compare 2001-2009 with 2000) I should (obviously?) use 'reference category = first'?

5) Method/Backward Elimination:
- For each of the disease prevalences I have to use backward elimination to reduce the model by eliminating interactions with a p>.15.
- Though I know I have to do it, I do not know which of the three stepwise backward methods I should use (Conditional, LR or Wald).

6) Making Sense:
- A thorough explanation of the "why (not)" and "what" of the above issues would be wildly appreciated! This forum is awesome; I fully intend to remain an active lurker or even post in order to increase my stat proficiency :).

Thanks in advance!
 

noetsi

Fortran must die
#2
1) With only two levels of the DV, binary logistic regression is a good choice (probably the most common today, although alternatives such as probit exist). It does not matter for this choice what form your IVs are in. I work in SAS so I can't answer most of the other questions you asked, but you might find this useful.

http://www.ats.ucla.edu/stat/spss/output/logistic.htm

Generally there are two types of coding for categorical variables. Reference coding compares each dummy to one level of the categorical variable that is omitted. The slope for a given dummy is the difference between that level and the omitted level. So if the categorical variable is gender and male is the dummy, the slope will be the difference between male and female respondents (in logistic regression, a difference in log-odds). Effect coding shows you the difference between each level and the unweighted mean of the level means. So if you have a variable with five levels, the level 4 slope will be the difference between level 4 and the mean of the five level means (one level won't get a slope of its own; SPSS will hopefully tell you which it is). You can always switch which level is omitted.
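To make the two coding schemes concrete, here is a minimal Python sketch (not SPSS or SAS output). It assumes a model with a single categorical predictor, in which case the fitted log-odds for each level equal the observed log-odds, so the slopes can be computed directly from counts. The counts, level names, and function names are all made up for illustration.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

# Hypothetical (cases, total) per level of a categorical predictor.
counts = {"2000": (10, 100), "2001": (20, 100), "2002": (40, 100)}
levels = list(counts)
ref = levels[0]  # omitted (reference) level, here the first one

def reference_coding(level, levels, ref):
    """Indicator row: 1 in the column for `level`; the reference level is all zeros."""
    return [1 if level == other else 0 for other in levels if other != ref]

def effect_coding(level, levels, ref):
    """Effect row: same as indicator coding, except the omitted level is -1 in every column."""
    if level == ref:
        return [-1] * (len(levels) - 1)
    return reference_coding(level, levels, ref)

log_odds = {lvl: logit(c / n) for lvl, (c, n) in counts.items()}

# Reference coding: each slope is the level's log-odds minus the reference level's.
ref_coefs = {lvl: log_odds[lvl] - log_odds[ref] for lvl in levels if lvl != ref}

# Effect coding: each slope is the level's log-odds minus the unweighted mean
# of the log-odds over all levels.
mean_log_odds = sum(log_odds.values()) / len(levels)
eff_coefs = {lvl: log_odds[lvl] - mean_log_odds for lvl in levels if lvl != ref}

for lvl in levels:
    print(lvl, reference_coding(lvl, levels, ref), effect_coding(lvl, levels, ref))
print("reference-coded slopes:", ref_coefs)
print("effect-coded slopes:", eff_coefs)
```

Exponentiating a reference-coded slope gives the odds ratio of that level against the omitted level. With additional predictors in the model the slopes are adjusted for them and no longer match the raw counts exactly.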

I don't know what SPSS calls effect coding (from your comments it apparently does not use the term effect coding).

You should not use stepwise or backward elimination, period. It is strongly frowned upon (I read one article entitled "Death to Stepwise, Think for Yourself") :p There are a number of significant problems with this method: one is that the results can change dramatically from sample to sample; another is that with high collinearity the results can be very deceptive. I don't understand what you mean by eliminating interactions, but you should never eliminate significant interaction terms (if that is what you are doing).

Sorry I can't comment more, SAS uses very different forms to do analysis.
 
#3
Thanks! That was helpful. I am glad I at least chose the correct analysis; the more research I did, the more I started to doubt myself.

I think I get the difference between dummy coding and effect coding.
- When I want to test the difference between any level and the reference level I would use dummy coding (example: comparing 2001, 2002, 2003, etc. with 2000, because you want to see a difference over the years).
- When I want to test the difference between any level and the overall mean I would use effect coding (example: comparing every individual test location with the mean of all the other test locations, because you want to see if there is a difference between the locations).
Right?

Anyway, I am still not entirely sure whether or not I should make dummy variables myself. Here's a quote from your link: "If you have a categorical variable with more than two levels, for example, a three-level ses variable (low, medium and high), you can use the /categorical subcommand to tell SPSS to create the dummy variables necessary to include the variable in the logistic regression, as shown below."
Though they are using syntax to do the analysis and I am just sticking to the UI, I do get the feeling that by using the 'categorical' option menu in SPSS it will automatically create dummy variables in a good way.

Concerning backward elimination: I found a published research article with pretty much the same data and setup as I am working with and they do backward elimination. Quote: "For each parasite, the model was reduced by eliminating all interactions that had a p value of >0.15 through backward elimination, leaving a model with all main effects and significant two-way interactions."
Are you sure stepwise/backward is that evil? Do you know when you should use it?

Thanks a lot, you've already (in)directly answered a lot of questions!
 

noetsi

#4
"the more research I have been doing the more I started to doubt myself."
Why should you be any different than the rest of us :p

Effect coding, as jake pointed out to me recently, does not compare the dummy/design variables it creates to the grand mean of all the values for that categorical variable (that is, the average value over all respondents). It compares them to the unweighted mean of the level means (the mean of the means). Unless each level has exactly the same number of cases this will differ from the grand mean, and it's rare for each level to have the same number of cases. Other than that I agree with your comments on effect coding (in practice effect coding uses dummies just like reference coding; it just compares them to the mean of the means rather than the reference level).
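A small Python sketch of why the mean of the means differs from the grand mean when group sizes are unequal. The numbers are invented, and for simplicity the comparison is on the proportion scale (in logistic regression the same idea applies on the log-odds scale).

```python
def mean_of_means(groups):
    """Unweighted average of the per-group proportions."""
    proportions = [positives / n for positives, n in groups.values()]
    return sum(proportions) / len(proportions)

def grand_mean(groups):
    """Overall proportion across all respondents (groups weighted by size)."""
    total_positives = sum(positives for positives, _ in groups.values())
    total_n = sum(n for _, n in groups.values())
    return total_positives / total_n

# Hypothetical (positives, sample size) per location.
unbalanced = {"A": (5, 10), "B": (30, 100), "C": (90, 900)}
balanced = {"A": (5, 10), "B": (3, 10), "C": (1, 10)}

print("unbalanced:", mean_of_means(unbalanced), grand_mean(unbalanced))
print("balanced:  ", mean_of_means(balanced), grand_mean(balanced))
```

In the unbalanced case the large group C dominates the grand mean, while the mean of the means treats every location equally. With equal group sizes the two quantities coincide.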

I don't know SPSS but SAS automatically creates dummy variables for categorical variables. I assume SPSS does the same given your comments. Just be careful to remember which level is being excluded as this may not be obvious unless you look.

There are tons of articles that use backward and forward selection. I would bet you that 99-plus percent of their authors are not statisticians (I am not a statistician either, of course; I have simply read a fair amount on this). You can certainly do it, but before you do I urge you to read a text or article on the limitations of these methods. They can be quite serious. Since I read about them I have quit using these methods.

I have never seen these methods used to remove interactions, only main effects, so I can't comment on that. One thing I should point out is that it is considered invalid to have interaction terms in your model and exclude the main effects associated with them. So if your backward analysis omits a main effect but keeps an interaction associated with it, you should put the main effect back in regardless of what backward elimination does.
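The rule about main effects can be written as a simple check on whatever set of terms a selection procedure leaves behind. This is only an illustrative Python sketch: the 'A*B' naming of interactions and the function name are assumptions made for the example, not anything SPSS or SAS produces.

```python
def enforce_hierarchy(terms):
    """Return `terms` plus any main effect that a retained interaction needs.

    Interactions are written as 'A*B' (a naming convention assumed for this sketch).
    Existing terms keep their order; missing main effects are appended at the end.
    """
    kept = list(terms)
    for term in terms:
        if "*" in term:
            for main_effect in term.split("*"):
                if main_effect not in kept:
                    kept.append(main_effect)
    return kept

# Hypothetical result of a backward run that dropped 'Age' and 'Season'
# but kept their interaction.
after_backward = ["Year", "Gender", "Age*Season"]
print(enforce_hierarchy(after_backward))
```

The sketch only restores main effects, which is the case discussed above; it does not try to restore lower-order interactions for three-way or higher terms.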

Don't assume that just because something is done in a journal it's right. I have read analyses of medical journals that point out frightening mistakes made by authors. Being a subject matter expert (which most authors are) does not mean they know the statistics they use. They may well not, and may simply have copied a mistake made by others.
 
#5
Alright, thanks again for the info Noetsi!

I guess my remaining issues are SPSS orientated. I'll just wait until someone comes by who is skilled with SPSS.

Just to clarify, these are the issues I am still uncertain about:
- Should I make dummy variables? Why (not)?
- Should I mark the variable(s) as categorical? Why (not)?
 

noetsi

#6
1) If you use a categorical IV in regression you have to use dummies. SPSS will likely make them for you if it's anything like SAS. You won't have a choice, because a categorical IV that is not broken into dummies leads to nonsensical results, including for the other IVs (at least according to my graduate committee).
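As a rough illustration of why the two approaches give different results, here is a Python sketch of the design-matrix rows a model sees when a 10-level variable such as Location is entered as a plain number versus as dummies. The location codes, coefficient values, and function names are hypothetical.

```python
def numeric_row(code):
    """Row when the code is treated as a number: intercept plus a single slope column."""
    return [1, code]

def dummy_row(code, codes, ref):
    """Row when the variable is categorical: intercept plus one indicator per non-reference level."""
    return [1] + [1 if code == other else 0 for other in codes if other != ref]

def linear_predictor(row, coefs):
    """Log-odds implied by a design row and a coefficient vector."""
    return sum(x * b for x, b in zip(row, coefs))

codes = list(range(1, 11))  # hypothetical location codes 1..10
print(numeric_row(3))               # 2 parameters in total
print(dummy_row(3, codes, ref=1))   # 1 + 9 parameters in total

# With numeric coding, every one-step change in the code shifts the log-odds
# by the same amount, whatever the (made-up) coefficients are.
coefs = [-2.0, 0.5]
step_1_to_2 = linear_predictor(numeric_row(2), coefs) - linear_predictor(numeric_row(1), coefs)
step_2_to_3 = linear_predictor(numeric_row(3), coefs) - linear_predictor(numeric_row(2), coefs)
print(step_1_to_2, step_2_to_3)
```

Entered as a number, the variable gets a single slope, so the model assumes the arbitrary codes are evenly spaced and ordered. Entered as dummies, each level gets its own comparison with the reference level, which is why the two sets of output do not match.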

UCLA has a series of links on interpreting SPSS output. You should look at them carefully.