Thread: Standard use of dummy variables fails in logistic regression

  1. #1
    I'm using a logistic regression to, among other things, adjust for the probability of a piece of evidence being true depending on its source. I have 13 different sources of evidence. Using standard dummy-variable techniques, I made 12 dummy variables, and let all zeroes for those 12 variables represent the most-common evidence source (SP). I use the R command

    glm0 <- glm(right ~ SP + NMPDR + NMPDR_member + PRK_Reviewed + PRK_Reviewed_member +
                PRK_Validated + PRK_Validated_member + PRK_Provisional + PRK_Provisional_member +
                char_curated + char_trusted + char_uniprot + manatee + len + len2 + len3 +
                perid + perid2 + perid3 + hmm0 + hmm1 + COILED0 + COILED1 + SIGNAL0 + SIGNAL1 +
                TRANSMEM0 + TRANSMEM1 + ACT_SITE + BINDING + CROSSLNK + DISULFID + METAL +
                DNA_BIND + MOTIF + NP_BIND + ZN_FING,
                family = binomial(link = "logit"), na.action = na.pass)

    This results in the following regression coefficients (leaving out all but the intercept and the dummy variables):

    Intercept => -3.384079,
    NMPDR => -0.1837298,
    NMPDR_member => 0.4721675,
    PRK_Reviewed => 1.129066,
    PRK_Reviewed_member => 1.713695,
    PRK_Validated => 1.283730,
    PRK_Validated_member => 9.173904,
    PRK_Provisional => 0.3952066,
    PRK_Provisional_member => 1.033555,
    char_curated => 0.4861278,
    char_trusted => 1.648290,
    char_uniprot => 0.8621442,
    manatee => 1.050488,

    I know how reliable all these sources are, so I know roughly what the values should be, and they all look right relative to each other. The problem is that the "all zeroes" case, SP, is the second-most-reliable source, just after PRK_Validated_member; I know this for certain, based on many years of experience and many different methods. So most of these coefficients should be negative, to indicate the source is less reliable than SP.

    If I add a 13th dummy variable for SwissProt, I get results that look correct for all variables:

    Intercept => -4.160557,
    SP => 2.605667,
    NMPDR => 0.4771072,
    NMPDR_member => 1.254777,
    PRK_Reviewed => 1.812168,
    PRK_Reviewed_member => 2.568408,
    PRK_Validated => 1.958619,
    PRK_Validated_member => 9.942846,
    PRK_Provisional => 1.081577,
    PRK_Provisional_member => 2.114823,
    char_curated => 1.496059,
    char_trusted => 2.326734,
    char_uniprot => 1.942905,
    manatee => 1.724580,

    But every single text on regression has dire warnings against using as many dummy variables as category values! Doing so means you have 1 more variable than you need to describe all cases. Supposedly this will have bad effects. But the data clearly shows that using 13 dummy variables gives good results; using only 12 gives bad results.
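
    One way to read the 13-dummy fit against SP without refitting is to subtract the SP coefficient from each of the others. Assuming the sources are mutually exclusive, the difference between two coefficients in the same fit is the log-odds contrast between those two sources, with the other terms held fixed. A minimal sketch in R using only the numbers quoted above (the names coefs13, rel_to_sp and better_than_sp are made up for illustration):

```r
# Coefficients copied from the 13-dummy fit quoted above.
coefs13 <- c(
  SP = 2.605667, NMPDR = 0.4771072, NMPDR_member = 1.254777,
  PRK_Reviewed = 1.812168, PRK_Reviewed_member = 2.568408,
  PRK_Validated = 1.958619, PRK_Validated_member = 9.942846,
  PRK_Provisional = 1.081577, PRK_Provisional_member = 2.114823,
  char_curated = 1.496059, char_trusted = 2.326734,
  char_uniprot = 1.942905, manatee = 1.724580
)

# Log-odds of each source relative to SP: subtract SP's coefficient.
rel_to_sp <- coefs13 - coefs13[["SP"]]
print(round(sort(rel_to_sp, decreasing = TRUE), 3))

# Names of the sources whose coefficient exceeds SP's.
better_than_sp <- names(rel_to_sp)[rel_to_sp > 0]
print(better_than_sp)
```

    With these numbers only PRK_Validated_member comes out above SP, and every other source gets a negative difference, which matches the ranking described in the post.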

  2. #2
    What you're doing is called overparameterization. You lose an extra degree of freedom, but the interpretations become easier. Using (k-1) indicator variables for a categorical variable with k levels makes the interpretation of the coefficients relative to the level that you didn't code.

    Overparameterization is very common. It isn't used often in OLS (because the interpretations are simple), but in generalized linear models it's used more often than not.
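
    For reference, the (k-1) coding described here is what R builds on its own when the source is stored as a single factor instead of hand-made 0/1 columns. A small sketch, assuming a hypothetical factor src with SP chosen as the reference level via relevel() (the three level names are a subset picked for illustration, and the data are invented):

```r
# Hypothetical source labels; in the real data this would be one column
# holding the evidence source of each row.
src <- factor(c("SP", "NMPDR", "manatee", "SP", "NMPDR"),
              levels = c("NMPDR", "manatee", "SP"))

# Make SP the reference (uncoded) level.
src <- relevel(src, ref = "SP")

# Design matrix: an intercept plus k-1 = 2 indicator columns.
X <- model.matrix(~ src)
print(colnames(X))
print(X)
```

    Rows with src == "SP" have zeros in both indicator columns, so in a fit built on this matrix the indicator coefficients are contrasts against SP, provided every row belongs to exactly one level.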

  3. #3
    Originally posted by squareandrare:
    What you're doing is called overparameterization. You lose an extra degree of freedom, but the interpretations become easier. Using (k-1) indicator variables for a categorical variable with k levels makes the interpretation of the coefficients relative to the level that you didn't code.
    Yes, I know that's what's supposed to happen. But it didn't. Look at the numbers. The interpretation with k-1 variables is NOT relative to the level that I didn't code. Why not?

  4. #4
    I guess that by letting all zeros in the other 12 variables represent SP, you cannot separate the grand mean from the SP effect. The intercept of your model is not just the SP effect; the grand mean is in there too. The estimates from your second model show that the grand mean (this time the intercept is the grand mean, because you have a separate dummy variable for SP) is strongly negative (-4.16). I think this is why the intercept estimate in your first model (SP effect plus grand mean) is lower than expected.
    The estimates of the dummy variables in your first model tell you the effects of those sources relative to the grand mean plus the SP effect, not relative to the SP effect alone.
    The estimates of the dummy variables in your second model tell you the effects of all those sources (including SP) relative to the grand mean.
    I think if you want to compare the other 12 sources with SP, you may want to have a dummy variable for SP. If you want to avoid overparameterization, you might also want to exclude the intercept from your model. There is an option in glm to do that. I don't quite remember how; you might want to look at the R documentation.
    I hope I make this clear.
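
    On dropping the intercept: in an R model formula this is written as `0 +` (or `- 1`) on the right-hand side. A sketch on invented toy data, assuming a hypothetical three-level factor src with SP as the reference level and made-up success rates of 8/10, 5/10 and 2/10:

```r
# Toy data: 10 rows per source, with invented success counts.
d <- data.frame(
  src   = factor(rep(c("SP", "A", "B"), each = 10), levels = c("SP", "A", "B")),
  right = c(rep(1, 8), rep(0, 2),   # SP: 8 of 10 right
            rep(1, 5), rep(0, 5),   # A:  5 of 10 right
            rep(1, 2), rep(0, 8))   # B:  2 of 10 right
)

# With intercept: the intercept is SP's log-odds, and srcA / srcB are
# contrasts against SP.
with_int <- glm(right ~ src, family = binomial, data = d)

# Without intercept: one coefficient per source, each its own log-odds.
no_int <- glm(right ~ 0 + src, family = binomial, data = d)

print(coef(with_int))
print(coef(no_int))
```

    Both fits give the same fitted probabilities; only the parameterization differs. In the toy example the two worse sources get negative coefficients in the with-intercept fit, which is the pattern the original poster expected to see relative to SP.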

  5. #5
    Originally posted by zyk:
    I guess that by letting all zeros in the other 12 variables represent SP, you cannot separate the grand mean from the SP effect. The intercept of your model is not just the SP effect; the grand mean is in there too. The estimates from your second model show that the grand mean (this time the intercept is the grand mean, because you have a separate dummy variable for SP) is strongly negative (-4.16). I think this is why the intercept estimate in your first model (SP effect plus grand mean) is lower than expected.
    I understand that the intercept in the first model is grand mean + SP effect, and I'm not bothered that the intercept is negative. But note that the coefficients for sources other than SP are positive, even when those sources are much worse than SP. I would expect any source known to be reliably worse than SP to have a negative coefficient. Only sources better than SP should have a positive coefficient in the first regression.

  6. #6
    BTW, the intercept is negative because there are 2 other variables in the regression that I didn't show, which take values ranging from 0 to 1. The probability of success is about 0.5 when these values are each around 0.6, and it's 0 when they are 0. The intercept is the output when all the variables are zero.

  7. #7
    My understanding of your model 1 and its estimates is:
    Without information from the other 12 sources, the mean probability is inverse-logit(-3.38) = 0.033. Adding information from another source (say NMPDR_member), the probability becomes inverse-logit(-3.38 + 0.47) = 0.052. That is, by adding NMPDR_member, your probability increases from 0.033 to 0.052. However, this does not tell you how much more reliable NMPDR_member is compared to SP, right? It just tells you that, given SP, adding NMPDR_member would improve the probability (or reliability if you like). That is, you are comparing NMPDR_member+SP to SP, and of course NMPDR_member+SP is more reliable than SP (assuming NMPDR_member is not a negative source).
    I guess, if you want to compare SP to NMPDR, you will need to be able to separate SP from the grand mean like you did in your model 2.
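
    The inverse-logit arithmetic for model 1 can be checked in R with plogis(), which is the logistic distribution function and therefore the inverse of the logit. A short sketch using the intercept and the NMPDR_member coefficient quoted in the first post, with all other covariates taken as zero (the variable names are made up for illustration):

```r
# Values copied from the 12-dummy fit in the first post.
intercept    <- -3.384079
nmpdr_member <- 0.4721675

# Predicted probability in the all-zero (baseline) case.
p_base <- plogis(intercept)                  # about 0.033

# Predicted probability with NMPDR_member = 1 and everything else zero.
p_nmpdr <- plogis(intercept + nmpdr_member)  # about 0.052

print(round(c(baseline = p_base, NMPDR_member = p_nmpdr), 4))
```

    These are probabilities with the unshown numeric covariates set to zero, so they describe a corner of the data and not a typical row.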

  8. #8
    Originally posted by zyk:
    However, this does not tell you how much more reliable NMPDR_member is compared to SP, right? It just tells you that, given SP, adding NMPDR_member would improve the probability (or reliability if you like). That is, you are comparing NMPDR_member+SP to SP, and of course NMPDR_member+SP is more reliable than SP (assuming NMPDR_member is not a negative source).
    SP and NMPDR_member are mutually exclusive. That's the standard usage of dummy variables. NMPDR_member=1 means not SP. (Were there a SP dummy variable, it would be set to 0.)
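
    That assumption can be checked directly on the data. If the 13 dummies really are mutually exclusive and exhaustive, every row has exactly one of them set; in that case glm() with an intercept cannot estimate all 13 separately and would normally report one of the coefficients as NA. Rows where none is set would land in the all-zero baseline of the 12-dummy model together with SP. A sketch on a toy data frame (d and its four rows are invented, and only three of the source columns are shown):

```r
# Toy stand-in for the real data: one 0/1 column per evidence source.
sources <- c("SP", "NMPDR", "manatee")
d <- data.frame(
  SP      = c(1, 0, 0, 0),
  NMPDR   = c(0, 1, 0, 0),
  manatee = c(0, 0, 1, 0)
)

# Number of source dummies set in each row; should be exactly 1 everywhere.
n_set <- rowSums(d[, sources])
print(table(n_set))

# Rows with no source set (these share the all-zero baseline with SP
# when the SP dummy is left out) and rows with more than one set.
unclassified <- unname(which(n_set == 0))
overlapping  <- unname(which(n_set > 1))
print(unclassified)
print(overlapping)
```

    In the toy frame the fourth row has no source set. Whether anything like that happens in the real data is not something the thread establishes; it is only one possibility worth ruling out.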

  9. #9
    I am just curious, could you show us your data?

  10. #10
    It's 200M, so that would be hard.
