
Thread: Binary Logistic Regression in R

  1. #16 (jerh)
    Quote Originally Posted by Rounds View Post
    Are your response columns the same order as mine? First is the absence of the event of interest, second is the presence of the event of interest.
    I see...mine were as shown above: first column (Promoted) is presence of the event, second column (Not Promoted) is the absence....

  2. #17 (jerh)
    So at the risk of sounding very dumb, here goes:

    The subgroup coefficient = -0.27978, and exp(-0.27978) = 0.756. So this is interpreted as: the odds of being promoted for a member of the subgroup are 0.756 times the odds for someone who is not...?

    The newMethod coefficient = 0.74351, and exp(0.74351) = 2.103. So the odds of being promoted for someone evaluated under the new criteria are 2.1 times the odds for someone who isn't...?
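
    (A minimal R sketch of these calculations, assuming the fitted glm object is named fit:)

    Code: 
    exp(coef(fit))            # odds ratios for each coefficient
    exp(confint.default(fit)) # Wald confidence intervals on the odds-ratio scale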

    Both of these appear significant, judging by the |z| values...?

    But the relatively small decrease in deviance between this model and the null indicates it's not a very good overall fit...?

    Assuming I'm not way off base, it seems that I may not be treating the new evaluation criteria correctly, since people who are not part of the subgroup are never going to be evaluated under the new criteria. Might it not be more appropriate to fit the data from before the change in criteria (with subgroup membership as the only factor), fit the data from after the change, and see if there is a significant change in the subgroup coefficient?
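
    (That comparison might look like this in R; a sketch assuming a counts data frame dat with a hypothetical period column marking before/after the change:)

    Code: 
    fit.before = glm(cbind(promoted, notpromoted) ~ subgroup,
                     data=subset(dat, period == "before"), family=binomial)
    fit.after  = glm(cbind(promoted, notpromoted) ~ subgroup,
                     data=subset(dat, period == "after"), family=binomial)
    # or, equivalently, test the change in the subgroup effect as an interaction:
    fit.int = glm(cbind(promoted, notpromoted) ~ subgroup * period,
                  data=dat, family=binomial)
    summary(fit.int)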

  3. #18 (TheEcologist)
    Quote Originally Posted by jerh View Post
    So at the risk of sounding very dumb, here goes: [...]
    You don’t sound dumb at all.

    The best way for you to get the model predictions is not to use the equation that you used; rather, use (in R):

    predict(modelname, type='response', se.fit=TRUE)  # this also gives you the standard errors
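
    (For instance, a minimal sketch of using the returned values, assuming your fitted model is named yourmodel:)

    Code: 
    pr = predict(yourmodel, type='response', se.fit=TRUE)
    pr$fit     # predicted promotion probabilities
    pr$se.fit  # standard errors on the probability scale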

    The model is indeed not a very good fit at all. A simple and straightforward way to assess the model fit is this:

    nullmodel = glm(response ~ 1, data=dat, family=binomial)
    # McFadden's pseudo R-squared (the rough equivalent of R-squared for a GLM)
    as.numeric(1 - logLik(yourmodel)/logLik(nullmodel))


    Why don't you first test for a difference between the maingroup and the subgroup, and then test the newcriteria effect within the subgroup separately (without the maingroup)? You should especially do this if it improves the model fits.
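
    (A sketch of those two fits, assuming a counts data frame dat with the columns used earlier and subgroup coded 0/1:)

    Code: 
    # maingroup vs subgroup
    fit.group = glm(cbind(promoted, notpromoted) ~ subgroup, data=dat, family=binomial)
    # newcriteria within the subgroup only
    fit.crit = glm(cbind(promoted, notpromoted) ~ newcriteria,
                   data=subset(dat, subgroup == 1), family=binomial)
    summary(fit.group); summary(fit.crit)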

    Maybe it's a good idea to first write down your hypotheses clearly and in a testable manner (one to two sentences). That way we might be able to decide what the best course is.

  4. #19 (TheEcologist)
    Quote Originally Posted by TheEcologist View Post
    Why don't you first test for a difference between the maingroup and the subgroup, and then test the newcriteria effect within the subgroup separately (without the maingroup)? You should especially do this if it improves the model fits.

    Running the analysis like this, I find no difference between the maingroup and the subgroup (neither in significance nor in the AIC scores between the null and the model). Ergo, no difference.

    Within the subgroup (the analysis without the maingroup), however, I find a significant difference between the new criteria and the 'rest'. The model fit (pseudo R-squared) improves to 0.19.

  5. #20 (jerh)
    Thanks for all your help. I'm about to get yanked into something that I suspect will take the rest of today, but over the weekend I'm going to try following your suggestions for this data set and the two others I have... hopefully next week I can post something more meaningful with regard to interpretation.

    Thanks again!

  6. #21 (jerh)
    That other task is taking significantly longer than I expected.

    I managed to recreate your results with regard to maingroup vs subgroup, and subgroup before and after the change. I have some concerns, though, with the way I'm looking at the data. Earlier you had mentioned clearly stating the question I'm trying to answer. Maybe that will help.

    The question, as posed to me, was "Did the evaluation changes made in 2005 improve the promotion rate of candidates in the subgroup as compared to candidates from outside the subgroup?" To me, the requisite first question is "was there a significant difference in the promotion rate between the two groups before the evaluation changes?" I think the answer to that question is no, but I need to be able to clearly support that.

    Here's my big concern. Each of the aggregate data sets represents the results of an annual (occasionally semiannual) promotion review board. Promotion rates vary across the entire population from board to board, so what we're really looking at is the difference between the rate for the subgroup and the rate for the others within a given board. So when I aggregate all the data, don't I lose that correlation?

    For instance, when running the regression on just the subgroup using the before/after factor, there's no accounting for the fact that in the two years the new criteria were in place, the promotion rates across the entire population were higher than average, right? So the new criteria factor looks like it's associated with a better chance of being promoted, but it may just as well be that in those two years more people were being promoted, full stop. Would this be (and I'm scared to type these words) some kind of multinomial regression where year becomes a factor with multiple levels?

    You guys really should get paid for reading all this....

  7. #22 (jerh)
    I think I finally exhausted everyone's patience!

    Just out of curiosity, before anyone mentioned binary logistic regression, my first stab at these numbers looked like this:

    Code: 
    Population Promotion Rate   Subgroup Rate   Delta
    87.9%                       89.3%            1.3906%
    88.5%                       85.6%           -2.8324%
    92.5%                       92.4%           -0.0566%
    92.4%                       92.3%           -0.1837%
    93.7%                       87.9%           -5.7765%
    93.0%                       91.5%           -1.5063%
    92.6%                       90.8%           -1.8585%
    93.2%                       94.9%*           1.7134%
    94.2%                       94.9%*           0.7178%
    
    * = new criteria in place
    
    First 7 cycles (old criteria)
    Delta mean: -1.5462%
    Std dev: 0.02321958
    95% CI on delta mean: [-3.2660%, 0.1740%]   Contains 0, so you can't say they're different
    90% CI on delta mean: [-2.9897%, -0.1026%]  Doesn't contain 0, so you can say the delta is negative at the 90% confidence level
    
    Last 2 cycles (new criteria)
    Delta mean: 1.2156%
    Std dev: 0.007039628
    95% CI on delta mean: [0.2400%, 2.1912%]    Doesn't contain 0, so you can say the delta is positive at the 95% confidence level
    While this is the answer that I know management wants -- the subgroup's rate was below the rest of the population's before, and is above it now -- I'm really uncomfortable making such a statement based on 7 and 2 data points respectively. The confidence intervals were constructed using Excel's CONFIDENCE function...
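
    (For comparison, a sketch reproducing those intervals in R from the deltas in the table above; note that t.test uses the t distribution where Excel's CONFIDENCE function uses the normal, so with n = 7 the intervals come out noticeably wider:)

    Code: 
    delta.old = c(1.3906, -2.8324, -0.0566, -0.1837, -5.7765, -1.5063, -1.8585)
    t.test(delta.old)$conf.int                     # 95% CI on the pre-change delta mean
    t.test(delta.old, conf.level = 0.90)$conf.int  # 90% CI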

  8. #23
    I think your concerns are very valid. You have to choose between assuming there's no difference between years and pooling the data, or grouping it (e.g. let X = mainrate - subgrouprate) and then dealing with only 9 observations. The latter is, I think, clearly insufficient, so let's think about whether you can reasonably assume away inter-year differences.

    Unfortunately, you can't claim that there's no omitted-variable bias on the grounds that the omitted variable is uncorrelated with the regressors, since year is clearly correlated with new criteria. So you need to be able to argue that year is insignificant. Since you've got a bunch of individual data for each year, you could try this: drop the last two years, then run a logit of promotion (yes or no) on year dummies and a group dummy. If the year dummies are jointly insignificant, you could argue that year doesn't really matter, at least in the first 7 years. Then you can argue that a significant coefficient on newcriteria in a logit regression with just subgroup and newcriteria (which should be 1 for everyone in those years, btw) is actually due to the new criteria. It's not water-tight, but it's not unreasonable. Just basing it on 7 vs 2 data points is unreasonable, I'd say.
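
    (In R, that joint test might look like this; a sketch assuming a data frame pre with one row per candidate and columns promoted (0/1), year, and subgroup, restricted to the seven pre-change years:)

    Code: 
    fit0 = glm(promoted ~ subgroup, data=pre, family=binomial)
    fit1 = glm(promoted ~ subgroup + factor(year), data=pre, family=binomial)
    anova(fit0, fit1, test="Chisq")  # joint likelihood-ratio test of the year dummies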

  9. #24 (Rounds)
    He has plenty of data, because he has at least 2 promotion boards per factor-level combination, and each has observations in the hundreds.

    The reason I didn't want to mess with it before is that promotion board isn't a fixed effect with 9 levels. It's a random effect, and all of this is nested in it.

  10. #25 (Rounds)
    So I was still interested in this problem, but for the wrong reasons. I believe this is a GLMM, which can be fit by lmer in the lme4 package; that's what made it interesting to me. But the wrong reason is that I have no experience with GLMMs, lmer, or lme4. So that's not helpful.

    But I thought about what I would do if someone held a gun to my head.

    My model has fixed effects for the subgroup and newcriteria levels, and a random intercept grouped by cycle.

    That leads to the model in R of:
    promoted ~ subgroup+newcriteria + (1|cycle)

    My code was this:
    Code: 
    library(lme4)
    setwd('C:/Users/Jeremiah/Documents/Fall08/Research/lme/')
    data = read.table("data.txt")
    colnames(data) = c("cycle", "promoted", "notpromoted", "subgroup", "newcriteria")
    data$cycle = factor(data$cycle)
    data$subgroup = factor(data$subgroup)
    data$newcriteria = factor(data$newcriteria)
    summary(data)
    response = cbind(data$promoted, data$notpromoted)  # first column is the "success" count
    fit1 = lmer(response ~ subgroup + newcriteria + (1|cycle), data=data, family=binomial)  # note: current lme4 uses glmer() for binomial models
    And that leads to this summary:
    Code: 
    Generalized linear mixed model fit by the Laplace approximation 
    Formula: response ~ subgroup + newcriteria + (1 | cycle) 
       Data: data 
       AIC   BIC logLik deviance
     44.08 47.64 -18.04    36.08
    Random effects:
     Groups Name        Variance Std.Dev.
     cycle  (Intercept) 0.056011 0.23667 
    Number of obs: 18, groups: cycle, 9
    
    Fixed effects:
                 Estimate Std. Error z value Pr(>|z|)    
    (Intercept)   2.46866    0.08453  29.206   <2e-16 ***
    subgroup1    -0.21176    0.09702  -2.183   0.0291 *  
    newcriteria1  0.46537    0.25136   1.851   0.0641 .  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    Correlation of Fixed Effects:
                (Intr) sbgrp1
    subgroup1   -0.109       
    newcriteri1 -0.002 -0.381
    Based on that, you have statistically significant evidence that subgroup 1 is less likely to be promoted, but the evidence that the new criteria have a non-zero fixed effect is marginal. This is kind of a 'more data' situation: another cycle might tilt it one way or the other.

    I am not really sure about the diagnostics. Like I said, this is new turf for me. Seek out an expert on GLMMs.

    BTW that number of observations line is kind of misleading in this context. Consider:

    Code: 
    # reshape to help lmer out: one observation per row
    newdata = matrix(0, sum(data$promoted) + sum(data$notpromoted), ncol(data)-1)
    index = 1
    for (i in 1:nrow(data)) {
        promoted = data$promoted[i]
        notpromoted = data$notpromoted[i]
        newrow = as.numeric(as.matrix(data[i,]))
        newrow = newrow[-3]   # drop the notpromoted count column
        newrow[2] = 1         # level for promoted
        for (j in 1:promoted) {
            newdata[index,] = newrow
            index = index + 1
        }
        newrow[2] = 0         # level for notpromoted
        for (j in 1:notpromoted) {
            newdata[index,] = newrow
            index = index + 1
        }
    }
    
    newdata = as.data.frame(newdata)
    colnames(newdata) = colnames(data)[-3]
    newdata$cycle = factor(newdata$cycle)
    newdata$promoted = factor(newdata$promoted)
    newdata$subgroup = factor(newdata$subgroup)
    newdata$newcriteria = factor(newdata$newcriteria)
    summary(newdata)
    
    fit = lmer(promoted ~ subgroup + newcriteria + (1|cycle), data=newdata, family=binomial)
    summary(fit)
    summary(fit)
    So now there is one observation per line, with this summary:
    Code: 
    Generalized linear mixed model fit by the Laplace approximation 
    Formula: promoted ~ subgroup + newcriteria + (1 | cycle) 
       Data: newdata 
      AIC  BIC logLik deviance
     9460 9491  -4726     9452
    Random effects:
     Groups Name        Variance Std.Dev.
     cycle  (Intercept) 0.056011 0.23667 
    Number of obs: 17081, groups: cycle, 9
    
    Fixed effects:
                 Estimate Std. Error z value Pr(>|z|)    
    (Intercept)   2.46866    0.08453  29.206   <2e-16 ***
    subgroup1    -0.21176    0.09702  -2.183   0.0291 *  
    newcriteria1  0.46508    0.25133   1.850   0.0642 .  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
    Correlation of Fixed Effects:
                (Intr) sbgrp1
    subgroup1   -0.109       
    newcriteri1 -0.002 -0.381
    You'll note that even the variance and std. dev. estimates for the random effect are exactly the same, yet the number of observations has jumped to 17081. It is interesting that the likelihood has changed, which suggests something different is being done internally even though the estimates are exactly the same. (The aggregated binomial likelihood includes binomial coefficient terms that the one-row-per-person Bernoulli likelihood lacks, so the log-likelihood, deviance, and AIC shift by a constant while the parameter estimates and standard errors are unaffected.)

    Edit and PS: According to MASS, the first column in the response matrix is considered "success", which is different from what I said earlier =) And it's also confusing because if you ever use the factor version of the response (a column of 1s and 0s), the first factor level is considered failure and everything else is considered success. That's how I messed it up in my head.
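
    (To illustrate the two conventions; a sketch with hypothetical vectors successes, failures, y, and x:)

    Code: 
    # two-column matrix response: the FIRST column counts successes
    glm(cbind(successes, failures) ~ x, family=binomial)
    # factor (0/1) response: the FIRST level is failure, everything else is success
    glm(factor(y) ~ x, family=binomial)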
    Last edited by Rounds; 12-16-2008 at 04:03 AM.

