
Thread: Binary Logistic Regression in R

  1. #1 (jerh)
    Warning: this is going to be a long post.
    Acknowledgement: Thanks to The Ecologist for getting me started on what I hope is the right road.

    This is a continuation of a question I originally posted here concerning resampling. Thanks to the answers this site's members have provided I've moved away from that, so I thought it might be a good idea to re-post with a more appropriate title.

    What I've been asked to do is look at some historical promotion data and see if changes made in the evaluation criteria for some candidates have been successful in improving their promotion rate. My data is for three different organizational levels. Here it is:

    Code: 
    Organizational Level 1
    Promotion Cycle   Promoted   Not Promoted   Subgroup   NewCriteria
    1                     1307            180          0             0
    2                     1481            193          0             0
    3                     1213             99          0             0
    4                     1444            118          0             0
    5                     1668            113          0             0
    6                     1782            134          0             0
    7                     1579            126          0             0
    8                     1811            133          0             0
    9                     1848            114          0             0
    1                      175             21          1             0
    2                      161             27          1             0
    3                      158             13          1             0
    4                      155             13          1             0
    5                      203             28          1             0
    6                      183             17          1             0
    7                      157             16          1             0
    8                      185             10          1             1
    9                      205             11          1             1
    I'll spare you the other two data sets.
    Note the first column is "promotion cycle" and repeats: the first 9 lines are for the general corporate population minus the subgroup, and the next 9 lines are the corresponding numbers for the subgroup. Make sense?

    While my original reaction was that there a) isn't enough data here and b) there are way too many other factors influencing promotion rates, that answer wasn't acceptable to company management. So I'm trying to make the best of it....

    Now, The Ecologist suggested using R to perform a binary logistic regression on this data. The only regression I've ever done is OLS and I'd never heard of R. So I've obtained a copy of R and checked out Hosmer and Lemeshow's "Applied Logistic Regression" from the local library, but I'm not sure I have the time to learn all of this on my own and still meet management's deadline.

    So if anyone has any pointers...actual R command syntax is greatly appreciated.

    My first instinct is to look solely at the numbers before the criteria change, to see if there is any statistical support for the perception that the members of the subgroup were being promoted at a lower rate and thus even needed a change in criteria. I'm unsure how to then proceed to evaluate the impact of the criteria change, particularly since there are only two promotion cycles with the changes in place (for the other two organizational levels there are three cycles with the changes in place).

    If you're still reading...thanks.
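
    For reference, a minimal sketch of entering the Level 1 data above directly as a data frame in R (the column names here are just one choice; later replies paste the data in with read.table instead):

    Code: 
    # Level 1 data from the table above, one row per cycle x group
    lvl1 <- data.frame(
      Cycle       = rep(1:9, 2),
      Promoted    = c(1307, 1481, 1213, 1444, 1668, 1782, 1579, 1811, 1848,
                       175,  161,  158,  155,  203,  183,  157,  185,  205),
      NotPromoted = c( 180,  193,   99,  118,  113,  134,  126,  133,  114,
                        21,   27,   13,   13,   28,   17,   16,   10,   11),
      Subgroup    = rep(c(0, 1), each = 9),
      NewCriteria = c(rep(0, 16), 1, 1)
    )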

  2. #2 (Rounds)
    That's a huge amount of data for a logistic regression. Note that each individual promotion is what matters when using logistic regression to estimate failure/success probabilities. You have over 10,000 observations.

    For that data, assuming you got it properly loaded into a data frame named "data", you would do:

    response = c(data$NotPromoted, data$promoted)
    fit = glm(response ~ Subgroup*NewCriteria, data=data, family=binomial)

    summary(fit)
    plot(fit)

  3. #3
    one thing that will help you out a lot with R is the ? command. at the command prompt, you can type

    ? glm

    for example, and a good help file will open up on generalized linear models. these usually have good examples in them. for a binary logistic regression, you'll use this glm() function.

    if y is your dependent variable and x1, x2 are independent variables, a basic syntax looks like:

    my.model <- glm(y ~ x1 + x2, family=binomial(link = "logit"))

    if you have a data object already in R with named columns, e.g. 'subgroup' or 'promoted', you could use

    my.model <- glm(subgroup ~ promoted, family=binomial(link = "logit"), data=my.data)

    have you got your data imported into R yet?

  4. #4 (TheEcologist)
    Quote Originally Posted by Rounds View Post
    That's a huge amount of data for a logistic regression. Note that each individual promotion is what matters when using logistic regression to estimate failure/success probabilities. You have over 10,000 observations.
    Not necessarily: each year is a trial, not each person. So you use the numbers of successes and failures for each year, and for 10 years you have 10 'replicates' even though each year could consist of millions of trials.

    The syntax would be (for instance):

    #copy your data from something like Excel:

    dat = read.table('clipboard', header=T)

    # two-column response: successes first, failures second
    response = cbind(success = dat$Promoted,
                     fail    = dat$NotPromoted)

    fit = glm(response ~ Subgroup + NewCriteria, data=dat, family=binomial)
    summary(fit)
    plot(fit)

    # note the DF....

  5. #5 (jerh)
    You guys should both be commended for working with such a novice!

    I loaded my first set of data into R by pasting from Excel, and used the fit command jahred first provided
    Code: 
    fit = glm(response ~ Subgroup*NewCriteria, data=data, family=binomial)
    And received the following error:

    Code: 
    Error in model.frame.default(formula = response ~ Subgroup * NewCriteria, :
          variable lengths differ (found for 'Subgroup')
    Then I noticed that in subsequent posts you had both used "+" instead of "*", so I tried that, but got the same result. Then I thought perhaps there were some artifacts from Excel (unprintable characters, etc.), but I found the data editor in R and everything looks clean. So where might I be goofing up now?

  6. #6 (TheEcologist)
    Quote Originally Posted by jerh View Post
    Code: 
    Error in model.frame.default(formula = response ~ Subgroup * NewCriteria, :
          variable lengths differ (found for 'Subgroup')
    My syntax seems to work though. Try typing:

    names(yourdata)

    #do the names correspond to your column names in Excel?

  7. #7 (jerh)
    Your syntax worked....from a quick glance at the help pages I think the problem was with c versus cbind...maybe?

    I'll post the results here in a few minutes....unfortunately, in order to get permission to use a "non-approved" piece of software (R) I had to install it on a machine that's not on the corporate network...so I have to manually retype R output here in order to post....isn't technology wonderful.

  8. #8 (Rounds)
    Your statistical analysis should be the same whether you do it as a weighted binomial or as an unweighted binomial with individual lines in the design matrix, provided you discard the cycle predictor variable. I did btw run the code and saw the degrees of freedom; the only thing I can conclude is that they are misleading in this context.


    For example I reshaped the data to this:
             Promoted  NotPromoted  Subgroup  NewCriteria
    newrow1     14133         1210         0            0
    newrow2      1192          135         1            0
    newrow3       390           21         1            1


    I fit the model, and the fitted coefficients, z-values, standard errors and p-values are *exactly* the same.

    The line that you question now reports:
    Degrees of Freedom: 2 Total (i.e. Null); 0 Residual
    Null Deviance: 13.46
    Residual Deviance: -9.259e-14 AIC: 26.33


    But remember, the p-values of the coefficients are *exactly* the same. How would that be possible if you claim there is now *zero* replication?

    I have no idea what R is doing when it reports the residual degrees of freedom for a logistic regression. But I do know this: you can have one line per observation in the design matrix, or you can summarize the observations into their unique covariate combinations and weight the regression by the number of observations per combination, and the logistic regression is the same.

  9. #9 (Rounds)
    Quote Originally Posted by jerh View Post
    Your syntax worked....from a quick glance at the help pages I think the problem was with c versus cbind...maybe?
    Yeah, that's exactly right. I noticed it when I was investigating this residual degrees of freedom thing. Also, pay attention to which outcome goes in the first column of the response matrix, since that is the outcome the model treats as the event (in my fit the first column was the not-promoted counts). The other way to do the same thing is to use weights= as a glm argument, but this syntax is more intuitive.

    Do note however that I discarded cycle. That could conceivably be relevant; I didn't want to talk about it because it seems non-trivial. Also, the interaction of Subgroup and NewCriteria has issues because we never see the effect of NewCriteria without Subgroup = 1.

    And lastly, since you're a novice, an easy mistake here is not turning your subgroup and criteria variables into "factors".

    You can tell with summary(data): if a column is a factor it will show counts for each level instead of quantile information.

    You can factor a column as
    data$Subgroup = factor(data$Subgroup)

    This changes how glm constructs the design matrix.
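
    A minimal sketch of the equivalence discussed above, assuming a data frame called dat with Promoted, NotPromoted, Subgroup and NewCriteria columns (the names are assumptions): the two-column count response and the weights= form give the same coefficients, standard errors and p-values.

    Code: 
    # 1) two-column response of counts, as used earlier in the thread
    fit1 <- glm(cbind(Promoted, NotPromoted) ~ Subgroup + NewCriteria,
                data = dat, family = binomial)

    # 2) proportion promoted as the response, weighted by the number of trials
    dat$Trials <- dat$Promoted + dat$NotPromoted
    fit2 <- glm(Promoted / Trials ~ Subgroup + NewCriteria,
                data = dat, weights = Trials, family = binomial)

    # the coefficient tables agree
    summary(fit1)$coefficients
    summary(fit2)$coefficients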

  10. #10 (jerh)
    Based on the fit I ran before seeing any of Rounds' posts, here's what came back. I won't embarrass myself by trying to interpret these yet.... I'll have to spend some time with Hosmer and Lemeshow first...

    Code: 
    summary(fit)

    Deviance residuals:
         Min        1Q    Median        3Q       Max
     -5.6312   -0.1894    0.4753    1.1406    3.5666

    Coefficients:
                  Estimate  Std. Error  z value  Pr(>|z|)
    (Intercept)    2.45789     0.02995   82.057    <2e-16  ***
    Subgroup      -0.27978     0.09562   -2.926   0.00343  **
    newMethod      0.74351     0.24172    3.076   0.00210  **

    Null deviance:     104.989  on 17 degrees of freedom
    Residual deviance:  91.525  on 15 degrees of freedom

    AIC: 198.13

  11. #11 (Rounds)
    Yeah, you committed that error I mentioned. I had to rewrite an entire midterm at the last minute once because of it.

    If you had done it correctly it would show the factor level next to the coefficient as part of the name:

    Code: 
                 Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  -2.45789    0.02995 -82.057  < 2e-16 ***
    Subgroup1     0.27978    0.09562   2.926  0.00343 ** 
    NewCriteria1 -0.74351    0.24172  -3.076  0.00210 **
    Note Subgroup1 versus Subgroup: without the factor level appended to the coefficient name, that is your cue that you fit Subgroup as a continuous covariate rather than as a factor.

    On a side note, I just noticed it doesn't seem to matter here. Edit: it would have mattered if there had been three levels to the factor or if the levels had been something besides 0 and 1.
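
    A quick sketch of that point, with an assumed three-level grouping variable g and assumed success/failure counts s and f: treating g as numeric forces a single slope, while factor(g) gives one coefficient per non-reference level.

    Code: 
    g <- c(0, 1, 2)          # three groups
    s <- c(90, 80, 60)       # successes per group
    f <- c(10, 20, 40)       # failures per group

    fit_numeric <- glm(cbind(s, f) ~ g,         family = binomial)  # one slope
    fit_factor  <- glm(cbind(s, f) ~ factor(g), family = binomial)  # two level effects

    coef(fit_numeric)
    coef(fit_factor)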

  12. #12 (TheEcologist)
    Quote Originally Posted by Rounds View Post
    On a side note, I just noticed it doesn't seem to matter here. Edit: it would have mattered if there had been three levels to the factor or if the levels had been something besides 0 and 1.
    Exactly, it doesn't matter here... but I think you are right, Rounds: you should always use factors when appropriate.

  13. #13 (jerh)
    Quote Originally Posted by Rounds View Post
    Do note however that I discarded cycle. That could conceivably be relevant; I didn't want to talk about it because it seems non-trivial. Also, the interaction of Subgroup and NewCriteria has issues because we never see the effect of NewCriteria without Subgroup = 1.
    Those are both questions I figured I'd get to later, but maybe they're best dealt with early. In this case "NewCriteria" is only applied to the subgroup, as a means to address a perceived gap in the promotion rate for the people in that subgroup. Once I feel more comfortable with this methodology I'd actually like to look at the data from before the criteria change and see if it supports the perception that the members of the subgroup were being promoted at a lower rate. I'm not convinced....

    Ultimately, I would expect the cycle to be relevant, since the "promotion opportunity" varies from cycle to cycle. I was worried that pooling the data without regard to cycle might be an issue.... On the other hand, if you simply consider the annual rate and compare (like a paired t-test) you only have 9 observations, and only 2 of those pertain to the new criteria.
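
    One way to sketch that pre-change comparison in R, assuming the column names used earlier in the thread (Cycle, Promoted, NotPromoted, Subgroup, NewCriteria are assumed names):

    Code: 
    # keep only the cycles before the criteria change
    pre <- subset(dat, NewCriteria == 0)

    # does Subgroup alone predict the odds of promotion in the old-criteria data?
    fit_pre <- glm(cbind(Promoted, NotPromoted) ~ Subgroup,
                   data = pre, family = binomial)
    summary(fit_pre)

    # if cycle-to-cycle variation matters, add cycle as a factor and compare fits
    fit_pre_cycle <- glm(cbind(Promoted, NotPromoted) ~ Subgroup + factor(Cycle),
                         data = pre, family = binomial)
    anova(fit_pre, fit_pre_cycle, test = "Chisq")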

  14. #14 (jerh)
    Quote Originally Posted by Rounds View Post
    Code: 
                 Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  -2.45789    0.02995 -82.057  < 2e-16 ***
    Subgroup1     0.27978    0.09562   2.926  0.00343 ** 
    NewCriteria1 -0.74351    0.24172  -3.076  0.00210 **
    Again, forgive me if this is a naive question, but it does seem that the signs have switched around. The absolute values are the same, but my intercept and newCriteria coefficients were positive while my Subgroup coefficient was negative; yours are the reverse.

  15. #15 (Rounds)

    Are your response columns in the same order as mine? Mine had the absence of the event of interest first and the presence second, which would flip the signs of all the coefficients.
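
    A small sketch of why the column order matters, assuming a data frame dat with Promoted and NotPromoted count columns: swapping the two response columns models the complementary event, so every coefficient simply changes sign.

    Code: 
    # model the odds of being promoted
    fit_promoted    <- glm(cbind(Promoted, NotPromoted) ~ Subgroup + NewCriteria,
                           data = dat, family = binomial)

    # model the odds of NOT being promoted (columns swapped)
    fit_notpromoted <- glm(cbind(NotPromoted, Promoted) ~ Subgroup + NewCriteria,
                           data = dat, family = binomial)

    coef(fit_promoted)     # e.g. positive intercept: log-odds of promotion
    coef(fit_notpromoted)  # same magnitudes, opposite signs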
