# Binary Logistic Regression in R

#### jerh

##### New Member
Warning: this is going to be a long post.
Acknowledgement: Thanks to The Ecologist for getting me started on what I hope is the right road.

This is a continuation of a question I originally posted here concerning resampling. Thanks to the answers this site's members have provided I've moved away from that, so I thought it might be a good idea to re-post with a more appropriate title.

What I've been asked to do is look at some historical promotion data and see if changes made in the evaluation criteria for some condidates have been succesful in improving their promotion rate. My data is for three different organizational levels. Here it is:

Code:
Organizational Level 1
Promotion Cycle Promoted     Not Promoted    Subgroup     NewCriteria
1                     1307           180                  0               0
2                     1481           193                  0               0
3                     1213           99                   0               0
4                     1444           118                  0               0
5                     1668           113                  0               0
6                     1782           134                  0               0
7                     1579           126                  0               0
8                     1811           133                  0               0
9                     1848           114                  0               0
1                     175             21                   1               0
2                     161             27                   1               0
3                     158             13                   1               0
4                     155             13                   1               0
5                     203             28                   1               0
6                     183             17                   1               0
7                     157             16                   1               0
8                     185             10                   1               1
9                     205             11                   1               1
I'll spare you the other two data sets :shakehead
Note the first column is "promotion cycle" and repeats...the first 9 lines are for the general corporate population minus the subgroup, the next 9 lines are the corresponding numbers for the subgroup. Make sense?

While my original reaction was that there a) isn't enough data here and b) there are way too many other factors influencing promotion rates, that answer wasn't acceptable to company management. So I'm trying to make the best of it....

Now, The Ecologist suggested using R to perform a binary logistic regression on this data. The only regression I've ever done is OLS and I'd never heard of R. So I've obtained a copy of R and have checked out Hosmer and Lemeshow's "Applied Logistic Regression" from the local library, but I'm not sure I have the time to learn all of this on my own and still meet managment's deadline.

So if anyone has any pointers...actual R command syntax is greatly appreciated.

My first instinct is to look solely at the numbers before the criteria change, to see if there is any statistical support to the perception that the members of the subgroup were being promoted at a lower rate and thus even needed a change in criteria. I'm unsure as to how to then proceed to evaluate the impact of the criteria change, particulalry since there are only two promotion cycles with the changes in place (for the other two organizational levels there are three cycles with the changes in place).

#### Rounds

##### New Member
Thats a huge amount of data for a logistic regression. Note each individual promotion is what is important in using logistic regression to estimate failure/success probabilities. You have over 10000 observations.

For that data assuming you got it properly loaded into a data frame named "data" you would do:

response = c(data$NotPromoted, data$promoted)
fit = glm(response ~ Subgroup*NewCriteria, data=data, family=binomial)

summary(fit)
plot(fit)

#### jahred

##### New Member
one thing that will help you out a lot with R is the ? command. at the command prompt, you can type

? glm

for example, and a good help file will open up on generalized linear models. these usually have good examples in them. for a binary logistic regression, you'll use this glm() function.

if y is your dependent variable and x1, x2 are independent variables, a basic syntax looks like:

my.model <- glm(y ~ x1 + x2, family=binomial(link = "logit"))

if you have a data object already in R with named columns i.e. 'subgroup' or 'promoted', you could use

my.model <- glm(subgroup ~ promoted, family=binomial(link = "logit"), data=my.data)

have you got your data imported into R yet?

#### TheEcologist

##### Global Moderator
Thats a huge amount of data for a logistic regression. Note each individual promotion is what is important in using logistic regression to estimate failure/success probabilities. You have over 10000 observations.

For that data assuming you got it properly loaded into a data frame named "data" you would do:

response = c(data$NotPromoted, data$promoted)
fit = glm(response ~ Subgroup*NewCriteria, data=data, family=binomial)

summary(fit)
plot(fit)
Not necessarily, each year is a trail not each person. So you use the amounts of success & failures for each year. So for 10 years you have 10 ‘replicates’ even though each year could consist of millions of trials.

The syntax would be (for instance):

#copy your data from something like excel:

response = cbind(success=dat$CyclePromoted, fail=dat$NotPromoted)

fit = glm(response ~ Subgroup+NewCriteria, data=dat, family=binomial)
summary(fit)
plot(fit)

# note the DF....

#### jerh

##### New Member
You guys should both be commended for working with such a novice!

I loaded my first set of data in R via pasting from Excel, and used the fit command jahred first provided
Code:
fit = glm(response ~ Subgroup*NewCriteria, data=data, family=binomial)

Code:
Error in model.frame.default(formula = response ~ Subgroup * NewCriteria, :
variable lengths differ (found for 'Subgroup')
Then I noticed that in subsequent posts you had both used "+" instead of "*" so I tried that, but got the same result. Then I thought perhaps there were some artifcats from Excel (unprintable characters, etc) but I found the data editor in R and everything looks clean. So where might I be goofing up now?

#### TheEcologist

##### Global Moderator
You guys should both be commended for working with such a novice!

I loaded my first set of data in R via pasting from Excel, and used the fit command jahred first provided
Code:
fit = glm(response ~ Subgroup*NewCriteria, data=data, family=binomial)

Code:
Error in model.frame.default(formula = response ~ Subgroup * NewCriteria, :
variable lengths differ (found for 'Subgroup')
Then I noticed that in subsequent posts you had both used "+" instead of "*" so I tried that, but got the same result. Then I thought perhaps there were some artifcats from Excel (unprintable characters, etc) but I found the data editor in R and everything looks clean. So where might I be goofing up now?
My syntax seems to work though. Try typing:

names(yourdata)

#do the names correspond to your column names in excel?

#### jerh

##### New Member
Your syntax worked....from a quick glance at the help pages I think the problem was with c versus cbind...maybe?

I'll post the results here in a few minutes....unfortunately, in order to get permission to use a "non-approved" piece of software (R) I had to install it on a machine that's not on the corporate network...so I have to manually retype R output here in order to post....isn't technology wonderful.

#### Rounds

##### New Member
Your statistical analysis should be the same if you do it as a weighted binomial or a unweighted binomial with individual lines in the design matrix provided you discarded the cycle predictor variable. I did btw run the code and saw the degrees of freedom, the only thing I can conclude is that it is misleading in this context.

For example I reshaped the data to this:
Promoted NotPromoted Subgroup NewCriteria
newrow1 14133 1210 0 0
newrow2 1192 135 1 0
newrow3 390 21 1 1

I fit the model, and the the fit for the coefficients, z-values, and standard errors and the p-values are *exactly* the same.

The line that you question now reports:
Degrees of Freedom: 2 Total (i.e. Null); 0 Residual
Null Deviance: 13.46
Residual Deviance: -9.259e-14 AIC: 26.33

But remember the p-values of the coefficient are *exactly* the same. How would that be possible if you claim there is now *zero* replication.

I have no idea what R is doing when it reports the residual degrees of freedom for a logistic regression. But I do know this, you could have 1 line per an observation in the design matrix or you can summarize them into their unique covariates and weight the regression by the number of observations per a unique covariate and the logistic regression is the same.

#### Rounds

##### New Member
Your syntax worked....from a quick glance at the help pages I think the problem was with c versus cbind...maybe?
Yeah thats exactly right. I noticed when I was investigating this residual degrees of freedom thing. Also always note the failure case is the first column. The other way to do the same thing is to use weights= as a glm argument, but this syntax is more intuitive.

Do note however I discarded cycle. That actually could conceivably be relevant; I didn't want to talk about it cause it seems non-trivial. Also the interaction of subgroup and NewCriteria has issues because we never see the effect of newCriteria without subgroup 1.

And lastly since your a novice an easy mistake here is to not turn your subgroup and criteria variables into "factors".

You can tell with summary(data). It should have counts for each level of the factor instead of a quantile information if the column is a factor.

You can factor a column as
data$Subgroup = factor(data$Subgroup)

This changes how the code in GLM constructs the design matrix.

Last edited:

#### jerh

##### New Member
Based on the fit I ran before seeing any of Rounds' posts, here's what came back. I won't embarass myself by trying to interpret these yet....I'll have to spend some time with Hosmer and Lemeshow first...

Code:
summary(fit)

Deviance residuals
Min        1Q         Median       3Q        Max
-5.6312   -0.1894     0.4753     1.1406   3.5666

Coefficients:
Estimate     Std Error        z value      Pr(>|z|)
(Intercept)        2.45789      0.02995          82.057       <2e-16     ***
Subgroup          -0.27978     0.09562          -2.926       0.00343     **
newMethod       0.74351      0.24172          3.076         0.00210     **

Null deviance: 104.989  on 17 dof
Residual deviance:  91.525 on 15 dof

AIC: 198.13

Last edited:

#### Rounds

##### New Member
Yeah you commited that error I mentioned. I had to rewrite an entire midterm at the last minute once cause of it.

If you had done it correctly it would say the factor level next to the coefficient as part of the name:

Code:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.45789    0.02995 -82.057  < 2e-16 ***
Subgroup1     0.27978    0.09562   2.926  0.00343 **
NewCriteria1 -0.74351    0.24172  -3.076  0.00210 **
Note the Subgroup1 versus Subgroup, without the factor level appended to the coefficient name that is your cue that you fit subgroup as a continuous covariate versus a factor.

On a side note, I just noticed it doesn't seem to matter. edit: It would have if there had been three levels to the factor or the levels had been something besides 0 and 1.

#### TheEcologist

##### Global Moderator
On a side note, I just noticed it doesn't seem to matter. edit: It would have if there had been three levels to the factor or the levels had been something besides 0 and 1.
Exactly it doesn't matter here.. but I feel that you are right Rounds. You should always use factors when appropriate.

#### jerh

##### New Member
Do note however I discarded cycle. That actually could conceivably be relevant; I didn't want to talk about it cause it seems non-trivial. Also the interaction of subgroup and NewCriteria has issues because we never see the effect of newCriteria without subgroup 1.
Those are both questions I figured I'd get to later, but maybe they're best dealt with early. In this case "NewCriteria" is only applied to the subgroup, as a means to address a perceived gap in the promotion rate for the people in that subgroup. Once I feel more comfortable with this methodology I'd actually llke to look at the data from before the criteria change and see if it supports the perception that the members of the subgroup were being promoted at a lower rate. I'm not convinced....

Ulimately, I would expect the cycle to be relevant since the "promotion opportunity" varies from cycle to cycle. I was worried that pooling the data without regard to cycle might be an issue....on the other hand, if you simply consider the annual rate and compare (like a paired-t) you only have 9 observations, and only 2 of those pertain to your new criteria.

#### jerh

##### New Member
Yeah you commited that error I mentioned. I had to rewrite an entire midterm at the last minute once cause of it.

If you had done it correctly it would say the factor level next to the coefficient as part of the name:

Code:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.45789    0.02995 -82.057  < 2e-16 ***
Subgroup1     0.27978    0.09562   2.926  0.00343 **
NewCriteria1 -0.74351    0.24172  -3.076  0.00210 **
Note the Subgroup1 versus Subgroup, without the factor level appended to the coefficient name that is your cue that you fit subgroup as a continuous covariate versus a factor.

On a side note, I just noticed it doesn't seem to matter. edit: It would have if there had been three levels to the factor or the levels had been something besides 0 and 1.
Again, forgive me if this is a naive question, but it does seem that the signs have switched around. Absolute values are the same, but my intercept and newCriteria were positive while my subgroup was negative, yours are the inverse.

#### Rounds

##### New Member
Are your response columns the same order as mine? First is the absence of the event of interest, second is the presence of the event of interest.

#### jerh

##### New Member
Are your response columns the same order as mine? First is the absence of the event of interest, second is the presence of the event of interest.
I see...mine were as shown above: first column (Promoted) is presence of the event, second column (Not Promoted) is the absence....

#### jerh

##### New Member
So at the risk of sounding very dumb, here goes:

The subgroup coefficient = -0.27978. exp(-0.27978)= 0.756. So this is interpreted as a member of the subgroup is 0.756 times as likely to be promoted as someone who is not...?

The newMethod coefficient = 0.74351. exp(0.74351)=2.103. So someone evaluated under the new criteria is 2.1 times as likely to be promoted as someone who isn't...?

Both of these appear significant due to the |z| values...?

But the relatively small decrease in deviance between this model and the null indicate it's not a very good overall fit...?

Assuming I'm not way off base, it seems that I may not be treating the new evaluation criteria correctly, since people who are not part of the subgroup are never going to be evaluated under the new criteria. Might it not be more apprpriate to fit the data before the change in criteria (with subgroup membership as the only factor), fit the data after the change, and see if there is a significant change in the subgroup coefficient?

#### TheEcologist

##### Global Moderator
So at the risk of sounding very dumb, here goes:

The subgroup coefficient = -0.27978. exp(-0.27978)= 0.756. So this is interpreted as a member of the subgroup is 0.756 times as likely to be promoted as someone who is not...?

The newMethod coefficient = 0.74351. exp(0.74351)=2.103. So someone evaluated under the new criteria is 2.1 times as likely to be promoted as someone who isn't...?

Both of these appear significant due to the |z| values...?

But the relatively small decrease in deviance between this model and the null indicate it's not a very good overall fit...?

Assuming I'm not way off base, it seems that I may not be treating the new evaluation criteria correctly, since people who are not part of the subgroup are never going to be evaluated under the new criteria. Might it not be more apprpriate to fit the data before the change in criteria (with subgroup membership as the only factor), fit the data after the change, and see if there is a significant change in the subgroup coefficient?
You don’t sound dumb at all.

The best way for you to get the model predictions is not to use the equation that you used, rather use (in R);

predict(modelname, type='response',se.fit=T) #this also gives you the standard errors.

The model is indeed not a very good fit at all. A simple and straight forward way to test the model fit is by doing this:

nullmodel=glm(response~1,data=dat, family=binomial)
#This is the equiv of R-squared
1-(logLik(yourmodel)/logLik(nullmodel))

Why don’t you first test for a difference in maingroup - subgroup and then test the subgroup - newcriteria separately (without the maingroup)? you should especially do this if this improves the model fits.

Maybe its a good idea to first write down your hypotheses clearly and in a testable manner (one to two sentences). This way we might be able to decide what the best course is.

#### TheEcologist

##### Global Moderator
Why don’t you first test for a difference in maingroup - subgroup and then test the subgroup - newcriteria separately (without the maingroup)? you should especially do this if this improves the model fits.

Running the analysis like this I find no difference between the maingroup and the subgroup (not in sig nor in AIC scores between null and model). Ergo no difference.

While within the subgroup (analysis without maingroup) however I find a sig. difference between the new criteria and the 'rest'. Model fit improves to 0.19

#### jerh

##### New Member
Thanks for all your help. I'm about to get yanked into something that I suspect will take the rest of today, but over the weekend I'm going to see try following your suggestions for this data set and the two others I have...hopefully next week I can post something more meaningful with regard to interpretation.

Thanks again!