Binary Logistic Regression in R

#21
That other task is taking significantly longer than I expected :(

I managed to recreate your results with regard to main group vs subgroup, and subgroup before and after change. I have some concerns though with the way I'm looking at the data. Earlier you had mentioned trying to clearly state the question I'm trying to answer. Maybe that will help.

The question, as posed to me, was "Did the evaluation changes made in 2005 improve the promotion rate of candidates in the subgroup as compared to candidates from outside the subgroup?" To me, the requisite first question is "was there a significant difference in the promotion rate between the two groups before the evaluation changes?" I think the answer to that question is no, but I need to be able to clearly support that.

Here's my big concern. Each of the aggregate data sets represents the results of an annual (once or twice, semiannual) promotion review board. Promotion rates vary across the entire population from board to board, so what we're really looking at is the difference between the rate for the subgroup and the rate for everyone else within a given board. So when I aggregate all the data, don't I lose that correlation?

For instance, when running the regression on just the subgroup using the before/after factor, there's no accounting for the fact that in the two years the new criteria were in place, promotion rates across the entire population were higher than average, right? So the new-criteria factor looks like it's associated with a better chance of being promoted, but it may just as well be that in those two years more people were being promoted, full stop. Would this be (and I'm scared to type these words) some kind of multinomial regression where year becomes a factor with multiple levels?

You guys really should get paid for reading all this....
 
#22
I think I finally exhausted everyone's patience! :)

Just out of curiosity, before anyone mentioned binary logistic regression, my first stab at these numbers looked like this:

Code:
Population Promotion Rate   Subgroup Rate   Delta
87.9%                       89.3%            1.3906%
88.5%                       85.6%           -2.8324%
92.5%                       92.4%           -0.0566%
92.4%                       92.3%           -0.1837%
93.7%                       87.9%           -5.7765%
93.0%                       91.5%           -1.5063%
92.6%                       90.8%           -1.8585%
93.2%                       94.9%*           1.7134%
94.2%                       94.9%*           0.7178%

* = new criteria in place

First 7 cycles (old criteria)
Delta mean: -1.5462%
std dev: 0.0232196 (i.e., 2.322 percentage points)
95% CI on delta mean: [-3.266%, 0.174%] -- contains 0, so you can't say they're different
90% CI on delta mean: [-2.9897%, -0.1026%] -- doesn't contain 0, so you can say the delta is negative at the 90% confidence level

Last 2 cycles (new criteria)
Delta mean: 1.2156%
std dev: 0.0070396 (i.e., 0.704 percentage points)
95% CI on delta mean: [0.2400%, 2.1912%] -- doesn't contain 0, so you can say the delta is positive at the 95% confidence level
While this is the answer that I know management wants -- the subgroup's rate was below the rest of the population's before, and is above it now -- I'm really uncomfortable making such a statement based on 7 and 2 data points respectively. The confidence intervals were constructed using Excel's CONFIDENCE function....
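For reference, a minimal R sketch that reproduces those intervals on the seven old-criteria deltas (z-based to match Excel's CONFIDENCE, which assumes a known sigma; t.test() gives a wider t-based interval, which is arguably more honest with n = 7):

Code:
# deltas (subgroup rate - population rate) for the 7 old-criteria cycles,
# in percentage points
delta = c(1.3906, -2.8324, -0.0566, -0.1837, -5.7765, -1.5063, -1.8585)
m = mean(delta)   # -1.5462
s = sd(delta)     #  2.3220
n = length(delta)

# z-based 95% CI, as Excel's CONFIDENCE computes it
m + c(-1, 1) * qnorm(0.975) * s / sqrt(n)   # roughly [-3.266, 0.174]

# t-based 95% CI, accounting for estimating sigma from only 7 points
t.test(delta)$conf.int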
 
#23
I think your concerns are very valid. You have to choose between assuming there's no difference between years and pooling the data, or grouping it (e.g., let X = mainrate - subgrouprate) and then dealing with only 9 observations. The latter is, I think, clearly insufficient, so let's think about whether you can reasonably assume away inter-year differences.

Unfortunately, you can't claim there's no omitted-variable bias on the grounds that the omitted variable is uncorrelated with the regressors, since year is clearly correlated with new criteria. So you need to be able to argue that year is insignificant. Since you've got a bunch of individual data for each year, you could try this: drop the last two years, then run a logit of promotion (yes or no) on year dummies and a group dummy (a sketch of this is below). If the year dummies are jointly insignificant, you could argue that year doesn't really matter, at least in the first 7 years. Then you can argue that a significant coefficient on newcriteria in a logit regression with just subgroup and newcriteria (which should be 1 for everyone in those years, btw) is actually due to the new criteria.

It's not watertight, but it's not unreasonable. Just basing it on 7 vs. 2 is unreasonable, I'd say.
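A hedged sketch of that check in R, assuming individual-level data in a frame ind with columns promoted (0/1), cycle, and subgroup (the names are mine, not from the thread), using a likelihood-ratio test for the joint significance of the year dummies:

Code:
# keep only the 7 old-criteria cycles (cycle labels assumed; adjust as needed)
old = subset(ind, !(cycle %in% c(8, 9)))
old$cycle = factor(old$cycle)

# logit of promotion on year dummies plus the group dummy
full    = glm(promoted ~ subgroup + cycle, data = old, family = binomial)
reduced = glm(promoted ~ subgroup,         data = old, family = binomial)

# likelihood-ratio test: are the year dummies jointly significant?
anova(reduced, full, test = "Chisq")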
 
#24
He has plenty of data because he has at least 2 promotion boards per factor-level combination, and each has observations in the hundreds.

The reason I didn't want to mess with it before is that promotion board isn't a fixed effect with 9 levels. It's a random effect, and all of this is nested in it.
 
#25
So I was still interested in this problem, but for the wrong reasons. I believe this is a GLMM, which can be fit by lmer in the lme4 package; that's what made it interesting to me. But the wrong reason is that I have no experience with GLMMs, lmer, or lme4. So that's not helpful.

But I thought about what I would do if someone held a gun to my head.

My model has fixed effects for the subgroup and newcriteria levels, and a random intercept nested in cycle.

That leads to the model in R of:
promoted ~ subgroup+newcriteria + (1|cycle)

My code was this:
Code:
library(lme4)
setwd('C:/Users/Jeremiah/Documents/Fall08/Research/lme/')
data = read.table("data.txt")
colnames(data) = c("cycle", "promoted", "notpromoted", "subgroup", "newcriteria")
data$cycle = factor(data$cycle)
data$subgroup = factor(data$subgroup)
data$newcriteria = factor(data$newcriteria)
summary(data)
# two-column (successes, failures) response; first column is success
response = cbind(data$promoted, data$notpromoted)
# note: older lme4 accepted family= in lmer(); current versions use glmer()
fit1 = lmer(response ~ subgroup + newcriteria + (1|cycle), data=data, family=binomial)
And that leads to this summary:
Code:
Generalized linear mixed model fit by the Laplace approximation 
Formula: response ~ subgroup + newcriteria + (1 | cycle) 
   Data: data 
   AIC   BIC logLik deviance
 44.08 47.64 -18.04    36.08
Random effects:
 Groups Name        Variance Std.Dev.
 cycle  (Intercept) 0.056011 0.23667 
Number of obs: 18, groups: cycle, 9

Fixed effects:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.46866    0.08453  29.206   <2e-16 ***
subgroup1    -0.21176    0.09702  -2.183   0.0291 *  
newcriteria1  0.46537    0.25136   1.851   0.0641 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Correlation of Fixed Effects:
            (Intr) sbgrp1
subgroup1   -0.109       
newcriteri1 -0.002 -0.381
Based on that, you have statistically significant evidence that subgroup 1 is less likely to be promoted, but the evidence that the new criteria have a non-zero fixed effect is only marginal. This is kind of a "more data" situation; another cycle might tilt it one way or the other.
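If it helps interpretation, the fixed effects are on the log-odds scale; exponentiating turns them into odds ratios (a quick sketch using fit1 from above):

Code:
exp(fixef(fit1))
# (Intercept)  ~11.81  baseline odds of promotion
# subgroup1    ~ 0.81  subgroup's odds of promotion are about 19% lower
# newcriteria1 ~ 1.59  new-criteria odds about 59% higher, but only
#                      marginally significant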

I am not really sure about diagnostics. Like I said, new turf for me. Seek out an expert on GLMMs.

BTW, that number-of-observations line is kind of misleading in this context. Consider:

Code:
# reshape to help lmer out: one observation per row
newdata = matrix(0, sum(data$promoted) + sum(data$notpromoted), ncol(data) - 1)
index = 1
for (i in 1:nrow(data)) {
    promoted = data$promoted[i]
    notpromoted = data$notpromoted[i]
    newrow = as.numeric(as.matrix(data[i, ]))
    newrow = newrow[-3]               # drop the notpromoted count column
    newrow[2] = 1                     # level for promoted
    for (j in seq_len(promoted)) {    # seq_len() is safe if a count is 0
        newdata[index, ] = newrow
        index = index + 1
    }
    newrow[2] = 0                     # level for notpromoted
    for (j in seq_len(notpromoted)) {
        newdata[index, ] = newrow
        index = index + 1
    }
}

newdata = as.data.frame(newdata)
colnames(newdata) = colnames(data)[-3]
newdata$cycle = factor(newdata$cycle)
newdata$promoted = factor(newdata$promoted)
newdata$subgroup = factor(newdata$subgroup)
newdata$newcriteria = factor(newdata$newcriteria)
summary(newdata)

fit = lmer(promoted ~ subgroup + newcriteria + (1|cycle), data=newdata, family=binomial)
summary(fit)
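As an aside, the same expansion can be written without explicit loops using rep(); a sketch (newdata2 is my name, same column layout as above assumed):

Code:
# counts interleaved as promoted1, notpromoted1, promoted2, notpromoted2, ...
counts = c(rbind(data$promoted, data$notpromoted))
# source row in data for each individual, and that individual's 1/0 outcome
rows    = rep(rep(1:nrow(data), each = 2), times = counts)
outcome = rep(rep(c(1, 0), times = nrow(data)), times = counts)
newdata2 = data.frame(cycle       = data$cycle[rows],
                      promoted    = factor(outcome),
                      subgroup    = data$subgroup[rows],
                      newcriteria = data$newcriteria[rows])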
So now there is one observation per row, with this summary:
Code:
Generalized linear mixed model fit by the Laplace approximation 
Formula: promoted ~ subgroup + newcriteria + (1 | cycle) 
   Data: newdata 
  AIC  BIC logLik deviance
 9460 9491  -4726     9452
Random effects:
 Groups Name        Variance Std.Dev.
 cycle  (Intercept) 0.056011 0.23667 
Number of obs: 17081, groups: cycle, 9

Fixed effects:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   2.46866    0.08453  29.206   <2e-16 ***
subgroup1    -0.21176    0.09702  -2.183   0.0291 *  
newcriteria1  0.46508    0.25133   1.850   0.0642 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Correlation of Fixed Effects:
            (Intr) sbgrp1
subgroup1   -0.109       
newcriteri1 -0.002 -0.381
You'll note that even the variance and std. dev. estimates for the random effect are exactly the same, yet the number of observations has jumped to 17081. It is interesting that the likelihood has changed, which suggests something different is being done internally even though the estimates are exactly the same.
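If I had to guess at the reason (my inference, not something established in the thread): the two-column form uses the binomial likelihood, which carries a choose(n, k) term, while the one-row-per-observation form uses the Bernoulli likelihood, which doesn't. That term is constant in the parameters, so it shifts the logLik and deviance without moving the estimates. A toy check:

Code:
# binomial vs. Bernoulli log-likelihood for the same counts
n = 250; k = 230; p = k / n
ll_binom = dbinom(k, size = n, prob = p, log = TRUE)  # aggregated form
ll_bern  = k * log(p) + (n - k) * log(1 - p)          # one row per person
ll_binom - ll_bern   # equals lchoose(n, k): constant in p, so fits agree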

edit and ps: According to MASS, the first column in the response matrix is considered "success", which is different from what I said earlier =) And it's also confusing because if you ever use the factor version of the response (a column of 1s and 0s), the first factor level is considered failure and everything else is considered success. That's how I messed it up in my head.
 