Real data, what is that? This is real data, right?
Code:
> x <- rnorm(10)
> x
 [1]  0.44272972 -0.12391179 -0.05046462 -0.32750574 -0.01735622  0.40255914  0.30132656
 [8] -0.14345201 -1.76344836 -0.20450095
so... now that i'm a PhD student my advisor wants me to help out people with their projects. as fate would have it, the same problem has arisen in two situations at once: a friend of mine working on her dissertation and another one working on an applied research project. the problem: low-count cells in logistic regression.
apparently, this is a lot more common than i thought it was. the friend who's working on her dissertation is documenting the impact that it has on coefficient estimates, SEs, convergence, etc. when i asked her what kind of solutions were out there, she said 99% of people either dropped a category or merged it with another one. so no real help there.
i reached out to my Stats Dept friends and two people independently mentioned that they'd heard a Bayesian approach would probably work well, but they couldn't elaborate on why (the grad student who's big on Bayesian stuff is already on holidays).
so i'm just wondering if anyone (and by anyone i mean Dason) has heard of a Bayesian solution to obtain better parameter estimates when there are low-count cells in logistic regression models. or, to be honest, i'm open to *any* approach that doesn't imply getting rid of data.
(PS- has anyone noticed how EFFING ugly real data is? uuuugghhh!)
for all your psychometric needs! https://psychometroscar.wordpress.com/about/
Hi! We are glad that you posted here! I would suggest checking out this thread for some guidelines on smart posting behavior that can help you get better answers much more quickly.
For instance the third bullet of point 2 ...
No but seriously, what kind of logistic regression models are we talking about? An experiment with X factors (sex, education level) and, as response, a binary (1/0) or count (# successes, # failures) variable?
If so, yes, a Bayesian model would work. But I'm also interested in hearing which technique forced people to drop categories with low counts?
Here is a good resource with examples from the BUGS language that may have what you need:
https://github.com/johnmyleswhite/JAGSExamples
The true ideals of great philosophies always seem to get lost somewhere along the road..
Gelman and friends have a paper about a "default" prior distribution for logistic regression models:
http://www.stat.columbia.edu/~gelman...d/priors11.pdf
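The core idea in that paper is that a weakly informative prior on the coefficients regularizes the estimates when cells are sparse. As a rough base-R illustration (not the paper's Cauchy prior, which is implemented in arm::bayesglm, but the simpler Gaussian-prior/ridge version of the same idea; all data and the prior scale are made up):

```r
# MAP estimation for logistic regression with a Gaussian prior on the
# coefficients -- equivalent to ridge-penalized maximum likelihood.
# For simplicity the intercept is penalized too (real implementations
# usually treat it more gently).
set.seed(1)
n <- 200
x <- rbinom(n, 1, 0.05)                    # rare predictor category -> sparse cell
y <- rbinom(n, 1, plogis(-1 + 2 * x))

neg_log_post <- function(beta, X, y, prior_sd = 2.5) {
  eta <- X %*% beta
  # negative Bernoulli log-likelihood plus Gaussian prior penalty
  -sum(y * eta - log1p(exp(eta))) + sum(beta^2) / (2 * prior_sd^2)
}

X <- cbind(1, x)
fit <- optim(c(0, 0), neg_log_post, X = X, y = y, hessian = TRUE)
beta_map <- fit$par                          # shrunken coefficient estimates
se_map   <- sqrt(diag(solve(fit$hessian)))   # approximate posterior SDs
```

The prior keeps the coefficient for the sparse cell finite and its standard error bounded, where plain maximum likelihood can drift toward huge values.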
When I have encountered this in the past (just one time) I obtained standard errors via bootstrapping and that seemed to work well.
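A minimal base-R sketch of that bootstrap approach (simulated data, group frequency and replicate count are arbitrary):

```r
# Nonparametric bootstrap of logistic-regression coefficients.
set.seed(42)
n <- 300
g <- rbinom(n, 1, 0.08)                      # sparse group
y <- rbinom(n, 1, plogis(-1.5 + 1.5 * g))
d <- data.frame(y, g)

# Refit the model on resampled rows and collect the coefficients.
boot_coefs <- t(replicate(500, {
  idx <- sample(n, replace = TRUE)
  coef(glm(y ~ g, family = binomial, data = d[idx, ]))
}))

boot_se <- apply(boot_coefs, 2, sd)   # compare with summary(glm(y ~ g, ...))
```

With very sparse cells some resamples can land near separation, which is exactly why the bootstrap distribution is more honest here than the Wald standard errors.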
In God we trust. All others must bring data.
~W. Edwards Deming
for all your psychometric needs! https://psychometroscar.wordpress.com/about/
how *DARE* you use the forum rules against me!? GUARDS!!! GUARDS!!! OFF WITH HIS HEAD!!!
basically. the problem i have (borrowing from my friend who's doing the applied project). goes more or less like this.
say, for example, that you're trying to model whether people reply 'yes' or 'no' to some variable, call it 'suicidal ideation'. if you start by setting up a contingency table with gender as the predictor, you have men/women crossed with yes/no on suicidal ideation. the problem starts (for her) when she adds classification layers: say the categories 'has sought a mental health specialist', 'has considered seeking a mental health specialist' and 'has not sought a mental health specialist'. as you keep classifying and sub-classifying people (% of men who have sought help and replied 'yes' to suicidal ideation, % of women who have sought help and replied 'yes' to suicidal ideation, ...), the counts in each cell keep dropping.

now, logistic regression is better suited to this than multiple multi-way contingency tables, BUT she's finding that a few thousand people (this is a national database) concentrate in some categories whereas only a few dozen fall in others, so she can't get accurate regression coefficients. or she gets them, but the SEs are HUGE. that's the problem of low-count cells in logistic regression: she just doesn't have a balanced-enough frequency table to get stable analyses.
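for concreteness, here's a toy base-R reproduction of the symptom (all numbers made up; the category names just mirror the example above):

```r
# One tiny predictor cell inflates the standard errors of its coefficient.
set.seed(7)
n <- 2000
help_status <- factor(
  sample(c("none", "considered", "sought"), n, replace = TRUE,
         prob = c(0.97, 0.02, 0.01)),          # two very sparse levels
  levels = c("none", "considered", "sought"))
ideation <- rbinom(n, 1, plogis(-2 + (help_status == "sought")))

fit <- glm(ideation ~ help_status, family = binomial)
ses <- summary(fit)$coefficients[, "Std. Error"]
ses   # SEs for the sparse levels dwarf the SE for the intercept (baseline)
```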
no real 'technique' is being implemented. they just lump together categories that end up having very few people in them. say, like in my previous example, that there are very few people in the 'has sought a mental health specialist' and 'has considered seeking a mental health specialist' categories. so they merge those two into one bigger category associated with seeking help from a specialist.
cool! have you used this before?
you mean when you've encountered the low-count cell problem? didn't you run into issues of the estimated regression coefficients being humongous? so did you just bootstrap your logistic regression, or did you do stuff to it beforehand?
and thanks for the link! it looks promising!
But if you don't have much data... what is the problem with a large standard error? Surely that is to be expected; going Bayesian doesn't miraculously fix that.
Also, that link appears to be for JAGS, not *BUGS. Similar, but not exactly the same.
I don't have emotions and sometimes that makes me very sad.
i guess i was trying to say that the categories are on the predictor side; the response variable stays 0/1. i elaborated more in my previous reply to TE, using my friend's example.
she's looking at people with suicidal ideation (so basically replying "yes" or "no" to a question on a questionnaire) and a bunch of (mostly categorical) predictors like gender, socio-economic status (SES), access to mental health services, etc. this is coming from a national database so her sample is HUGE (in the hundreds of thousands). the problem is that whenever she starts sub-categorizing people (like the proportion of suicidal ideation among men, of a certain SES, with a certain degree of access to mental health services, with certain this and certain that, etc.) she runs into the problem that maybe there are only 10 or 12 people in some categories but thousands in others. that's what's rendering a lot of her logistic regressions useless. but the response variable is still that "yes"/"no" answer to the suicidal ideation question.
People tend to refer to JAGS as a dialect of BUGS (see e.g. the readme), BUGS is the language. Just as R is the dialect and S is the language. So I stick to my guns and say no, it's *BUGS.
The width of the posterior distributions is also going to be large for these low-information classes. You can't have 1 egg in your basket and make a 6-egg omelet. She will have to accept this. Strongly informative priors may solve some of it, but how is she going to validate them?
There are techniques that use a hierarchical modelling approach, lending power to low-sample estimates through hierarchically modelled relationships. For instance, in plant survival (binary, as in your example), you may be able to use plant growth as a prior for the survival estimates (slow-growing plants will have higher mortality)... so you can use the data on growth to estimate survival for plant species with small sample sizes. If she can think of such a scheme she can improve her estimates. It's not perfect: amongst other effects, the low-sample-size coefficients tend to shrink towards the mean.
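The shrinkage behaviour described here can be sketched with a hand-rolled empirical-Bayes stand-in for a full hierarchical model: pool per-cell proportions toward the overall rate with a Beta prior. All counts and the pooling strength below are made up for illustration.

```r
# Partial pooling of per-cell success proportions (beta-binomial shrinkage).
successes <- c(520, 480, 6, 1)        # last two cells are the sparse ones
totals    <- c(4000, 3900, 40, 12)

p_overall <- sum(successes) / sum(totals)   # grand success rate
m <- 50                                     # prior "sample size" (assumed)
alpha <- p_overall * m
beta  <- (1 - p_overall) * m

raw    <- successes / totals                # unpooled estimates
pooled <- (successes + alpha) / (totals + m)
rbind(raw, pooled)  # sparse cells move toward p_overall; big cells barely change
```

In a real hierarchical model `m` would itself be estimated from the data rather than assumed, but the qualitative effect (sparse cells borrowing strength, large cells nearly untouched) is the same.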
Thus, in the end, I have to say that there is no real cure for no data. When you have no data, you have no data, and the only real prescription is to get more data.
Yes.
As I think back on the problem, the situation (if I remember correctly) was a logistic regression with 2 categorical predictors where one of the cells had 0 successes. Not surprisingly, glm() didn't like this at all, but a permutation test yielded very reasonable results. This was a while ago, though, and it was not my data.
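A naive base-R sketch of a permutation test for one predictor in a logistic regression (simulated data; this simply permutes the predictor and compares likelihood-ratio statistics, whereas packages built for this purpose use more careful residual-based permutation schemes):

```r
# Permutation test for predictor b in logit(y) ~ a + b.
set.seed(3)
n <- 120
a <- rbinom(n, 1, 0.5)
b <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, plogis(-1 + 1.2 * a))      # b has no real effect here

red_dev <- glm(y ~ a, family = binomial)$deviance
lr_stat <- function(bb) red_dev - glm(y ~ a + bb, family = binomial)$deviance

obs  <- lr_stat(b)                            # observed LR statistic for b
perm <- replicate(199, lr_stat(sample(b)))    # refit with b permuted
p_value <- mean(c(obs, perm) >= obs)          # permutation p-value
```

Because the p-value comes from the permutation distribution rather than Wald standard errors, it stays usable even when a zero-success cell makes the Wald output garbage.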
this actually sounds like a very reasonable alternative (and it hadn't even crossed my mind until you mentioned it, so thanks!).
do you have the code for a permutation test in logistic regression? or did you use any particular R package to do it? just by googling around i found this 'glmperm' package, which promises to do something similar to what you suggested.
We used some package; I know there are at least 2 packages that do this, but I can't remember which one it was (I don't still have the code).