# Bayesian approach for logistic regression models? (the case of low-count cells)

#### spunky

##### Smelly poop man with doo doo pants.
so... now that i'm a PhD student my advisor wants me to help out people with their projects. as fate would have it, the same problem has arisen in two situations: a friend of mine working on her dissertation and another one working on an applied research project: the case of low-count cells for logistic regression.

apparently, this is a lot more common than i thought it was. the friend who's working on her dissertation is documenting the impact that it has on coefficient estimates, SEs, convergence, etc. when i asked her what kind of solutions were out there, she said 99% of people either dropped a category or merged it with another one. so no real help there.

i reached out to my Stats Dept friends and two people independently mentioned that they've heard a Bayesian approach would probably work well, but they failed to elaborate on why (the grad student who's big on Bayesian stuff is already on holidays).

so i'm just wondering if anyone (and by anyone i mean Dason) has heard of a Bayesian solution to obtain better parameter estimates when there are low-count cells in logistic regression models. or, to be honest, i'm open to *any* approach that doesn't imply getting rid of data.

(PS- has anyone noticed how EFFING ugly real data is? uuuugghhh!)

#### Englund

##### TS Contributor
Real data, what is that? This is real data, right?

Code:
> x <- rnorm(10)
> x
[1]  0.44272972 -0.12391179 -0.05046462 -0.32750574 -0.01735622  0.40255914  0.30132656
[8] -0.14345201 -1.76344836 -0.20450095

#### TheEcologist

##### R purist
so i'm just wondering if anyone (and by anyone i mean Dason) has heard of a Bayesian solution to obtain better parameter estimates when there are low-count cells in logistic regression models. or, to be honest, i'm open to *any* approach that doesn't imply getting rid of data.

Hi! :welcome: We are glad that you posted here! I would suggest checking out this thread for some guidelines on smart posting behavior that can help you get better answers much more quickly.

For instance the third bullet of point 2 ...

No but seriously, what kind of logistic regression models are we talking about? An experiment with X factors (sex, education level) and a binary (1, 0) or count (# successes, # failures) response variable?

If so, yes, a Bayesian model would work. But I'm also interested in hearing which technique forced people to drop categories with low counts?

Here is a good resource with examples from the BUGS language that may have what you need:

https://github.com/johnmyleswhite/JAGSExamples
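For a flavor of what such models look like, here is a minimal Bernoulli-logit model in BUGS/JAGS notation (all names, `y`, `group`, `N`, `J`, are placeholders I made up, not taken from the linked repo):

```
model {
  for (i in 1:N) {
    y[i] ~ dbern(p[i])                 # binary response
    logit(p[i]) <- b0 + b[group[i]]    # intercept + categorical predictor
  }
  b0 ~ dnorm(0, 0.16)                  # BUGS dnorm takes a precision; 0.16 = 1/2.5^2
  b[1] <- 0                            # reference level
  for (j in 2:J) {
    b[j] ~ dnorm(0, 0.16)              # weakly informative prior on each level
  }
}
```

The proper priors are what help with sparse cells: even a cell with almost no data yields a finite posterior rather than a diverging ML estimate.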

#### spunky

##### Smelly poop man with doo doo pants.
Real data, what is that? This is real data, right?
of course it is! actually, this is even MORE real than that pesky data people go gather out there in that... that... what do they call it? oh yes, the "real world".

#### GretaGarbo

##### Human
Is this a standard two level (0 or 1) logit model?

How is it then possible to drop a category? Then there would only be one level left, and no variation.

she said 99% of people either dropped a category or merged it with another one.

......

(PS- has anyone noticed how EFFING ugly real data is? uuuugghhh!)
Yes, and how interesting!

#### spunky

##### Smelly poop man with doo doo pants.
Hi! :welcome: We are glad that you posted here! I would suggest checking out this thread for some guidelines on smart posting behavior that can help you get better answers much more quickly.

For instance the third bullet of point 2 ...
how *DARE* you use the forum rules against me!? GUARDS!!! GUARDS!!! OFF WITH HIS HEAD!!!

No but seriously, what kind of logistic regression models are we talking about? An experiment with X factors (sex, education level) and a binary (1, 0) or count (# successes, # failures) response variable?
basically. the problem i have (borrowing from my friend who's doing the applied project). goes more or less like this.

say for example that you're trying to model whether people reply 'yes' or 'no' to some variable that i think is called 'suicidal ideation'. so if you start by setting up a contingency table where maybe you have gender as a predictor, you can have men/women and then yes/no to suicide ideation. the problem starts (for her) when she starts adding classification layers. like maybe you have the categories 'has sought a mental health specialist', 'has considered seeking a mental health specialist' and 'has not sought a mental health specialist'. suddenly, when you keep classifying and sub-classifying people (like % of men who have sought help and replied 'yes' to suicide ideation, % of women who have sought help and replied 'yes' to suicide ideation), the counts in each cell start dropping more and more. now, logistic regression is better suited than these multiple multi-way contingency tables BUT she's finding that maybe a few thousand people (this is a national database) concentrate in some categories whereas just a few dozen land in others, so she can't get accurate regression coefficients. or she gets them but the SEs are HUGE. that's the problem of low-count cells in logistic regression. she just doesn't have a balanced-enough frequency table to get stable analyses.

If so, yes, a Bayesian model would work. But I'm also interested in hearing which technique forced people to drop categories with low counts?
no real 'technique' is being implemented. they just lump together categories that end up having very few people in them. say, like in my previous example, that there are very few people in the 'has sought a mental health specialist' and the 'has considered seeking a mental health specialist' categories. so they merge those two into one bigger category associated with seeking help from a specialist.

Here is a good resource with examples from the BUGS language that may have what you need:

https://github.com/johnmyleswhite/JAGSExamples
cool! have you used this before?

#### spunky

##### Smelly poop man with doo doo pants.
Gelman and friends have a paper about a "default" prior distribution for logistic regression models:

http://www.stat.columbia.edu/~gelman/research/published/priors11.pdf

When I have encountered this in the past (just one time) I obtained standard errors via bootstrapping and that seemed to work well.
you mean when you've encountered the low-count cell problems? you didn't run into issues of having estimated regression coefficients that were humongous? so you just bootstrapped your logistic regression or did you do stuff to it before?

and thanks for the link! it looks promising!
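For anyone who lands on this thread later, the quoted bootstrap idea can be sketched in a few lines of R. The data below are simulated (a rare category 'c' stands in for a low-count cell); this is not the original poster's code:

```r
# Sketch: nonparametric bootstrap of logistic-regression standard errors.
set.seed(1)
n <- 300
x <- factor(sample(c("a", "b", "c"), n, replace = TRUE, prob = c(0.60, 0.35, 0.05)))
y <- rbinom(n, 1, plogis(ifelse(x == "c", 1, ifelse(x == "b", 0, -1))))
fit <- glm(y ~ x, family = binomial)

# Resample cases with replacement and refit; the spread of the refitted
# coefficients gives bootstrap SEs to compare against summary(fit).
boot <- replicate(500, {
  i <- sample(n, replace = TRUE)
  coef(glm(y[i] ~ x[i], family = binomial))
})
boot_se <- apply(boot, 1, sd, na.rm = TRUE)  # na.rm guards resamples with an empty cell
```

Incidentally, the Gelman et al. default prior from the linked paper is implemented in the `arm` package as `bayesglm()`, which is essentially a drop-in replacement for `glm()`.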

#### Dason

But if you don't have much data... what is the problem with a large standard error? Surely that is to be expected - going Bayesian hopefully doesn't miraculously fix that.

Also that link appears to be for JAGS - not *BUGS. Similar but not exactly the same.

#### spunky

##### Smelly poop man with doo doo pants.
Is this a standard two level (0 or 1) logit model?

How is it then possible to drop a category? Then there would only be one level left, and no variation.
i guess i was trying to imply the categories were on the predictors. the response variable stays at 0/1. i elaborated more on my previous reply to TE from the example my friend had.

she's looking at people with suicidal ideation (so basically replying "yes" or "no" to a question on a questionnaire) and a bunch of (mostly categorical) predictors like gender, socio-economic status (SES), access to mental health services, etc. this is coming from a national database so her sample is HUGE (in the hundreds of thousands). the problem is that whenever she starts sub-categorizing people (like the proportion of suicidal ideation in men, of a certain SES, with a certain degree of access to mental health services, with certain this and certain that, etc.) she starts running into the problem that maybe there are only 10 or 12 people in some categories but thousands in others. that's what's rendering a lot of her logistic regressions useless. but the response variable is still that "yes"/"no" answer to the suicidal ideation question.

#### spunky

##### Smelly poop man with doo doo pants.
But if you don't have much data... what is the problem with a large standard error? Surely that is to be expected - going Bayesian hopefully doesn't miraculously fix that.
true. but what to do about the non-convergence and huge coefficients? it's so weird, whenever she runs into stuff like that the computer either pukes or just gives an answer that seems nonsensical (and, usually, non-significant, although that is not always the case)
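For the record, the huge-coefficient symptom is easy to reproduce: when a cell contains only successes (or only failures), the maximum-likelihood estimate diverges (so-called complete or quasi-complete separation). A toy R illustration with made-up data:

```r
# Sketch: quasi-complete separation caused by a low-count cell.
# The 'tiny' group has only 2 observations, both of them successes.
y <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
g <- factor(c(rep("big", 8), rep("tiny", 2)))

fit <- glm(y ~ g, family = binomial)  # warns: fitted probabilities 0 or 1 occurred
coef(fit)["gtiny"]                                 # enormous estimate...
summary(fit)$coefficients["gtiny", "Std. Error"]   # ...and an even bigger SE
```

The ML solution wants the fitted probability for 'tiny' to be exactly 1, which only happens as the coefficient goes to infinity, hence the non-convergence warnings and nonsensical output.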

#### TheEcologist

##### R purist
Also that link appears to be for JAGS - not *BUGS. Similar but not exactly the same.
People tend to refer to JAGS as a dialect of BUGS (see e.g. the readme); BUGS is the language. Just as R is the dialect and S is the language. So I stick to my guns and say no, it's *BUGS.

say for example that you're trying to model whether people reply 'yes' or 'no' to some variable that i think is called 'suicidal ideation'. so if you start by setting up a contingency table where maybe you have gender as a predictor, you can have men/women and then yes/no to suicide ideation. the problem starts (for her) when she starts adding classification layers. like maybe you have the categories 'has sought a mental health specialist', 'has considered seeking a mental health specialist' and 'has not sought a mental health specialist'. suddenly, when you keep classifying and sub-classifying people (like % of men who have sought help and replied 'yes' to suicide ideation, % of women who have sought help and replied 'yes' to suicide ideation), the counts in each cell start dropping more and more. now, logistic regression is better suited than these multiple multi-way contingency tables BUT she's finding that maybe a few thousand people (this is a national database) concentrate in some categories whereas just a few dozen land in others, so she can't get accurate regression coefficients. or she gets them but the SEs are HUGE. that's the problem of low-count cells in logistic regression. she just doesn't have a balanced-enough frequency table to get stable analyses.
The widths of the posterior distributions are also going to be large for these low-information classes. You can't have 1 egg in your basket and make a 6-egg omelet. She will have to accept this. Strongly informative priors may solve some of this, but how is she going to validate them?

no real 'technique' is being implemented. they just lump together categories that end up having very few people in them. say, like in my previous example, that there are very few people in the 'has sought a mental health specialist' and the 'has considered seeking a mental health specialist' categories. so they merge those two into one bigger category associated with seeking help from a specialist.
There are techniques that use a hierarchical model approach, lending power to low-sample estimates through hierarchically modelled relationships - for instance, in plant survival (binary, as in your example), you may be able to use plant growth as a prior for the survival estimates (slow-growing plants will have higher mortality)... so you can use the data on growth to estimate survival for plant species that have small sample sizes. If she can think of such a scheme she can improve her estimates. It's not perfect; amongst other effects, the low-sample-size coefficients tend to show shrinkage towards the mean.
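The hierarchical idea can be sketched in the same BUGS-style notation: cell-level effects are drawn from a shared distribution, so sparse cells borrow strength from, and are shrunk towards, the overall mean. All names here are placeholders:

```
model {
  for (i in 1:N) {
    y[i] ~ dbern(p[i])
    logit(p[i]) <- alpha[cell[i]]      # one effect per cell
  }
  for (j in 1:J) {
    alpha[j] ~ dnorm(mu, tau)          # cells share a common distribution;
  }                                    # low-count cells get pulled towards mu
  mu ~ dnorm(0, 0.01)
  sigma ~ dunif(0, 10)
  tau <- 1 / (sigma * sigma)           # BUGS parameterizes dnorm by precision
}
```

The amount of shrinkage is driven by the data themselves: cells with thousands of observations keep their own estimates almost unchanged, while cells with a dozen observations lean heavily on the group-level mean.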

Thus, in the end, I have to say that there is no real cure for no data. When you have no data, you have no data, and the only real prescription is to get more data.

cool! have you used this before?
Yes.

#### Jake

you mean when you've encountered the low-count cell problems? you didn't run into issues of having estimated regression coefficients that were humongous? so you just bootstrapped your logistic regression or did you do stuff to it before?
As I think back on the problem more, the situation, if I remember correctly, was a logistic regression with 2 categorical predictors where one of the cells had 0 successes. Not surprisingly, glm() didn't like this at all, but a permutation test yielded very reasonable results. This was a while ago though, and it was not my data.
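A permutation test along those lines might look like the following sketch (simulated data, not the data described above): refitting under shuffled responses builds a null distribution for the coefficient, which sidesteps the unreliable asymptotic SEs.

```r
# Sketch: permutation test for a categorical predictor in logistic regression.
set.seed(2)
n <- 60
g <- factor(rep(c("ctrl", "trt"), each = n / 2))
y <- rbinom(n, 1, ifelse(g == "trt", 0.8, 0.4))

obs <- coef(glm(y ~ g, family = binomial))["gtrt"]

# Shuffling y breaks any y-g association, so the refitted coefficients
# form the null distribution for the observed one.
perm <- replicate(1000, coef(glm(sample(y) ~ g, family = binomial))["gtrt"])
p_val <- mean(abs(perm) >= abs(obs))   # two-sided permutation p-value
```

The same logic carries over to the zero-success-cell case: a permuted fit may still produce wild coefficients, but the p-value only depends on their rank relative to the observed one.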

#### spunky

##### Smelly poop man with doo doo pants.
but a permutation test yielded very reasonable results. This has been a while though and it was not my data.
this actually sounds like a very reasonable alternative (and it hadn't even crossed my mind until you mentioned it, so thanks!).

do you have the code for a permutation test in logistic regression? or did you use any particular R package to do so? just by googling around i found this 'glmperm' package, which promises to do something similar to what you suggested.

#### Jake

We used some package; I know there are at least 2 packages that do this, but I can't remember which one it was (I no longer have the code).

#### spunky

##### Smelly poop man with doo doo pants.
We used some package
that's all i needed to know

still, for anyone interested in the Bayesian alternative to this, i plan on using this:

Kwang Woo Ahn, Kung-Sik Chan, Ying Bai & Michael Kosoy (2010). Bayesian Inference With Incomplete Multinomial Data: A Problem in Pathogen Diversity. Journal of the American Statistical Association, 105(490).

apparently, there are quite a few Bayesian approaches to work around this issue of low-count/0-count cells.

this is what i like about the board. i started out with one question and i'm leaving with 2 answers.