I ran a simple logistic regression in R and here are my results. My independent variables are all "counts" and I coded my Y as 0 and 1. This is a small dataset with 54 observations, and 5 independent variables.
I know that the P value is high, so given that, what is my next step? Does it just mean my variables are not significant and I should choose another set of variables? Is there any type of "troubleshooting" I can do?
Also, I do not know how to interpret the rest of the results, so any comments would be appreciated.
Thanks!
Call:
glm(formula = Y ~ ., family = binomial(link = "logit"), data = data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.4045  -0.6166  -0.4228  -0.1674   2.0933

Coefficients:
      Estimate Std. Error z value Pr(>|z|)
(Int) -2.40466    1.37109  -1.754   0.0795 .
v1     0.10440    0.48008   0.217   0.8279
v2    -0.31214    0.55934  -0.558   0.5768
v3     0.04656    0.18316   0.254   0.7993
v4     0.48907    0.23262   2.102   0.0355 *
v5    -0.82624    0.74024  -1.116   0.2643
v6     0.20443    0.18118   1.128   0.2592
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 50.482 on 50 degrees of freedom
Residual deviance: 40.822 on 44 degrees of freedom
AIC: 54.822

Number of Fisher Scoring iterations: 5
How many 0s and 1s in your Y variable?
Stop cowardice, ban guns!
Hi hlsmith,
for my Y variable, I have about 33 0's and 10 1's, so roughly 75% and 25%.
Most of my X-variables are around 50/50, except for one which is also 75% 0's.
Does logistic regression require the data to be distributed in a certain way before it is effective? I tried the same regression with about 500 observations, and got similar results, except I got 1 variable with a p value that is 'significant.'
Just to give some background: all my data were in the form of counts, so I took the average of each variable and coded each value as 1 if it exceeds the average and 0 otherwise. If there is a more effective approach, let me know.
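For anyone following along, the coding described here could be sketched in R roughly as below. The `counts` vector is made up for illustration and is not the actual data.

```r
# Hypothetical count predictor; these values are invented for illustration.
counts <- c(0, 2, 5, 1, 7, 3, 0, 4, 9, 1)

# Code 1 when the count exceeds the sample mean, 0 otherwise.
cutoff <- mean(counts)
x_binary <- as.integer(counts > cutoff)

cutoff
table(x_binary)

# Values on the same side of the cutoff become indistinguishable:
# a count of 4 and a count of 9 are both coded 1.
data.frame(counts, x_binary)
```

Printing the two columns side by side makes it easy to see which distinctions among the raw counts the 0/1 coding throws away.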
Well, you lose information when you dichotomize your data. What I wanted to get at is that you have "sparse binary" data. A general rule of thumb for logistic regression is that you need 10-20 events in the smaller outcome group for each predictor, so you would need roughly 6 to 12 times more data based on that rule.
Though you say that you ran the model with more data, so maybe those truly are variables that are not associated with the outcome. Could this be a possibility? What type of results do you see when you run 6 simple logistic regression models with one IV?
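A minimal sketch of fitting one simple logistic regression per IV is below. It assumes a data frame with a 0/1 outcome `Y` and predictors `v1` to `v6`; the data frame `dat` here is simulated purely so the code runs, so its estimates mean nothing.

```r
set.seed(1)  # simulated data only; not the poster's dataset

# Hypothetical data frame shaped like the one in the question:
# a 0/1 outcome Y and six 0/1 predictors v1..v6.
n <- 54
dat <- data.frame(Y = rbinom(n, 1, 0.25))
for (v in paste0("v", 1:6)) dat[[v]] <- rbinom(n, 1, 0.5)

# Fit one logistic regression per predictor and keep the slope row.
predictors <- setdiff(names(dat), "Y")
univariable <- do.call(rbind, lapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "Y"), family = binomial, data = dat)
  coef(summary(fit))[v, , drop = FALSE]
}))

round(univariable, 4)
```

Each row holds the estimate, standard error, z value and p-value for that predictor when it is the only IV in the model, which can then be compared with its row in the model containing all six.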
If a variable has too little variation, that creates issues, but the problems are usually described as serious only above 90%. So if an IV had 90 percent of its cases at one level you would have issues (I think the resulting problem is attenuation of the slope). According to certain statisticians (for example Agresti), you need a minimum number of cases in the least common level of your DV for logistic regression, but I do not remember what the number is. This is probably what hlsmith is suggesting. These are rules of thumb, not theoretical results, and I am sure many disagree on this issue.
54 cases is not a lot of data. It may be that you do not have enough power to detect an effect.
"The difference between genius and stupidity is that genius has its limits."
Thanks for the replies. From your suggestions, there are a few possible causes for my results, including:
1. not enough data
2. dichotomizing my data in arbitrary groupings.
To clarify, what did you mean by "sparse binary"? Is that just another way to say that there are not enough observations to detect an effect, or is sparse binary a whole separate issue?
Also, what did you mean by: "...10-20 events in your smaller outcome groups to power each predictor, So you would need 6 to 12 times more data based on that general rule...." What are my "smaller outcome groups" and what do you mean by "events"? I'm just curious about the general rule so I can have a concept in mind.
Also, is there a better way to approach this problem in general? All my variables were in the form of counts, and I did not know which regression would be appropriate, so I figured logistic would be the way to go. Would I be better off not dichotomizing and using the raw data in a different regression?
Sparse binary data is just a term for when you have many variables and, when you cross-classify them, there are very few or no people in the resulting subgroups.
Variables: age, sex, exposure, race, insurance status, marital status, employment status, ..., etc. Say you have 10 outcomes in the smaller of the two outcome groups. If all seven of those variables were binary there would be 128 subgroup combinations, yet you only have 10 people with the less common outcome, so most of the combinations will be empty (e.g., no young, male, exposed, Asian, insured, unmarried, unemployed people). It becomes goofy to make predictions about people you don't even have data for, even though the model still produces beta coefficients. Models also have trouble converging in these scenarios. Does that make sense?
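To make the empty-subgroup point concrete, here is a rough sketch that counts how many of the 128 possible patterns actually occur. The seven binary covariates are simulated, and the 43 subjects are an assumption taken from the group sizes mentioned earlier in the thread.

```r
set.seed(42)  # simulated data only

# Seven hypothetical binary covariates for 43 subjects.
n_subjects <- 43
n_vars <- 7
covars <- as.data.frame(
  matrix(rbinom(n_subjects * n_vars, 1, 0.5), ncol = n_vars)
)

possible_cells <- 2^n_vars               # 128 covariate patterns
occupied_cells <- nrow(unique(covars))   # patterns that actually occur
empty_cells <- possible_cells - occupied_cells

c(possible = possible_cells, occupied = occupied_cells, empty = empty_cells)
```

Even if every subject had a different pattern, at most 43 of the 128 cells could be occupied, so at least 85 are empty no matter how the data fall.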
Is your dependent variable actually binary? Let us say it is, so your DV has two groups we will call 0 and 1 (which could represent No and Yes). Say 33 people are 0s and 10 are 1s, so the smaller group is '1'. A general rule is that you need 10-20 people in your smaller group for each predictor you introduce to your model: 10-20 people for 1 IV, 20-40 for 2 IVs, ..., and 60-120 people with 1s for 6 IVs (what you have). At the bottom threshold that is 6*10 in the smaller group, which at the same ratio also means 6*33 in the larger group, so your n=43 times 6, a sample of 258. The top threshold would be 43 times 12, or 516. The rule just gives you a generic marker to think about. If you added an interaction term, you would need at least another 43 people overall so that the smaller group increases by 10. I wrote this quickly, so I apologize for typos and clarity.
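The arithmetic behind that rule of thumb can be sketched as below, using the group sizes from this thread (10 in the smaller group, 33 in the larger) and assuming the ratio between the groups stays the same as the sample grows. The variable names are just for illustration.

```r
n_small <- 10        # observations in the less common outcome group
n_large <- 33        # observations in the more common outcome group
n_predictors <- 6    # v1..v6
epv <- c(low = 10, high = 20)   # rule-of-thumb events per predictor

needed_small <- n_predictors * epv                  # 60 and 120
scale_factor <- needed_small / n_small              # 6 and 12
needed_total <- scale_factor * (n_small + n_large)  # 258 and 516

rbind(needed_small, scale_factor, needed_total)
```

Changing `n_predictors` to 7 shows the cost of one extra term, such as an interaction: the low threshold rises by another 43 people overall.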
Your understanding of the project's context should guide you in which IVs to use, so that it is not just a fishing expedition where you say "hey, that came up significant." In addition, with too many predictors and a small sample you run the risk of over-fitting: your results may be too specific to your sample and not generalize to others. Your sample is just one realization from the true population, so given sampling variability the next sample could differ.
So you can limit the number of predictors you try, or perhaps use a penalized model (Firth) or a regularized one (lasso, elastic net). The latter can drop correlated predictors and whittle your candidate set down to something the model can support.
hi,
v4 might be worth a bit more investigation, I think.
regards
Guys, it makes sense. I appreciate all your responses. Rogojel, can you explain why v4 sticks out?
It is the only statistically significant variable in the full model (the one with all the predictors). So it is able to hold its own while controlling for the other variables.
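As a rough way to read the v4 row and the deviance lines, the sketch below turns the numbers printed in the original output into an odds ratio with a Wald interval and an overall likelihood-ratio test. The values are copied from that output, and the variable names are just for illustration.

```r
# Values copied from the summary output in the question.
b_v4  <- 0.48907
se_v4 <- 0.23262

odds_ratio <- exp(b_v4)                                 # roughly 1.63
wald_ci <- exp(b_v4 + c(-1, 1) * qnorm(0.975) * se_v4)  # approximate 95% CI

# Overall test of the six predictors: drop in deviance vs. the null model.
lr_stat <- 50.482 - 40.822
lr_df   <- 50 - 44
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

round(c(OR = odds_ratio, lower = wald_ci[1], upper = wald_ci[2]), 2)
round(c(LR = lr_stat, df = lr_df, p = lr_p), 3)
```

On these numbers, going from 0 to 1 on v4 multiplies the odds of Y = 1 by about 1.6, with an interval whose lower end sits just above 1. The overall likelihood-ratio p-value comes out near 0.14, so the six predictors taken together are not clearly better than the intercept-only model.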