So I have been investigating the best model to run for my data. I originally looked at zero inflated Poisson regression, but when I compared my variance and mean I realized the variance is far greater than the mean, signaling overdispersion. Therefore, I elected to run a zero inflated negative binomial model. However, the problem is that my dependent variable is continuous, not a count. What are my options?
I am analyzing what environmental factors predict distance moved in an animal. I have predictors such as rainfall, temp, cloud cover, etc. For most events the animal did not move, hence the zero inflated idea. But distance is not really a count. This may be a stupid question, but I researched the issue quite a bit before getting lost and deciding to post. Thanks for any input; it's greatly appreciated, as always.
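A quick version of the variance-versus-mean check described above can be sketched as follows. It is written in Python purely for illustration (the thread itself works in R), and the `distances` values are made up, not the poster's data. Keep in mind that the variance-to-mean ratio is a Poisson diagnostic for counts; for a continuous response it changes with the measurement units, so the share of exact zeros and the presence of non-integer values are the more telling numbers.

```python
from statistics import mean, variance

# Hypothetical daily distances (invented for illustration only).
distances = [0, 0, 0, 0, 12.5, 0, 0, 48.0, 0, 3.2, 0, 0, 150.0, 0, 0, 27.4]

m = mean(distances)
v = variance(distances)  # sample variance (n - 1 denominator)

# For count data, a ratio far above 1 suggests overdispersion relative to Poisson.
dispersion_ratio = v / m

# Share of observations that are exactly zero (the "zero inflated" part).
zero_share = sum(d == 0 for d in distances) / len(distances)

# Non-integer values mean the response is not a count, so Poisson and
# negative binomial families do not apply directly.
has_non_integer = any(d != int(d) for d in distances)

print(f"mean={m:.2f} variance={v:.2f} ratio={dispersion_ratio:.2f}")
print(f"zero share={zero_share:.3f} non-integer values={has_non_integer}")
```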
Maybe "zero inflated gamma" or "zero inflated lognormal" is worth searching for.
zombie_kid (02-18-2014)
I agree
I'll also throw this one out there because I saw it recently for the first time and it's really cool: if the dependent variable were bounded, you might consider a zero-one inflated beta, but I'm guessing the distance traveled by your critters isn't bounded.
by the way, are you able to post a graph of the distribution of "distance"?
zombie_kid (02-18-2014)
[Image: histogram of the distribution of Distance (not preserved)]
I spent some time researching zero inflated gamma and what others have been doing. It seems the common suggestion is to run a logistic model with any movement valued 1 and no movement 0 to determine the probability of not moving and then running a gamma glm for the data with all the zero movements removed. For the first step, I am not sure how to calculate the probability of not moving. Would I run the logistic glm and then take the intercept coef and convert it to odds ratio and then probability % or would I do it with the best model (a series of covariates)? If I run the logistic glm and chose the best model with covariates how do I get the probability not moving and how would I report that? The probability of not moving with the best model (cloud cover and temperature) to predict movement is xx%, and when they do move the best model is precipitation and temperature? Would it be something like that?
this looks promising, I've seen the zero frequency much higher (making ZI model more difficult) ... but I notice the bin width is ~20, so I'm not sure how many actual zeros there are, what's the proportion of Distance=0 points?
That's precisely what a ZI model does: it's a mix between logistic (to get probability of Distance=0) and the desired pdf (e.g. gamma). Indeed, your sources are correct that running a logistic model as you described is (probably, usually) a good way to get starting values for the logistic component of the ZI model.
There will be one log-odds estimate for each value of the vector (x1, x2, ..., xk), where x1...xk are k predictor variables, and the log-odds is with respect to whatever the reference level is for all variables combined (e.g. if k=2, the variables are cloud cover and temperature, then the comparison is probably when both = 0, by default) ... is that what you're asking?
in any event, there will be 2 linear components to your model:
$\operatorname{logit}(p) = X_1 \beta_1$ for the logistic part
and
$g(\mu) = X_2 \beta_2$ for the gamma part
where $p$, $\mu$, $X_1$, $\beta_1$, $X_2$, $\beta_2$ are, respec., the probability that Distance>0, the gamma component's mean, the design matrix (independent variables) for the logistic component, the coefficients for the logistic component, the design matrix for the gamma component, the coefficients for the gamma component, and $g$ is the link function chosen for the gamma component (e.g. log)
I say all that to say this: when you run a logistic regression as you described, the result can inform you about which variables should be in $X_1$ and what the starting values of $\beta_1$ should be ... of course, given that your sample size is large compared to the number of model parameters, you could start with both design matrices containing all variables and build the model from there
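To make the two linear components concrete, here is a minimal Python sketch of the joint log-likelihood for a logistic-plus-gamma model. It is an illustration under stated assumptions, not the poster's or any package's implementation: it assumes a logit link for the probability that Distance>0, a log link for the gamma mean, and a single shared gamma shape parameter. The function names and the mean/shape parameterization are choices made for this example. Because a gamma density puts no mass at zero, the zero-inflated and hurdle versions of this model coincide.

```python
import math

def inv_logit(eta):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def gamma_logpdf(y, mu, shape):
    """Log density of a gamma with mean mu and shape parameter (rate = shape / mu)."""
    rate = shape / mu
    return (shape * math.log(rate) + (shape - 1.0) * math.log(y)
            - rate * y - math.lgamma(shape))

def two_part_loglik(y, x1, x2, beta1, beta2, shape):
    """Joint log-likelihood: logistic part for moving, gamma part for distance.

    y     : observed distances (zeros allowed)
    x1/x2 : rows of the design matrices for the logistic / gamma components
    beta1 : logistic coefficients (log-odds of Distance > 0)
    beta2 : gamma coefficients on the log scale (assumed log link)
    """
    total = 0.0
    for yi, x1i, x2i in zip(y, x1, x2):
        p = inv_logit(sum(b * x for b, x in zip(beta1, x1i)))
        if yi == 0:
            total += math.log(1.0 - p)  # animal did not move
        else:
            mu = math.exp(sum(b * x for b, x in zip(beta2, x2i)))
            total += math.log(p) + gamma_logpdf(yi, mu, shape)
    return total

# Tiny made-up example: intercept-only design matrices, three observations.
y = [0.0, 0.0, 25.0]
ones = [[1.0], [1.0], [1.0]]
print(two_part_loglik(y, ones, ones, beta1=[-1.0], beta2=[3.0], shape=2.0))
```

In this form the log-likelihood is a sum of a logistic piece and a gamma piece that share no parameters, which is why fitting the logistic model and the gamma model on the non-zero rows separately gives the same estimates as maximizing the joint likelihood.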
zombie_kid (02-19-2014)
The proportion of zeros is 889/1189. So I ran the logistic model and achieved the following results from the best model from an AIC stepwise selection:
Coefficients:
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                 -1.156957   0.245063  -4.721 2.35e-06 ***
cover.fpartly                0.443616   0.249647   1.777  0.07557 .
cover.fsunny                -0.305031   0.162182  -1.881  0.06000 .
patch.f2                     0.337086   0.191946   1.756  0.07906 .
patch.f3                     0.765961   0.294270   2.603  0.00924 **
patch.f4                    -0.286296   0.152999  -1.871  0.06131 .
Precipitation               -0.006028   0.002309  -2.610  0.00905 **
TMIN                         0.029289   0.013880   2.110  0.03485 *
cover.fpartly:Precipitation -0.003900   0.009433  -0.413  0.67929
cover.fsunny:Precipitation   0.012801   0.002769   4.623 3.79e-06 ***
I converted these to odds ratios and then into percentage changes in the odds, as I would for reporting in my results section. They look like:
56% increase in odds of movement if partly cloudy
26% decrease in odds of movement if sunny
40% increase in odds of movement if from patch 2
115% increase in odds of movement if from patch 3
25% decrease in odds of movement if from patch 4
0.6% decrease in odds of movement for each unit increase in rainfall (10ths of mm?)
3.0% increase in odds of movement for each degree increase in TMIN
0.4% decrease in odds of movement if partly cloudy and unit increase in precip
1.3% increase in odds of movement if sunny and unit increase in rain
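The conversion from the posted logistic coefficients to odds ratios, and from there to a predicted probability of moving, can be sketched as below. The coefficient values are copied from the output above; everything else is an assumption made for illustration: that the response was coded 1 = moved, that the unlisted cover and patch levels are the reference levels, and the covariate values passed to the hypothetical helper `prob_move`. Python is used only for illustration, since the thread works in R.

```python
import math

# Log-odds coefficients copied from the logistic output posted above.
coefs = {
    "(Intercept)": -1.156957,
    "cover.fpartly": 0.443616,
    "cover.fsunny": -0.305031,
    "patch.f2": 0.337086,
    "patch.f3": 0.765961,
    "patch.f4": -0.286296,
    "Precipitation": -0.006028,
    "TMIN": 0.029289,
    "cover.fpartly:Precipitation": -0.003900,
    "cover.fsunny:Precipitation": 0.012801,
}

# exp(coefficient) is an odds ratio; (OR - 1) * 100 is the percent change in the odds.
pct_change_in_odds = {
    name: 100.0 * (math.exp(b) - 1.0)
    for name, b in coefs.items()
    if name != "(Intercept)"
}

def prob_move(cover="reference", patch=1, precip=0.0, tmin=0.0):
    """Predicted P(movement) for one covariate setting (assumes 1 = moved)."""
    eta = coefs["(Intercept)"] + coefs["Precipitation"] * precip + coefs["TMIN"] * tmin
    if cover in ("partly", "sunny"):
        eta += coefs[f"cover.f{cover}"]
        eta += coefs[f"cover.f{cover}:Precipitation"] * precip
    if patch in (2, 3, 4):
        eta += coefs[f"patch.f{patch}"]
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical covariate values, chosen only to show the call.
p = prob_move(cover="sunny", patch=3, precip=10.0, tmin=15.0)
print(f"P(move) = {p:.3f}, P(no move) = {1.0 - p:.3f}")
for name, pct in pct_change_in_odds.items():
    print(f"{name}: {pct:+.1f}% change in odds")
```

Note that the percentages are changes in the odds, not in the probability, and that the model gives a different predicted probability for every combination of covariate values rather than one overall figure.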
My next step was running the gamma glm with the zeros removed. The best model achieved based on stepwise AIC was the following:
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.0392863  0.0068262   5.755  1.9e-08 ***
patch.f2    -0.0089734  0.0032709  -2.743  0.00640 **
patch.f3    -0.0066998  0.0042488  -1.577  0.11574
patch.f4    -0.0020613  0.0032828  -0.628  0.53047
Temp        -0.0008266  0.0003139  -2.634  0.00883 **
Now I am not sure how to interpret the coefficients from the gamma model. I have already done a conditional logistic model for habitat selection and a Cox proportional hazards model for survival, so I am familiar with interpreting odds ratios and hazard ratios. Do I calculate odds ratios the same way for the gamma model?
As for reporting in my results write-up: would I state the above probabilities of movement based on the best model and say something about the probability for each covariate like I did above, or do I need to somehow estimate the overall probability of movement given that best model? This is where I get confused. I envisioned it as something like: given the best model of Patch, Precipitation, Cloud Cover, and Minimum Temp, the probability of not moving is xx, but when animals do move, the best model of Patch and Average Daily Temp predicts how far the animal moves.
The last thing you said is interesting about letting the logistic guide the gamma. It seems kind of silly to get a set of predictors that predict movement and then different ones for distance. Are you saying I could use only the predictors from the best logistic model in the gamma model? My next step is repeating all of this with movement/distance compared to weather the day prior. The above test was for the weather the day of the movement.
zombie_kid, are you familiar with SAS, too? I have some SAS code handy I can send you; I don't have my R code version available at the moment. Would SAS code be any use to you? It's code that shows how to fit several ZI and "hurdle" models, both with PROC GENMOD and by directly using the log-likelihood (hurdle models are a similar idea to ZI models).
Okay I found a post explaining interpretation of gamma. I still take the exp(coef) and for each one unit increase the distance moved would increase by the exp(coef). So for what I posted above the distance moved would increase .99 per unit change in Temp, or being from patch 2, 3, and 4 (they all have the same value for coef).
Or
Is it like in odds ratio that a value <1 means the response would decrease? When I plot Distance ~ The covariates in best model there appears to be an increasing linear relationship, so I do not think this would be true.
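How a gamma GLM coefficient is read depends on the link function, which the sketch below tries to illustrate with the coefficients posted above. R's `Gamma` family uses the inverse link by default; whether the posted model used that default is an assumption here (the thread does not say). Under the inverse link the mean is 1 / (linear predictor), so a negative coefficient raises the predicted mean, while exp(coef) as a multiplicative effect only applies under a log link. The Temp values used are hypothetical, and Python is used only for illustration.

```python
import math

# Coefficients copied from the gamma GLM output posted above.
gamma_coefs = {
    "(Intercept)": 0.0392863,
    "patch.f2": -0.0089734,
    "patch.f3": -0.0066998,
    "patch.f4": -0.0020613,
    "Temp": -0.0008266,
}

def linear_predictor(patch=1, temp=0.0):
    """Linear predictor; patch 1 is assumed to be the reference level."""
    eta = gamma_coefs["(Intercept)"] + gamma_coefs["Temp"] * temp
    if patch in (2, 3, 4):
        eta += gamma_coefs[f"patch.f{patch}"]
    return eta

def mean_inverse_link(patch=1, temp=0.0):
    """Predicted mean if the model used the inverse link: mu = 1 / eta."""
    eta = linear_predictor(patch, temp)
    if eta <= 0:
        raise ValueError("inverse link needs a positive linear predictor")
    return 1.0 / eta

def mean_log_link(patch=1, temp=0.0):
    """Predicted mean if the same numbers came from a log link: mu = exp(eta)."""
    return math.exp(linear_predictor(patch, temp))

# Hypothetical temperatures, chosen only to compare the direction of the effect.
for temp in (10.0, 20.0):
    print(temp, round(mean_inverse_link(temp=temp), 2), round(mean_log_link(temp=temp), 4))
```

If the inverse link was in fact used, the negative Temp coefficient corresponds to distance increasing with temperature, which would agree with the increasing relationship seen in the plots.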
Unfortunately I am not familiar with SAS. Everyone drilled that R was the best into my head during grad school so I took a 2 week course in Statistics for Ecology in R.
that's okay, I can give the main gist
basically I wanted to solidify the idea that the model is a mixture model, it's a mixture between a logistic component and a gamma component, rather than two separate models
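One way to see the mixture idea is that the two components combine into a single prediction: the expected distance is the probability of moving times the mean distance given that the animal moved. A minimal Python sketch follows, for illustration only. The 889 zeros out of 1189 observations come from the thread; `mean_if_moving` is a hypothetical value, and `expected_distance` is a name invented for this example.

```python
def expected_distance(p_move, mean_if_moving):
    """Overall mean of the mixture: zeros contribute nothing to the mean."""
    if not 0.0 <= p_move <= 1.0:
        raise ValueError("p_move must be a probability")
    return p_move * mean_if_moving

# Observed share of non-zero distances reported in the thread.
n_total, n_zero = 1189, 889
observed_p_move = (n_total - n_zero) / n_total

# Hypothetical mean distance on days with movement (not from the data).
mean_if_moving = 40.0

print(f"observed P(move) = {observed_p_move:.3f}")
print(f"expected distance = {expected_distance(observed_p_move, mean_if_moving):.2f}")
```

In a fitted model both inputs would come from the covariates, so a predictor can change the expected distance through the logistic component, the gamma component, or both.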
Honestly, looking at the results I am getting with the gamma models I am wondering if it would just be best to test the hypothesis of what promotes movement rather than distance. I am thinking distance may not be the important thing at play here. The trick is convincing my committee.
Right, I think that's what Bolker describes in his book. I seem to be having a hard time wrapping my brain around mixture models.
eek
I just looked at the Wikipedia page for mixture models, it's pretty hairy
I'm not familiar with your Bolker reference
Since you seem to be at a university, look up this paper, it's not in your field, but it's a pretty straightforward explanation of ZI models and how it's a mixture between a logistic and Poisson component ... only difference is yours is mixture of logistic and gamma components
C E Rose, S W Martin, K A Wannemuehler, B D Plikaytis. 2006. On the use of zero-inflated and hurdle models for modeling vaccine adverse event count data. Journal of Biopharmaceutical Statistics 16(4):463-81.
The mathematical explanation of a statistical procedure is really just pseudo-code, which we can make operational by translating it into real computer code. --B. Klemens