# Analyzing dataset with zeros

#### Jbon

##### New Member
I am analyzing a dataset on offspring production in fruit flies that have mated with individuals from their own population or from another population. For each experiment I have data on offspring produced for four crosses: F1xM1, F2xM2, F1xM2, F2xM1 (N approximately 30 for each cross). Ideally I would like to use a factorial ANOVA, where I am interested in the two main effects (source population of the female and source population of the male) and the interaction between them (which may indicate incompatibilities between flies from different populations). I am unable to normalize the residuals through transformation, mainly because of zeros in the dataset: typically 5-20% of females do not produce any offspring (those that do produce typically have 25-40 offspring on average).

One solution I have tried is to break the analysis into two parts: (1) are there differences among the four crosses in the likelihood of failing to produce offspring (using Fisher's exact test), and (2) for those that did produce offspring, a factorial ANOVA, since a square root transformation does normalize the residuals.

However, I have just come across the possibility of using negative binomial regression to analyze all the data including zeros, but I'm not sure whether this is a better solution. I am unclear on how to determine that a negative binomial model fits my data well (I am using SPSS), or more specifically whether it is a better solution than breaking the analysis into two parts. I understand that I could use things like BIC or AIC to compare different models (e.g. Poisson vs. negative binomial), but that doesn't seem applicable here. The deviance/df estimate is about 2, and my understanding is that it ideally should be 1. I am not sure whether 2 would be considered too large, indicating a bad model fit. Or are there other ways I should check that the model fits well? Any thoughts or advice would be much appreciated, thanks!
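
For the deviance/df question, it may help to see how a dispersion ratio of this kind is computed. The sketch below is plain Python, not SPSS output, and it uses the simpler Poisson case: it takes the Pearson chi-square for a model with one fitted mean per cross (the full factorial model) and divides it by the residual degrees of freedom. The function name and the counts are hypothetical illustrations, not data from this experiment.

```python
def pearson_dispersion(groups):
    """Pearson chi-square / residual df for a Poisson cell-means model.

    `groups` maps a cross label to its list of offspring counts.
    Every cell needs a mean above zero, otherwise the ratio is undefined.
    """
    chi_sq = 0.0
    n_obs = 0
    for counts in groups.values():
        mean = sum(counts) / len(counts)  # fitted Poisson mean for this cell
        # Under a Poisson model the variance equals the mean,
        # so each squared residual is scaled by the fitted mean.
        chi_sq += sum((y - mean) ** 2 / mean for y in counts)
        n_obs += len(counts)
    resid_df = n_obs - len(groups)  # one estimated mean per cell
    return chi_sq / resid_df


# Hypothetical counts, including a few females with no offspring.
example = {
    "F1xM1": [0, 31, 28, 35, 40, 33],
    "F2xM2": [29, 0, 36, 27, 38, 30],
    "F1xM2": [25, 34, 0, 0, 31, 37],
    "F2xM1": [32, 26, 39, 0, 28, 35],
}
print(pearson_dispersion(example))
```

A ratio near 1 is consistent with the variance the model assumes, while a ratio well above 1 points to more spread than the model allows. With counts like these, the zeros alone push the Poisson ratio far above 1.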

#### mmercker

##### New Member
This is a typical problem that should be solved with zero-inflated models (or, if they do not fit well, hurdle models), since two different processes generate your data. The first is a binomial process that determines whether there are offspring or not (success/failure). The second process, the number of offspring, is based on a Poisson distribution (or, if you have overdispersion, a negative binomial distribution can be used instead). The advantage of zero-inflated models is that you don't have to artificially split your data into different parts, and that they are the "most natural" description of such a process.

I have no experience with SPSS; in R you could use `hurdle()` or `zeroinfl()` from the package `pscl`.
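
To make the difference between the two model families concrete, here is a minimal Python sketch of their probability mass functions, assuming a Poisson count component. It is not the `pscl` implementation; the function names and the parameters `pi_zero` and `lam` are illustrative only.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing k events with mean lam (lam > 0)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def zip_pmf(k, pi_zero, lam):
    """Zero-inflated Poisson.

    With probability pi_zero the observation is a 'structural' zero;
    otherwise it is an ordinary Poisson count, which can also be zero.
    """
    base = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + base if k == 0 else base


def hurdle_pmf(k, pi_zero, lam):
    """Hurdle Poisson.

    All zeros come from the binary part; positive counts follow a
    zero-truncated Poisson.
    """
    if k == 0:
        return pi_zero
    truncated = poisson_pmf(k, lam) / (1.0 - math.exp(-lam))
    return (1.0 - pi_zero) * truncated


# Illustrative values: 15% extra zeros, small mean so the difference is visible.
print(zip_pmf(0, 0.15, 3.0), hurdle_pmf(0, 0.15, 3.0))
```

In the zero-inflated version the total share of zeros is `pi_zero` plus whatever zeros the count process produces on its own, while in the hurdle version `pi_zero` is the total share of zeros.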

#### Jbon

##### New Member
Thank you for the suggestion. I had thought that my data would not be a great fit for a zero-inflated model since the zeros are presumed to come from two different processes in the model (which I don't think is really the case for my data), but a hurdle model seems like it might be a good fit.

#### mmercker

##### New Member
> since the zeros are presumed to come from two different processes in the model (which I don't think is really the case for my data)

Are you sure? From my point of view, the fact that a subset of females does not produce any offspring, while those that do have an expected value far from zero, indicates that you actually have two different processes, provided that you can't explain the zeros by a single cross that is simply sterile. If you observe a bimodal histogram for each of your factor levels, I would infer that two different processes are interfering with each other. Furthermore, offspring studies are very typical examples where zero-inflated models are used. Finally, hurdle models also assume two different processes; they handle the problem a little "less naturally" but usually converge better.
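
The argument that the zeros are unlikely to come from the count process itself can be checked with a quick calculation. The Python sketch below computes the probability of a zero under a Poisson and under a negative binomial distribution. The mean of 30 is taken from the 25-40 range mentioned in the question; the negative binomial size parameter of 5 is a hypothetical value chosen only for illustration.

```python
import math


def poisson_p0(mu):
    """P(Y = 0) for a Poisson distribution with mean mu."""
    return math.exp(-mu)


def negbin_p0(mu, size):
    """P(Y = 0) for a negative binomial with mean mu and size parameter
    `size`, parameterized so that the variance is mu + mu**2 / size."""
    return (size / (size + mu)) ** size


mu = 30.0    # within the 25-40 range reported for females that reproduce
size = 5.0   # hypothetical dispersion value, not estimated from any data

print("Poisson P(0):          ", poisson_p0(mu))
print("Negative binomial P(0):", negbin_p0(mu, size))
```

Both probabilities are far below the 5-20% of zeros reported. Under these assumed values, the count component would contribute almost no zeros, so nearly all of them would be attributed to the binary part.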

#### Jbon

##### New Member
I guess I'm not sure, and perhaps I misunderstand the model. The way I think I understand it, the model assumes that some portion of the zeros comes from individuals that were "ineligible." An example I have seen involves how many fish people have caught, where some people catch zero fish but others simply did not go fishing (and it is not known which individuals didn't fish). In my case, it seems like that would be equivalent to not having observed matings, such that some individuals mated but did not produce offspring while others simply did not mate. I didn't say it in my previous post, but I did observe that all females mated, so I cannot see a biological reason to consider some "ineligible" to reproduce (unless I assume that some portion are actually sterile, which is unlikely to be many). Is this understanding inaccurate? Thanks!

#### mmercker

##### New Member
I nevertheless think you should use zero-inflated models, since your data apparently show that two different processes are going on: I guess you have a bimodal histogram (with one peak at zero and another peak at ~35). Even if all females mated, I think there are several possible biological reasons for a yes/no binomial process that first of all decides whether there are offspring or not. Think about humans: they also do not conceive every time they have sex, and the possible reasons are varied: age and fitness of the parents, maybe genetic compatibility, fitness of the sperm, the woman's ovulation time, the random process of whether a sperm meets the ovum...