dealing with zeros in count dataset

#1
Hi,
I have a count dataset from about 50 samples (survey locations), roughly 70% of which are zeros. I am using this sample to estimate the population size of a species of gibbon across a landscape. However, because my data are so positively skewed, the lower confidence limit for the population estimate is negative, which is obviously wrong, as we found gibbons at several sites. The research I have done suggests that transforming the data won't help, as it won't remove the positive skew.

How can I deal with these data to get reasonable CIs around my population estimate? Any help would be much appreciated.

Cheers
nomascus
 

bugman

Super Moderator
#2
Hi nomascus,

When you have count data, the distribution is likely to be Poisson (where the mean equals the variance) rather than normal.

Adding a constant to the data set (e.g. 0.5) will allow a square root transformation, which is usually appropriate for count data with a lot of zeros.

Phil.
 
#3
Hi Phil,
Thanks for the reply. I have tried this and log(x+1) transformations, but the issue of having 70% of the distribution in the far left-hand tail remains: the data are nowhere near normal. Any other suggestions?
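
To illustrate why the transformations don't help, here is a minimal sketch. The counts are made up for illustration (50 sites, 35 of them zeros) and are not the actual survey data. Because both transformations are monotone, every zero maps to the same transformed value, so the 70% spike at the bottom of the distribution is still there afterwards.

```python
import math

# Hypothetical counts for 50 survey sites: 35 zeros (70%) plus some detections.
counts = [0] * 35 + [1] * 6 + [2] * 4 + [3] * 2 + [5, 8, 12]

def share_at_minimum(values):
    """Fraction of observations sitting exactly at the smallest value."""
    lowest = min(values)
    return sum(1 for v in values if v == lowest) / len(values)

raw_share = share_at_minimum(counts)
sqrt_share = share_at_minimum([math.sqrt(c + 0.5) for c in counts])  # sqrt(x + 0.5)
log_share = share_at_minimum([math.log(c + 1) for c in counts])      # log(x + 1)

print(f"raw: {raw_share:.0%}, sqrt(x+0.5): {sqrt_share:.0%}, log(x+1): {log_share:.0%}")
```

The transformations compress the upper tail, but they cannot spread out a pile of identical values.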
Cheers
Nomascus
 
#5
Hi Nomascus,

I just wanted to add that if 70% of your values are 0, then the Poisson assumption of mean=variance that Phil noted is unlikely to hold.
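
A quick way to check this is to compare the sample variance with the sample mean, and the observed share of zeros with what a Poisson with the same mean would predict. The sketch below uses made-up counts (50 sites, 35 zeros), not the real survey data, so treat the numbers as illustrative only.

```python
import math
from statistics import mean, variance

# Hypothetical counts for 50 survey sites: 35 zeros (70%) plus some detections.
counts = [0] * 35 + [1] * 6 + [2] * 4 + [3] * 2 + [5, 8, 12]

m = mean(counts)
v = variance(counts)  # sample variance (n - 1 denominator)

# Under a Poisson model the variance-to-mean ratio should be close to 1.
dispersion = v / m

# Share of zeros observed vs. share a Poisson with the same mean would give.
observed_zero_share = counts.count(0) / len(counts)
poisson_zero_share = math.exp(-m)  # P(X = 0) = exp(-mean) for a Poisson

print(f"mean = {m:.2f}, variance = {v:.2f}, variance/mean = {dispersion:.2f}")
print(f"zeros observed = {observed_zero_share:.0%}, "
      f"zeros expected under Poisson = {poisson_zero_share:.0%}")
```

A ratio well above 1, or many more zeros than the Poisson predicts, suggests that the plain Poisson is a poor fit.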

But there are a number of extensions to the Poisson regression model that can take the higher variance into account. These include overdispersed Poisson, negative binomial and zero-inflated Poisson (ZIP) models.

The first two are more appropriate if you have not only a lot of 0s, but also a few high counts at certain locations. If it's just more 0s, then a ZIP is usually a better model. The ZIP is actually pretty cool: it's a mixture model of a binary logistic regression and a Poisson.
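
To make the mixture idea concrete, here is a minimal sketch of the ZIP probability function without any covariates. In a real ZIP regression the zero probability would come from the logistic part and the Poisson mean from the count part. The names `zip_pmf`, `pi` and `lam`, and the values 0.6 and 2.0, are just assumptions for illustration.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of observing count k with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def zip_pmf(k: int, pi: float, lam: float) -> float:
    """Zero-inflated Poisson: with probability pi the count is a
    'structural' zero; otherwise it is drawn from Poisson(lam)."""
    if k == 0:
        return pi + (1 - pi) * poisson_pmf(0, lam)
    return (1 - pi) * poisson_pmf(k, lam)

pi, lam = 0.6, 2.0  # illustrative values, not estimates from any data

zip_mean = (1 - pi) * lam
zip_var = (1 - pi) * lam * (1 + pi * lam)  # exceeds the mean whenever pi > 0

print(f"P(0) under ZIP     = {zip_pmf(0, pi, lam):.3f}")
print(f"P(0) under Poisson = {poisson_pmf(0, zip_mean):.3f} (same mean)")
print(f"ZIP mean = {zip_mean:.2f}, ZIP variance = {zip_var:.2f}")
```

The extra zeros come from the `pi` component, which is also why the ZIP variance is larger than its mean.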

Anyway, if you have more questions on it, let me know. I can recommend some good resources for learning Poisson regression.

Karen
 

bugman

Super Moderator
#6
Actually, Karen, I would be keen to have a look at those resources, especially regarding the negative binomial and ZIP, if you wouldn't mind.

Thanks for the post.

Phil
 
#7
Sure, Phil. I'll just post here.

I would start with:

  • Gardner, W., Mulvey, E.P., Shaw, E.C. (1995). Regression Analyses of Counts and Rates: Poisson, Overdispersed Poisson, and Negative Binomial Models. Psychological Bulletin.
  • Long, J.S. (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
  • Long, J.S. & Freese, J. (2006). Regression Models for Categorical Dependent Variables Using Stata. Stata Press.

The Gardner article is a brief overview. It covers Poisson and negative binomial models but has nothing on ZIP. It is easy to read and a great place to start if you're new to the topic.

The Long book is fabulous, but gets a little mathy. He is a sociologist, so it is definitely written for researchers, not statisticians, but there is calculus and linear algebra in places. This is the best place to find in-depth information, especially on ZIP.

The Long and Freese book is more applied, less math. I would recommend it even if you don't use Stata.

Because I get asked for resources so often, I created a list of my favorite ones for many topics at: http://www.analysisfactor.com/resources/Top-Resources.html. I pulled these right out of it. To get it, you have to sign up, but you can always unsubscribe if you're not interested.

And it just so happens that my next teleseminar is on Poisson and Negative Binomial Regression. It's on Jan 27th, and it's free, but you have to register. (That's why I've been rereading this stuff recently.) :)

Karen