# Need help with analysing count data with lots of zeros

#### Smurfben

##### New Member
Hello all,

I'm a bit rusty on my stats, so please be gentle with me, as I might not always understand the terminology.

I have a problem I am hoping you can help me with regarding some research that I am currently trying to analyze.

The study:

I scan-sampled ant colonies 3 times a day for aggressive behaviour (bites, pulling and spreading) between queens and workers. I have two groups: a control group (the colony is unmanipulated) and a treatment group (I have altered the structure of the colony). My null hypothesis is that there is no difference in the number of aggressive interactions between the control and treatment colonies, tested separately for bites, pulls and spreads. I predict that there will be more aggression in the control group than in the treatment group.

My data for each colony per day range from 0 to 3. Aggressive interactions are typically rare across both groups, which means there are a lot of 0 counts in my data. Eyeballing my data suggests to me that my prediction might be correct, but I am unsure how to test it statistically. What would be the best way of testing my hypothesis?

All help will be very much appreciated. Thanks

#### noetsi

##### Fortran must die
One way you could do this, though I don't know if it is ideal, would be to use the four levels you mention (0-3) as your dependent variable, with a dummy independent variable coded 1 if there was an intervention and 0 otherwise (or the reverse; it does not matter). Then see if this IV is statistically significant with logistic regression. If the levels from 0-3 are ordered (for example, 1 is higher than 0 and 2 is higher than 1), use ordered logistic regression. Otherwise use multinomial logistic regression. If the intervention IV is significant, it suggests the intervention matters. For that conclusion to hold, the effect also has to be in the direction you expect, for example fewer attacks if that is what you think the intervention causes.

Having a very small number of one type of case (limited variation) is a problem. One possible solution is to increase the overall number of measurements. Agresti suggests a minimum number of cases for the least common level of the DV; unfortunately I don't have that number readily available. You might search online for Agresti and logistic regression (I think it is in a 2006 book). If there is too little variation you might encounter separation or quasi-separation, which will mean your model won't run (or will generate nonsensical results). Most commercial software will warn you if this is occurring.
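To make the coding concrete, here is a minimal sketch of the setup described above: scores 0-3 as the DV, a 0/1 dummy for the intervention, and a cross-tab to spot levels that never occur in one of the groups. The scores are hypothetical numbers invented for illustration, the function names are my own, and an empty cell is only one common warning sign of separation, not a formal test. It does not fit the regression itself.

```python
from collections import Counter

# Hypothetical daily scores (0-3 per colony-day); invented for illustration only.
control_scores = [0, 1, 0, 0, 2, 3, 0, 1, 0, 0]
treatment_scores = [0, 0, 0, 1, 0, 0, 0, 0, 2, 0]


def to_long_format(control, treatment):
    """Return (score, treated) rows, where treated is the 0/1 dummy IV."""
    rows = [(score, 0) for score in control]
    rows += [(score, 1) for score in treatment]
    return rows


def cross_tab(rows, levels=(0, 1, 2, 3)):
    """Count observations at each DV level as (control, treatment) pairs."""
    counts = Counter(rows)
    return {level: (counts[(level, 0)], counts[(level, 1)]) for level in levels}


def sparse_levels(table):
    """Levels with no observations in at least one group."""
    return [level for level, (c, t) in table.items() if c == 0 or t == 0]


rows = to_long_format(control_scores, treatment_scores)
table = cross_tab(rows)
for level, (n_control, n_treated) in table.items():
    print(f"score {level}: control={n_control}, treatment={n_treated}")
print("levels empty in a group:", sparse_levels(table))
```

If a level shows up in the last line, that is a hint that the model may struggle to estimate the intervention effect for that level, and collecting more observations or merging rare levels might be worth considering.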

#### Smurfben

##### New Member

I'm not sure I was very clear about my experimental design. I sampled each colony at three different time points each day and recorded the behaviour at that exact time. Basically, I have this kind of data set (using the bite data as an example; I have made up these data):

Control workers bite queens

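| Day | Colony A1 | Colony A2 | Colony A3 | Colony A4 | Colony A5 |
|-----|-----------|-----------|-----------|-----------|-----------|
| 1   | 0         | 1         | 0         | 0         | 0         |
| 2   | 0         | 2         | 0         | 1         | 0         |
| 3   | 0         | 3         | 0         | 3         | 0         |
| 4   | 1         | 0         | 0         | 3         | 0         |
| 5   | 1         | 0         | 0         | 0         | 0         |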

Treatment workers bite queens

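| Day | Colony A1 | Colony A2 | Colony A3 | Colony A4 | Colony A5 |
|-----|-----------|-----------|-----------|-----------|-----------|
| 1   | 0         | 1         | 0         | 0         | 0         |
| 2   | 0         | 2         | 0         | 1         | 0         |
| 3   | 0         | 0         | 0         | 0         | 0         |
| 4   | 1         | 0         | 0         | 0         | 0         |
| 5   | 0         | 0         | 0         | 0         | 0         |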

Because I sample 3 times a day, a queen could be bitten 0, 1, 2 or 3 times on a given day. I am not sure if your suggestion is still appropriate for this data set. I want to know if there are significantly more bites in the control than in the treatment. I'm sorry if I'm being a bit thick.
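For what it's worth, one simple way to frame the "more bites in the control" question is a one-sided exact permutation test on per-colony totals, sketched below with the made-up bite counts from the two tables above. This is only an illustration under assumptions: it treats the control and treatment colonies as ten separate, independent colonies (the shared A1-A5 labels might instead mean paired colonies, which would need a different test), it sums over days because repeated days within a colony are unlikely to be independent, and the function name is my own. It is a different technique from the logistic regression suggested earlier.

```python
from itertools import combinations

# Made-up bite counts from the tables above: one list per colony, days 1-5.
control = {
    "A1": [0, 0, 0, 1, 1],
    "A2": [1, 2, 3, 0, 0],
    "A3": [0, 0, 0, 0, 0],
    "A4": [0, 1, 3, 3, 0],
    "A5": [0, 0, 0, 0, 0],
}
treatment = {
    "A1": [0, 0, 0, 1, 0],
    "A2": [1, 2, 0, 0, 0],
    "A3": [0, 0, 0, 0, 0],
    "A4": [0, 1, 0, 0, 0],
    "A5": [0, 0, 0, 0, 0],
}

# Collapse to one number per colony, so the colony is the unit of analysis.
control_totals = [sum(days) for days in control.values()]
treatment_totals = [sum(days) for days in treatment.values()]


def one_sided_permutation_p(group_a, group_b):
    """Share of all relabellings where group A's total is >= the observed total."""
    pooled = group_a + group_b
    observed = sum(group_a)
    splits = list(combinations(range(len(pooled)), len(group_a)))
    extreme = sum(
        1 for idx in splits if sum(pooled[i] for i in idx) >= observed
    )
    return extreme / len(splits)


p_value = one_sided_permutation_p(control_totals, treatment_totals)
print("control totals:", control_totals)
print("treatment totals:", treatment_totals)
print(f"one-sided permutation p-value: {p_value:.4f}")
```

With only five colonies per group there are just 252 possible relabellings, so the test can be enumerated exactly, but it also means the smallest achievable p-value is limited and real data would need more colonies to show a clear effect.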