Non-parametric tests - still valid if your data are dominated by 0s and 1s?

Hi everyone,
I have a question about non-parametric tests: what should you do when your data are dominated by low values (e.g. 0s and 1s)? I'll give an example below.

I am doing some research on butterfly sampling, aiming to determine whether each species is more abundant in high-quality or degraded habitat. I concurrently sampled each habitat type with five traps for five days, so I should be able to compare the number of captures using 25 replicates. The data are not normally distributed and standard transformations do not correct this, so I'm planning to use non-parametric tests, probably Mann-Whitney U-tests.

To give an example, take the captures for one species in intact and degraded habitat (tabulated further down).

A Mann-Whitney U test suggests a statistically significant difference between the samples (U = 197.5, n = 25 per habitat, P < 0.05), which makes sense when you compare the total captures per habitat: intact = 25, degraded = 11.

However, when I look at the data in more detail I worry about the accuracy of the test. The dataset is dominated by 0s and 1s (making up 90% of all results, see below), so when the data are converted to ranks for the Mann-Whitney U test there will be a very large number of tied ranks.

Captures per replicate   Intact   Degraded
0                        7        14
1                        13       11
2                        3        0
3                        2        0
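
As a rough check on how the ties enter the calculation, here is a minimal Python sketch that expands the frequency table above into raw counts, computes U with each tied pair counted as half, and compares the normal-approximation p-value with and without the usual tie correction to the variance. The function names are made up for illustration, no continuity correction is applied, and this is not necessarily how any particular statistics package computes its p-value.

```python
from collections import Counter
from math import erfc, sqrt

# Counts per replicate, expanded from the frequency table above.
intact = [0] * 7 + [1] * 13 + [2] * 3 + [3] * 2
degraded = [0] * 14 + [1] * 11


def mann_whitney_u(x, y):
    """Return (U_x, U_y), counting each tied pair as half a win."""
    u_x = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    return u_x, len(x) * len(y) - u_x


def two_sided_p(u, x, y, tie_correction=True):
    """Normal-approximation p-value for U (no continuity correction)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    tie_term = 0.0
    if tie_correction:
        # Each group of t tied values reduces the variance of U.
        tie_sizes = Counter(x + y).values()
        tie_term = sum(t**3 - t for t in tie_sizes) / (n * (n - 1))
    variance = n1 * n2 / 12 * ((n + 1) - tie_term)
    z = (u - n1 * n2 / 2) / sqrt(variance)
    return erfc(abs(z) / sqrt(2))  # two-sided tail area of a standard normal


u_intact, u_degraded = mann_whitney_u(intact, degraded)
u = min(u_intact, u_degraded)
p_corrected = two_sided_p(u, intact, degraded, tie_correction=True)
p_uncorrected = two_sided_p(u, intact, degraded, tie_correction=False)

print(f"U = {u}")
print(f"p with tie correction    = {p_corrected:.4f}")
print(f"p without tie correction = {p_uncorrected:.4f}")
```

The smaller of the two U values matches the U = 197.5 reported above. Because the tie term can only shrink the variance, leaving the correction out can only make this approximate p-value larger, never smaller.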

So I was wondering whether I can rely on a test when the data are dominated by a few values in this way. Is it likely to affect the accuracy of the test? Should I try to pool data in order to increase the heterogeneity of counts?

On balance I guess that it would be more likely to cause a Type II error due to similarity of summed ranks, so a significant difference should be fairly reliable - would that be fair to say?
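
One way to probe this without leaning on a large-sample approximation is a permutation test: shuffle the habitat labels many times and see how often a difference at least as large as the observed one turns up by chance. The sketch below is a swapped-in technique, not the test used above. The choice of the difference in mean captures as the test statistic, the number of shuffles, and the seed are all assumptions made for illustration.

```python
import random

# Counts per replicate, expanded from the frequency table above.
intact = [0] * 7 + [1] * 13 + [2] * 3 + [3] * 2
degraded = [0] * 14 + [1] * 11


def permutation_p_value(x, y, n_perm=20000, seed=1):
    """Two-sided Monte Carlo permutation test on the difference in means."""
    rng = random.Random(seed)  # fixed, arbitrary seed so the run is repeatable
    pooled = x + y
    n_x, n_y = len(x), len(y)
    total = sum(pooled)
    observed = abs(sum(x) / n_x - sum(y) / n_y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign habitat labels at random
        sum_x = sum(pooled[:n_x])
        diff = abs(sum_x / n_x - (total - sum_x) / n_y)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (hits + 1) / (n_perm + 1)


p_perm = permutation_p_value(intact, degraded)
print(f"permutation p-value = {p_perm:.4f}")
```

Ties are handled automatically here, since the shuffled data contain exactly the same values as the real data. If this p-value lands close to the Mann-Whitney one, that suggests the ties are not distorting the conclusion much for this species.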

Re: Non-parametric tests - still valid if your data are dominated by 0s and 1s?

If yet another user comes along with "non-parametric" in their title and says:

Originally Posted by NCM

The data are not normally distributed and ..... so I'm planning to use non-parametric tests...

... then I think I am going to scream!

Iiiiiiiiiiiiiiiiiiiiiiiiiiiii!

- - -

If the data turn out not to be normally distributed, isn't it more natural to abandon the normal distribution and look for another distribution, instead of throwing away altogether the idea that the data can be modelled by a distribution?
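
For count data such as trap captures, one common candidate is the Poisson distribution (named here as an assumption, it is only one of several options). A quick informal check is to compare the sample mean and variance in each habitat, since a Poisson variable has variance equal to its mean. The sketch below uses the counts from the frequency table in the first post. The variable names are illustrative only.

```python
from statistics import mean, variance

# Counts per replicate, expanded from the frequency table in the first post.
intact = [0] * 7 + [1] * 13 + [2] * 3 + [3] * 2
degraded = [0] * 14 + [1] * 11

dispersion = {}
for name, counts in (("intact", intact), ("degraded", degraded)):
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 denominator)
    dispersion[name] = v / m  # equals 1 in expectation for Poisson data
    print(f"{name:9s} mean = {m:.2f}  variance = {v:.2f}  "
          f"variance/mean = {dispersion[name]:.2f}")
```

Ratios well above 1 would hint at overdispersion and ratios well below 1 at underdispersion. With only 25 values per habitat this is a rough guide rather than a formal test.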

Re: Non-parametric tests - still valid if your data are dominated by 0s and 1s?

Hi GretaGarbo, thanks for the reply, and I'm sorry to make you scream!

I'm a field biologist and have not had advanced statistical training, so I'm afraid my methods are pretty basic. I generally work with simple tests (comparisons of means/medians, correlations, etc.) for small datasets, and rarely look at other distributions. I was taught to test the data for similarity to a normal distribution, and then to test potential transformations to approximate a normal distribution; if neither works then we just use non-parametric tests. I'm happy to test some other distributions if it would help. Which would you suggest, and would it be possible with only 25 replicates?

Re: Non-parametric tests - still valid if your data are dominated by 0s and 1s?

Thanks GretaGarbo for the reply and for including your workings in R. I'm going to play around with this for a while and see if I can work it out, and will post again afterwards.

Re: Non-parametric tests - still valid if your data are dominated by 0s and 1s?

Ok I've been looking into this and reading up on some of the techniques that I'm not familiar with, but I don't have much experience with GLMs or R so I'm afraid I'm at the limit of my knowledge here. I'll try to explain my thoughts on this below, but I apologise if it's a bit simplistic.

So with the example in R it looks like you ran a generalised linear model to compare the intact and degraded cases, based on the assumption that the data approximately follow a Poisson distribution. By my understanding, a GLM is a flexible type of regression that allows for non-normal distributions. Is the aim of the statistical test then to determine differences between the slopes of the regression in each habitat? I'm a bit lost on this: what exactly is the GLM comparing?
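
To make the question concrete: the R code is not reproduced in this thread, so the following is a guess at what such a model compares, assuming a Poisson GLM with a log link and habitat as its only predictor. In that case there are no slopes against a continuous variable. The single habitat coefficient is the log of the ratio of mean captures between the two habitats, and its Wald standard error depends only on the total captures in each habitat. The sketch works this out by hand from the totals given in the first post. All variable names are illustrative.

```python
from math import erfc, exp, log, sqrt

# Totals from the first post: 25 replicates per habitat.
total_intact, total_degraded = 25, 11
n_intact = n_degraded = 25

mean_intact = total_intact / n_intact
mean_degraded = total_degraded / n_degraded

# With habitat as the only predictor, the fitted means equal the group means,
# so the habitat coefficient is the log of the ratio of those means.
log_rate_ratio = log(mean_degraded / mean_intact)

# Wald standard error of that coefficient under the Poisson assumption.
se = sqrt(1 / total_intact + 1 / total_degraded)

z = log_rate_ratio / se
p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail area

print(f"rate ratio (degraded / intact) = {exp(log_rate_ratio):.2f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```

Read this way, the model asks whether the mean capture rate differs between habitats, which is a more direct statement of the research question than a comparison of ranks. It does rely on the Poisson assumption being reasonable for these counts.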

Looking back at the Mann-Whitney U test, would that also be suitable to answer the question - whether significantly more butterflies are caught in intact than degraded forest - or is it too simplistic?