Dear Sirs,

I am running several thousand Mann-Whitney U tests on a genomic dataset. The sample size varies considerably between tests: the number of independent measurements per test ranges from 4 to 20, with outliers up to 75.

I understand that tests with more measurements (samples) have more power and are therefore more likely to survive multiple-testing correction (I currently control the FDR), i.e. the resulting list will be biased towards regions with larger sample sizes.
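For concreteness, the workflow described above can be sketched roughly as follows. This is only a minimal Python illustration on simulated null data, not a description of the actual pipeline: the use of NumPy/SciPy, the helper names bh_adjust and simulate_pvalues, the normal noise, and the reading of "4 to 20 measurements" as a per-group size are all assumptions made for the example. The FDR step shown is the Benjamini-Hochberg adjustment.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices of p-values, ascending
    scaled = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # Enforce monotonicity: running minimum taken from the largest p-value down.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def simulate_pvalues(n_tests=2000, seed=0):
    """One two-sided Mann-Whitney U p-value per simulated region (all nulls)."""
    rng = np.random.default_rng(seed)
    pvalues = np.empty(n_tests)
    for i in range(n_tests):
        # Assumption for illustration: each group has between 4 and 20 measurements.
        n_a, n_b = rng.integers(4, 21, size=2)
        a = rng.normal(size=n_a)
        b = rng.normal(size=n_b)
        pvalues[i] = mannwhitneyu(a, b, alternative="two-sided").pvalue
    return pvalues


pvalues = simulate_pvalues()
qvalues = bh_adjust(pvalues)
print("regions with q < 0.05:", int((qvalues < 0.05).sum()))
```

Every test contributes exactly one p-value to the adjustment, regardless of how many measurements it was computed from; the sample size only enters through the p-value itself.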

However, a reviewer pointed out that, apart from this bias, the p-values are "not comparable" and that standard adjustment procedures such as FDR cannot be used at all.

I understand the bias, but if I accept it (and discuss it extensively in the interpretation of the results, as I do), I do not see a problem with applying the FDR adjustment.

Thanks for any insights!

Kind regards,