Thread: Distribution of a subset of the complete sample

1. Distribution of a subset of the complete sample

Dear all,

My question can be easily misunderstood, so please allow me to briefly explain the experimental conditions.

I have measured a variable in 120 individuals in total, 60 of one genotype and 60 of another. Within each genotype, the distribution of the complete sample is non-normal. Thus, to compare the two genotypes I have either used non-parametric tests or transformed the data so that they are approximately normal within each genotype and then used parametric tests. However, in 20 of the 120 individuals (10 of one genotype and 10 of the other), after measuring the initial response, I added a drug and measured again. To compare their means before and after the addition of the drug I should use a test for related measurements, such as repeated-measures ANOVA, using only the values from those 20 individuals.
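
For readers who want to see the two routes side by side, here is a minimal Python sketch using SciPy. The data are synthetic (a right-skewed, lognormal-like shape is assumed purely for illustration), and `compare_genotypes` is a hypothetical helper name, not part of the actual analysis.

```python
import numpy as np
from scipy import stats


def compare_genotypes(a, b):
    """Return p-values from a rank-based test and from a t-test on log values."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Route 1: Mann-Whitney U test, which makes no normality assumption.
    _, p_rank = stats.mannwhitneyu(a, b, alternative="two-sided")
    # Route 2: log-transform (assumes strictly positive data), then Welch's t-test.
    _, p_log_t = stats.ttest_ind(np.log(a), np.log(b), equal_var=False)
    return p_rank, p_log_t


# Synthetic, right-skewed values for 60 individuals per genotype (assumption).
q = (np.arange(1, 61) - 0.5) / 60
genotype_a = np.exp(0.5 * stats.norm.ppf(q))
genotype_b = 2.0 * genotype_a  # same shape, doubled scale

p_rank, p_log_t = compare_genotypes(genotype_a, genotype_b)
print(f"Mann-Whitney p = {p_rank:.3g}, t-test on logs p = {p_log_t:.3g}")
```

Note that the log transform is only one possible choice; whether it makes real data approximately normal has to be checked case by case.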

My question is: should I check those 20 values for normality again, or assume that they follow a non-normal distribution like the complete sample?

If these 20 values have a normal distribution, is it valid to use parametric tests only for these 20 individuals, despite the fact that the larger sample of 120 is clearly non-normal?

Thank you very much in advance.

2. Re: Distribution of a subset of the complete sample

Hi ampws,

If I understood correctly, you have a single variable measured in 120 individuals, split into two groups. You then tested whether there was a difference between those two groups. Now, for only 20 individuals, you added a drug and measured again, and you want to test for differences between the initial state and the "after" state. I wouldn't rely on parametric tests that are based on normality assumptions for this case, first because of the sample size and also because of the non-normality found in the whole dataset. A safer approach would be a non-parametric test for paired data, the counterpart of a paired t-test, such as Wilcoxon's signed rank test. I'd recommend that over parametric tests, even if the subsample is approximately normally distributed.
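
In case it helps, this is roughly what the paired test looks like in Python with SciPy. It is only a sketch: the before/after values are made up (20 skewed baseline values plus an assumed positive drug effect), and `paired_drug_test` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats


def paired_drug_test(before, after):
    """Wilcoxon signed-rank test on paired before/after measurements."""
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    if before.shape != after.shape:
        raise ValueError("before and after must contain one value per individual")
    # The test ranks the within-individual differences (before - after).
    statistic, pvalue = stats.wilcoxon(before, after)
    return statistic, pvalue


# Made-up data for 20 individuals (assumption, not the real measurements).
rng = np.random.default_rng(0)
before = rng.lognormal(mean=1.0, sigma=0.5, size=20)
after = before + rng.uniform(0.5, 1.5, size=20)  # every individual increases

stat, p = paired_drug_test(before, after)
print(f"Wilcoxon statistic = {stat}, p = {p:.3g}")
```

Each individual must appear at the same position in both arrays, since the test works on the per-individual differences rather than on the two sets of values separately.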

Good luck

3. The Following User Says Thank You to terzi For This Useful Post:

ampws (05-23-2013)

4. Re: Distribution of a subset of the complete sample

Thank you very much terzi for your answer. I am most grateful. One thing I still need to clarify though:

Originally Posted by terzi

I wouldn't rely on parametric tests that are based on normality assumptions for this case, first because of the sample size and also because of the non-normality found in the whole dataset.

Even if the sample size were larger, would you still prefer to use non-parametric tests on the subsample because of the non-normality found in the whole dataset?

Again, thank you very much!

5. Re: Distribution of a subset of the complete sample

I would say yes, that is what they were alluding to, since the full sample had questionable normality. Wilcoxon's signed rank test should be a good fit for these data.

It was not clear to me whether you gave the drug to randomly chosen individuals, or how else they were selected.

6. The Following User Says Thank You to hlsmith For This Useful Post:

ampws (05-27-2013)

7. Re: Distribution of a subset of the complete sample

The individuals that were given the drug were randomly selected.

Before posting this question I also thought that the larger sample is more trustworthy and determines what to expect from random sub-samples. In fact, to my understanding that is the very essence of statistics: the distribution of any random sample should resemble the distribution of the whole population. Otherwise, we would always need to measure every individual in the population. So the statement above should always apply. If the distribution of a random sample differs from that of the whole population, this would be either because the sample size is too small, or because the whole population is not actually one population but a mixture of at least two. In our experimental setup, however, we have no reason to believe that there are two or more populations of individuals. Of course, it would be best to check this somehow, but that gets too complicated, and a quick screening of all the values gives no sign of it.
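
A small simulation can make the sample-size point concrete. The sketch below is only illustrative: the "complete sample" of 120 values is synthetic and right-skewed by construction, and `subsample_pass_rate` is a hypothetical helper name. It counts how often a random subsample of 20 is not flagged as non-normal by the Shapiro-Wilk test, even though the full sample clearly is.

```python
import numpy as np
from scipy import stats


def subsample_pass_rate(sample, n_sub=20, n_draws=2000, alpha=0.05, seed=0):
    """Fraction of random subsamples for which Shapiro-Wilk does not reject normality."""
    rng = np.random.default_rng(seed)
    passes = 0
    for _ in range(n_draws):
        # Draw n_sub individuals without replacement, as in a random selection.
        sub = rng.choice(sample, size=n_sub, replace=False)
        _, p_sub = stats.shapiro(sub)
        if p_sub > alpha:
            passes += 1
    return passes / n_draws


# Synthetic, clearly right-skewed "complete sample" of 120 values (assumption).
q = (np.arange(1, 121) - 0.5) / 120
full_sample = np.exp(0.5 * stats.norm.ppf(q))

_, full_p = stats.shapiro(full_sample)
rate = subsample_pass_rate(full_sample)
print(f"Full sample Shapiro-Wilk p = {full_p:.3g}")
print(f"Share of 20-value subsamples not rejected: {rate:.2f}")
```

A subsample that is not rejected here is still drawn from the same skewed values; the test simply has little power with 20 observations, so passing a normality test in the subsample is weak evidence that the subsample is actually normal.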

In any case, I think now I am convinced about what I should do. Thank you both very much for your replies.
