# Nonparametrics, which test should I use?

#### Amana

##### New Member
Hello.
I am a veterinary student who is stuck in some statistics.
I am doing an experiment with mice. I have two treatment groups and have measured the weights (in grams) of each mouse weekly to see if there is any difference in weight gain between the two groups.
I have divided them by sex, because it is evident that male mice gain more weight than females irrespective of treatment group. I have checked the weight before beginning treatment, and the two groups did not differ significantly in starting weight.
I have checked for normal distribution with the Anderson-Darling test and my data are not normally distributed, because I have one outlier in each group.
- What test should I use?
My sample sizes are 9 for group 1 and 10 for group 2. Are they too small for the Mann-Whitney test?
Can I use the Kruskal-Wallis test when I only have two groups?

I would be really happy for some answer! Thanks.

The data (in grams for week 4 into treatment):
Group 1:
18,2
18,6
18,0
18,4
19,0
16,5
18,3
18,6

Group 2:
17,8
17,2
19,1
16,5
16,7
16,6
16,9
16,6
16,8

#### mostater

##### New Member
Hi Amana,

The Kruskal-Wallis test should give you a similar, if not identical, answer to the Mann-Whitney U test. Having a small sample is actually a good reason to use a non-parametric test.

Even though there were no pre-treatment differences, you may want to consider comparing the change (post minus pre) between groups via the Mann-Whitney U test. That might actually be of more interest as it takes into account the starting weight.
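
For anyone who wants to check the Kruskal-Wallis claim on the week 4 values posted above, here is a minimal sketch in R. With two groups, the Kruskal-Wallis p-value should coincide with the Mann-Whitney p-value computed from the normal approximation without continuity correction (the `exact = FALSE, correct = FALSE` settings are my choice for this comparison, not something from the thread). The default `wilcox.test` call applies a continuity correction, so its p-value is slightly larger.

```r
# Week 4 weights in grams, as posted by Amana
x1 <- c(18.2, 18.6, 18.0, 18.4, 19.0, 16.5, 18.3, 18.6)        # group 1
x2 <- c(17.8, 17.2, 19.1, 16.5, 16.7, 16.6, 16.9, 16.6, 16.8)  # group 2

# Kruskal-Wallis with only two groups
kw <- kruskal.test(list(x1, x2))

# Mann-Whitney via the normal approximation, no continuity correction
mw <- wilcox.test(x1, x2, exact = FALSE, correct = FALSE)

# The two p-values agree up to floating point error
print(c(kruskal = kw$p.value, mann_whitney = mw$p.value))
```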

#### gianmarco

##### TS Contributor
Hi,
in my opinion, the size of your sample should not guide your decision about which test (parametric vs non-parametric) to use. It is whether the assumptions are met that should guide you.
I believe that, if you cannot enlarge your samples, you have to work with what you have. The problem with a small sample size is that you will have less power (that is, less capacity to detect a difference).
I think that, unless the t-test assumptions are met, MW is the right option.

Hope this helps
Regards
Gm

#### GretaGarbo

##### Human
Amana wrote:
“My sample sizes are 9 for group 1 and 10 for group 2.”
(I am not so good at counting, but I can only see 8 numbers for group 1 and 9 for group 2.)

Maybe some numbers are missing. It is quite important because it is almost significant with the Wilcoxon-Mann-Whitney test.

But maybe one can do a completely different test ….

#### trinker

##### ggplot2orBust
Amana said:
“I have checked for normal distribution with anderson-darling test and my data are not normally distributed, because I have one outlier in each group.”
Rarely is data normally distributed. The assumption regarding normality isn't about the data, it's about the residuals. The low n and the two outliers lead me to wonder if this may be an occasion for a permutation test or bootstrapping.
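
A rough sketch of what checking the residuals could look like in R, using the week 4 values from the first post. The residuals here are simply each weight minus its own group mean. The Shapiro-Wilk test is swapped in for Anderson-Darling only because it ships with base R; that substitution is an assumption of this sketch, not what Amana ran.

```r
# Week 4 weights in grams, as posted by Amana
x1 <- c(18.2, 18.6, 18.0, 18.4, 19.0, 16.5, 18.3, 18.6)        # group 1
x2 <- c(17.8, 17.2, 19.1, 16.5, 16.7, 16.6, 16.9, 16.6, 16.8)  # group 2

weight <- c(x1, x2)
group  <- factor(rep(c("group1", "group2"), times = c(length(x1), length(x2))))

# A one-factor linear model: fitted values are the two group means
fit <- lm(weight ~ group)
res <- residuals(fit)  # weight minus own group mean

# Look at the residuals, not the raw weights
qqnorm(res)
qqline(res)
print(shapiro.test(res))  # Shapiro-Wilk used here instead of Anderson-Darling
```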

#### GretaGarbo

##### Human
trinker wrote:
“...the two outliers leads me to wonder if this may be an occasion for a permutation test...”
That’s exactly what I was thinking of.

But let's see if Amana can pick out two extra observations.


#### gianmarco

##### TS Contributor
I was thinking quite the same, but I have kept myself from suggesting bootstrapping methods since (as far as I know) I am aware of some issues relative to the size of the samples to bootstrap.

Gm

#### Amana

##### New Member
Hi.
Thanks for the answers. Sorry for the confusion about sample sizes. Those were my original sample sizes, but unfortunately I lost one mouse from each group (they died). I was really unlucky with my mating and the sizes of the litters, so I have fewer mice than I expected.
I read somewhere that I should look at whether the distribution is normal in the population, not in the samples. In that case, weight is something that is normally distributed in the population of mice, and therefore I can use a 2-sample t-test?
Concerning bootstrapping and permutation: I don't know what they are, so I will have to look them up.

#### Amana

##### New Member
By the way. The two outliers were deviant from the beginning (but not really outliers). So I think it is a good idea to look at the weight gain in each mouse relative to the starting weight instead of the exact weight at each week.

#### GretaGarbo

##### Human
This is really interesting, I think. We (or Amana) have got a really small sample with some real difficulties, which is just what often happens in real investigations. (Suppose Amana had n=500 values in each group; then it would have been quite boring to discuss.)

Amana wrote:
“I read somewhere that I should look at if the distribution is normal in the population, not in the samples.”
Well, this is correct. I am just so used to checking the normality assumption with the sample that this is almost forgotten.

“In that case, weight is something that is normally distributed in the population of mice, and therefore I can use 2 sample t-test?”

I guess that the Veterinary school in Copenhagen (that’s my guess where Amana is) knows a lot about the weight of mice. So I believe that it is true that the weight of mice is normally distributed. Maybe it is possible to get some values for the standard deviation from a much larger sample, to get the “population” value for sigma. Then one could do a usual “z-test”. (I guess that the reader knows what I mean by that.)
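
To make the “z-test” idea concrete, here is a small sketch in R. The function name `z_test` and the value `sigma_pop = 1.0` are made up for illustration; nobody in this thread has supplied a population standard deviation, so the number must be replaced by a value from a much larger reference sample before the result means anything.

```r
# Week 4 weights in grams, as posted by Amana
x1 <- c(18.2, 18.6, 18.0, 18.4, 19.0, 16.5, 18.3, 18.6)        # group 1
x2 <- c(17.8, 17.2, 19.1, 16.5, 16.7, 16.6, 16.9, 16.6, 16.8)  # group 2

# Two-sample z-test with a KNOWN common population sd (hypothetical helper)
z_test <- function(a, b, sigma) {
  se <- sigma * sqrt(1 / length(a) + 1 / length(b))  # sd of the mean difference
  z  <- (mean(a) - mean(b)) / se
  c(z = z, p.value = 2 * pnorm(-abs(z)))             # two-sided p-value
}

sigma_pop <- 1.0  # HYPOTHETICAL placeholder, not a real value for mice
res <- z_test(x1, x2, sigma_pop)
print(res)
```

The only difference from the t-test is that sigma is treated as known, so the normal distribution is used instead of the t distribution.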

But another and more difficult question is if it is normally distributed in this particular experimental setting. The outliers warn us that something strange has happened.

Amana notes:

“I have divided them by sex, because it is evident than male mice gain more weight than females irrespectively of treatment group.”
So there is a difference between sexes. Amana has balanced that in the design, and that is good, but sex is still an explanatory factor.

“I was really unlucky with my mating and sizes of the litters, so I have fewer mice than I expected.”
Could there be a litter effect? Brothers and sisters are often quite similar. That could make some values quite similar and others become outliers. There seems to be a litter factor.

“but unfortunately I lost one mouse from each group (they died)”
We could think about this as a completely “outside” event, so that their values are missing completely at random (MCAR). But we can also suppose that if they had lived, then they would have had a low weight. In that case it would be a censored variable. We can imagine that their values would have been less than, say, 16. Then we can calculate the likelihood contribution, L(x<16; mu), for a parametric distribution.

Then these two mice would give us a little bit more information.
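
A minimal sketch of that censored likelihood in R, for group 1 only. It assumes a normal distribution for the weights, one lost mouse per group, and the threshold of 16 g mentioned above as a "say" value; all three are assumptions, and the function name `neg_loglik` is made up for this illustration.

```r
# Week 4 weights in grams for group 1, as posted by Amana
x1 <- c(18.2, 18.6, 18.0, 18.4, 19.0, 16.5, 18.3, 18.6)

# Negative log-likelihood: observed weights contribute the normal density,
# each censored mouse contributes only P(X < limit).
neg_loglik <- function(par, x, n_cens, limit) {
  mu    <- par[1]
  sigma <- exp(par[2])  # optimise log(sigma) so that sigma stays positive
  -(sum(dnorm(x, mu, sigma, log = TRUE)) +
      n_cens * pnorm(limit, mu, sigma, log.p = TRUE))
}

# One lost mouse, assumed to have weighed less than 16 g (assumption)
fit <- optim(c(mean(x1), log(sd(x1))), neg_loglik,
             x = x1, n_cens = 1, limit = 16)

mu_hat    <- fit$par[1]
sigma_hat <- exp(fit$par[2])
print(c(sample_mean = mean(x1), mu_hat = mu_hat, sigma_hat = sigma_hat))
```

The censored mouse pulls the estimated mean below the plain sample mean, which is the "little bit more information" it contributes.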

“The two outliers were deviant from the beginning (but not really outliers). So I think it is a good idea to look at the weight gain in each mouse relative to the starting weight instead of the exact weight at each week.”
I agree. It would be good if you showed the figures for the original weight and the “new weight”. Then one could calculate the difference if one wishes, or the relative change (weight_new/weight_old) or the log of that (or any other transformation).

Now, a fundamental question to Amana: did you randomise the allocation of each mouse to the two treatment groups? (By formally picking random numbers or tossing a coin, i.e. not just chosen arbitrarily. That is not "random".)

Maybe a little bit more complicated model is needed, but let us try to face the complications in reality.

Isn’t this very fascinating? We have got some clues. It is like solving a detective story!

I have written down these complications to ask the forum for advice for what is important and what is not important.


#### GretaGarbo

##### Human
I had hoped that someone would comment on the reflections above, whether they agree or disagree. But obviously I wrote so much that I made you fall asleep. Sorry about that!

I entered Amana's data in R.

R is open source software. It is free to download. There is a lot of information about R on this forum, including information on tutorials.

You can download R and run the commands that I show in the code below. (I am also using the Tinn-R editor.) But I don’t recommend that you try to learn R. It is too demanding if you just want to use it occasionally.

Amana can copy each row of code one at a time and paste it in into R and hit the enter key (if you are not familiar with R).

There are also a lot of people here who are really good in R. Having easy access to the data makes it easier for them to do alternatives to what I have done.

Code:
```r
x1 <- c(18.2, 18.6, 18.0, 18.4, 19.0, 16.5, 18.3, 18.6)        # group 1
x2 <- c(17.8, 17.2, 19.1, 16.5, 16.7, 16.6, 16.9, 16.6, 16.8)  # group 2

boxplot(x1, x2)
t.test(x1, x2)
# t = 2.7662, df = 14.997, p-value = 0.01441

hist(x1)
qqnorm(x1)
hist(x2)
qqnorm(x2)

# Mann-Whitney-Wilcoxon test
wilcox.test(x1, x2)
# haha:  W = 56.5, p-value = 0.05385  "nearly significant"

# permutations from Maria Rizzo page 218
# Rizzo M., Statistical computing with R, 2008
choose(8 + 9, 8)  # total number of different partitions of the sample

R <- 9999       # number of replicates
z <- c(x1, x2)  # pooled sample
K <- 1:17       # 8 + 9 sampled units

reps <- numeric(R)  # storage for replications
t0 <- t.test(x1, x2)$statistic

for (i in 1:R) {
  # generate indices for the first sample
  k <- sample(K, size = 8, replace = FALSE)
  x1g <- z[k]
  x2g <- z[-k]  # complement of x1g
  reps[i] <- t.test(x1g, x2g)$statistic  # store only the t statistic
}
# one-sided: share of statistics at least as large as the observed one
p <- mean(c(t0, reps) >= t0)

# ASL = the Achieved Significance Level
print(c("the Achieved Significance Level is:", p))
print(c("standard error of ASL is:", round(sqrt(p * (1 - p) / R), digits = 6)))

print(t0)
hist(as.numeric(reps), breaks = 100, xlab = "test results (observed value in red)", main = "Test under H0")
abline(v = t0, col = "red")  # add vertical line at x = 2.76
# the observed value t = 2.76 (in red) is far out to the right.
# Thus it is a significant difference.
```

Results:

The two samples deviate from normality. Outliers can be seen in the boxplot.
A t-test showed statistical “significance”, but that can be questioned due to the apparent non-normality.

A Wilcoxon-Mann-Whitney test did not give a significant result, but it was very close: p-value = 5.4%.

(A two samples Wilcoxon and Mann-Whitney is the same test.)

But how reliable is the original t-test? To investigate that, a permutation test was done.

(I just did the test 10,000 times. It took about 15 seconds!)

Imagine that there was no effect at all from the treatment. That is, H0 is true. Then the labels “group 1” and “group 2” would have no real meaning, since there would be no difference between the treatments. You could then re-label the mice in many ways, and in each case you could run a statistical test, for example the t-test. If we compute the t statistic for many such re-labellings (permutations), we get information about how the t statistic behaves under the null hypothesis for this data. Then we can compare the observed t statistic, from the correct labelling of the groups, with the whole distribution of hypothetical values. If the observed t statistic is far out in the tail (in the 5% area), then the test is statistically significant.

"the Achieved Significance Level is:" "0.0075"
"standard error of ASL is:" "0.000863"

Since the Achieved Significance Level of 0.0075 is less than 0.05, the result is statistically significant at the 5% level.

(Note that the Achieved Significance Level (ASL) will be different each time since it is based on random sampling.)
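
Because choose(17, 8) = 24310 is a small number, the randomness can be avoided by enumerating every partition once. The sketch below does that with `combn`, which is a swap for the random sampling used above, and it recomputes the Welch t statistic by hand (the helper `welch_t` is made up here) because 24310 calls to `t.test` are slow. It also reports a two-sided p-value, since the ASL above counts only the upper tail while the t-test p-value of 0.01441 is two-sided.

```r
# Week 4 weights in grams, as posted by Amana
x1 <- c(18.2, 18.6, 18.0, 18.4, 19.0, 16.5, 18.3, 18.6)        # group 1
x2 <- c(17.8, 17.2, 19.1, 16.5, 16.7, 16.6, 16.9, 16.6, 16.8)  # group 2

z  <- c(x1, x2)   # pooled sample
n1 <- length(x1)

# Welch t statistic, the same statistic that t.test() reports by default
welch_t <- function(a, b) {
  (mean(a) - mean(b)) / sqrt(var(a) / length(a) + var(b) / length(b))
}
t0 <- welch_t(x1, x2)  # observed statistic

# Every way of choosing which 8 of the 17 mice are labelled "group 1"
idx   <- combn(length(z), n1)  # one column per partition
t_all <- apply(idx, 2, function(k) welch_t(z[k], z[-k]))

tol   <- 1e-9  # guard against floating point ties with the observed value
p_one <- mean(t_all >= t0 - tol)            # upper tail only
p_two <- mean(abs(t_all) >= abs(t0) - tol)  # both tails

print(c(partitions = ncol(idx), one_sided = p_one, two_sided = p_two))
```

The one-sided value is the exact counterpart of the ASL above; the two-sided value is the one to compare with the t-test and Wilcoxon p-values.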

[Figure: histogram of the permutation t statistics under H0, with the observed value marked in red.]

The observed value t=2.76 (in red) is far out to the right.
Thus it is a significant difference.


#### gianmarco

##### TS Contributor
Hi Greta Garbo,
I very much welcome the fact that you found some time to perform the tests and to provide the related explanations. It will surely be useful for the user as well as for other members who jump into this thread in the future.
When I have the time (so rarely in these last months) I like to "play" with the data provided by the users.....

By the way, your profile image is one of the cutest I have ever seen.

Regards
Gm

#### GretaGarbo

##### Human
gianmarco wrote:
“I was thinking quite the same, but I have kept myself from suggesting bootstrapping methods since (as far as I know) I am aware of some issues relative to the size of the samples to bootstrap.”
I hope the sample size will not play any tricks on us. Since the permutation is done without replacement, all 17 units are selected in each round (but with different labels), so I hope it is OK. Suggestions are welcome.
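
To illustrate the difference in R: a permutation only reshuffles the 17 observed values, while a bootstrap draws with replacement, so some values are repeated and others are left out. The percentile interval at the end is just one simple bootstrap variant, chosen for this sketch; with only 8 and 9 observations such intervals tend to be somewhat too narrow, which may be the kind of small-sample issue gianmarco has in mind.

```r
# Week 4 weights in grams, as posted by Amana
x1 <- c(18.2, 18.6, 18.0, 18.4, 19.0, 16.5, 18.3, 18.6)        # group 1
x2 <- c(17.8, 17.2, 19.1, 16.5, 16.7, 16.6, 16.9, 16.6, 16.8)  # group 2

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
z <- c(x1, x2)

# Permutation: sampling WITHOUT replacement, every value is used exactly once
perm <- sample(z)
print(identical(sort(perm), sort(z)))

# Bootstrap: sampling WITH replacement within each group
B <- 2000  # number of bootstrap replicates (arbitrary choice)
boot_diff <- replicate(B, mean(sample(x1, replace = TRUE)) -
                          mean(sample(x2, replace = TRUE)))

# Simple percentile interval for the difference in mean weight
ci <- quantile(boot_diff, c(0.025, 0.975))
print(ci)
```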

I mean, the reason for doing this is to learn something (and I am not experienced with permutation tests).

#### Dason

GretaGarbo wrote:
“You can download R and run the command that I showed in the code below. (I am also using the Tinn-R editor)”
Tinn-R eh? Interesting choice. I've never been a big fan of it. Have you tried RStudio? It's very nice.

#### Amana

##### New Member
Hi.
Thanks for the answers. I have been gone for some days, but now I am back in the office.

I am using Minitab. I tried R, but I feel Minitab is much easier to use.

I have not yet calculated anything on litter effect. There could be one. Large litters have smaller pups and small litters have larger pups. Also, there could be some genetic diseases transmitted so that all pups in one litter become smaller (I have some problems with physical development, and that could interfere with my results). I euthanized one pup due to underdevelopment of an eye and very slow weight gain.

I randomized all mice into treatment groups. They were all given a number in the ear and then I randomized them into treatment, with the following limitations: I had to separate females and males (so that females would not become pregnant), there should not be more than 2 days between the days of birth of the mice in each cage, and treatment had to be done at cage level (all mice in each cage received the same treatment).

I have made some calculations on weight gain compared to starting weight (right before treatment). When doing that, I can see no significant difference in weight gain. I think that is a more appropriate method, since they don't weigh the same from the beginning. Even though it is not a significant difference from the beginning, it is a factor to take into account.

I will attach my dataset with all the weights. I have weighed them once a week and calculated the weight gain in percent and the absolute weight gain in grams. I am not sure which one of these is the most appropriate.
I will do some research about statistics for weight gain in mice...

#### GretaGarbo

##### Human
So all the males were taken away from the experiment?

Did you randomise each individual mouse to treatment, or was it the whole cage and the mice in it that was randomised to treatment?

So the group variable is the treatment variable?

Data are from d27.

Weight d0 was lost on mouse: 16 46. Baseline value seems gone. What happened?

#### gianmarco

##### TS Contributor
Hi Greta Garbo,
I am wondering the following.

In a previous post of mine, I wrote:
“I was thinking quite the same, but I have kept myself from suggesting bootstrapping methods since (as far as I know) I am aware of some issues relative to the size of the samples to bootstrap.”
I am wondering if my phrasing turned up to sound unpolite/rude?
I am not a native English speaker, so I apologize if there was unintentionally something wrong with my reply.

Regards,
Gm

#### Amana

##### New Member
GretaGarbo wrote:
“So all the males were taken away from the experiment?”

“Did you randomise each individual mouse to treatment, or was it the whole cage and the mice in it that was randomised to treatment?”

“So the group variable is the treatment variable?”

“Data are from d27.”

“Weight d0 was lost on mouse: 16 46. Baseline value seems gone. What happened?”

No, the males are also included. But there are only 4 + 3 males, so I have an even bigger problem with sample sizes there. I just picked out the females because there are more of them to work with. I will do the same calculations for the males.

My randomization was a little bit complicated.
First I divided them into females and males.
The mice had numbers in their ears and were put into order.
I wrote the treatments into a randomization program. If I had 10 mice I wrote 5 ob and 5 B6 (the two treatments). Each mouse got the treatment that came up for it in the order produced by the program.
Then they were collected into cages based on treatment and birthday (mice born no more than 2 days apart could go into the same cage). If there were more than 5 mice, these were further divided into several cages (in a random order).

Yes, the group variable is the treatment variable.

Mice 16 and 46 were mixed up at day 0. I am inexperienced in reading ear clips in mice, so I got confused about the mouse numbers. Therefore, I don't trust my data from day 0 for these two (I have two weights, but don't know which of the two mice each one belongs to).

Number 3 escaped from the cage (not a long distance, but since my experiment has to do with bacteria it could have gotten infected from other mice, so it was excluded from the study).
Number 15 was underdeveloped and euthanized.

#### GretaGarbo

##### Human
I don’t know how to evaluate this experiment. I am just trying to make suggestions to this community, and then we will see if anybody agrees.

@Amana. So you have some seven extra values for the males. Why don’t you include those values in the data table (the Excel file)? Possibly one could evaluate sex as an extra factor together with the treatment factor. That would be like a two-way layout. Of course it is better to have 17+7 units than just 17 units.

One mouse escaped, so that one is missing completely at random (MCAR). (And also literally “missing” for a while.)
One died early, so that one is MCAR. But if it had died late in the process it could have been due to the treatment and should have been evaluated as a censored value (e.g. weight less than 16). The MCAR values can be ignored.

Two values were mixed up. OK, such things happen (more often than we would like to know). But Amana admits it (and that’s a strength), is a real scientist and doesn’t want to do anything incorrect, and that’s a commendable attitude.

But what can we do with two mixed up values, I ask the community?
We can a) throw away all information for these mice.
We can b) impute the two missing values from the other values. (From the two values or from all values.)
We can c) randomise the two values to each of the mice. That procedure could be defended on frequentist reasoning that in the long run it will be correct.
We could d) write down the likelihood for the experiment. We have the likelihood for each of the other data points, and then the likelihood is one half for each of the two possible assignments of the mixed-up values. (This sounds like a Bayesian vagueness argument but it is not intended to be that.) It would be a little bit technically demanding, with non-standard software and all that, but in principle it would be possible.

So I suggest that Amana report the other two values “by the side”, so that we can see what it looks like and what the other participants in this community say.

Is the “ob” the non-treated and the “B6” the treatment?

Gianmarco wrote:
“I am wondering if my phrasing turned up to sound unpolite/rude?”
No no, of course not! But if you have views or “are suspicious” about small samples in permutation test or bootstrapping, please tell us about that. (I must say that I have no idea about small samples restrictions in bootstrapping.)

gianmarco wrote:
“I am not a native English speaker, “
Neither am I, so I take the opportunity to apologize the English-speaking people for miss-use of their language. (Hmm, I don’t think I got that English correct.) And I also apologize for any unintended rudeness!

#### gianmarco

##### TS Contributor
GretaGarbo wrote:
“But if you have views or ‘are suspicious’ about small samples in permutation test or bootstrapping, please tell us about that.”
In this PDF there is a small section on "Bootstrapping small samples". Hope this can prove to be useful.

Best Regards,
Gm