# testing the bootstrap

#### rogojel

##### TS Contributor
hi,
below is an experiment I just did testing whether using a bootstrap I could get better results as with simple repetead sampling. Is there any error in the logic/code? It seems that bootstrapping would just amplify the sampling error without adding any value - what do I miss?

This is actually an R Notebook file, I just did not find a better way to include it here.

Bootstrap with t-tests

Being fascniated by the possibilities of simulations in statistics I would like to test some ideas of using the bootstrap with statistical tests. For illustration I picked the simple case of two-sample t-tests. My question is if we can get a better view of the test results by using the bootstrap. The real-life situation would be, to gather one sample, possibly a small one, and try to make the results more conclusive by bootstrapping.

So, first, let us take two samples from two normally distributed populations that differ in the mean only. According to the sample size calculations, we have a power of roughly 80% to detect a difference. Let us repeat the sampling 1000 times and check the p-value distribution.

Code:
Len=1000
pval=numeric(Len)
for(i in 1:Len){
x=rnorm(17, 1, 1)
y=rnorm(17,2,1)
pval[i]=t.test(x,y, alternative="two.sided")$p.value } x=hist(pval, breaks=20) print(sprintf("Percent of tests with rejected NULL %.3f", x$counts/Len))

##  "Percent of tests with rejected NULL 0.812"
This is all nicely according to the theory. Now, let us take the case of a test, that accidentally has a higher then 0.05 pvalue even if the population means differ by 1 unit.
Code:
#First let us find a sample like this
Len=1000
x=numeric(Len)
y=numeric(Len)
for(i in 1:Len){
x<<-rnorm(17,1,1)
y<<-rnorm(17,2,1)
if(t.test(x,y,alternative="two.sided")$p.value>0.1){ break } } print(sprintf("So, the %d th trial resulted in a p-value of %f", i, t.test(x,y,alternative="two.sided")$p.value))

##  "So, the 14 th trial resulted in a p-value of 0.297243"
Now, let us see if bootstrapping this one particular sample will give us an idea that this is nothing but a sampling error?
Code:
Len=500
pval=numeric(Len)
for(i in 1:Len){
x1=sample(x,17, replace=TRUE)
y1=sample(y,17,replace=TRUE)
pval[i]=t.test(x1,y1,alternative="two.sided")$p.value } x=hist(pval, breaks=20) print(sprintf("Percent of tests with rejected NULL %.3f", x$counts/Len))

##  "Percent of tests with rejected NULL 0.206"
So, it seems that in case of an moderately low p-value we can get an indication that there is a difference between the two groups, just by using the bootstrap, without the need of taking a new sample.

Let us see what hapens if the -value is much larger, though the difference in the populations is there.

Code:
Len=1000
x=numeric(Len)
y=numeric(Len)
for(i in 1:Len){
x<<-rnorm(17,1,1)
y<<-rnorm(17,2,1)
if(t.test(x,y,alternative="two.sided")$p.value>0.3){ break } } print(sprintf("So, the %d th trial resulted in a p-value of %f", i, t.test(x,y,alternative="two.sided")$p.value))

##  "So, the 17 th trial resulted in a p-value of 0.429053"
Does the bootstrap help?
Code:
Len=500
pval=numeric(Len)
for(i in 1:Len){
x1=sample(x,17, replace=TRUE)
y1=sample(y,17,replace=TRUE)
pval[i]=t.test(x1,y1,alternative="two.sided")$p.value } x=hist(pval, breaks=20) print(sprintf("Percent of tests with rejected NULL %.3f", x$counts/Len))

##  "Percent of tests with rejected NULL 0.156"
Nope, for extreme samples it will not help. Resampling will be the only remedy here. Interestingly, we got a sample that seems to conclusively demonstrate that there is no difference between the two populations, although there is one. This occurred with a probability of 1% roughly, so actually, I would conclude that small small samples are not really trustworthy, whatever the power calculation.

How about false alarms? Let us take two samples that are quite similar, the difference in mean values being 0.2 and repeat the exercise.
Code:
Len=1000
pval=numeric(Len)
for(i in 1:Len){
x=rnorm(17, 1, 1)
y=rnorm(17,1.2,1)
pval[i]=t.test(x,y, alternative="two.sided")$p.value } x=hist(pval, breaks=20) print(sprintf("Percent of tests with rejected NULL %.3f", x$counts/Len))

##  "Percent of tests with rejected NULL 0.108"
Now let us take two samples where the p-value is about 0.1 and see if we can use the bootstrap to determine that the two populations are in fact similar? The expectation would be to get a lot less tests with a p-value of less then 0.05 then in the case of the actually different populations.
Code:
Len=1000
x=numeric(Len)
y=numeric(Len)
for(i in 1:Len){
x<<-rnorm(17,1,1)
y<<-rnorm(17,2,1)
if(t.test(x,y,alternative="two.sided")$p.value>0.1){ break } } print(sprintf("So, the %d th trial resulted in a p-value of %f", i, t.test(x,y,alternative="two.sided")$p.value))

##  "So, the 3 th trial resulted in a p-value of 0.111358"
Now let us see the bootstrap results.
Code:
Len=500
pval=numeric(Len)
for(i in 1:Len){
x1=sample(x,17, replace=TRUE)
y1=sample(y,17,replace=TRUE)
pval[i]=t.test(x1,y1,alternative="two.sided")$p.value } x=hist(pval, breaks=20) print(sprintf("Percent of tests with rejected NULL %.3f", x$counts/Len))

##  "Percent of tests with rejected NULL 0.404"
And no! So the bootstrap will simply see the imbalance in the sampling and is unable to distinguish between samples that come from populations with different means or similar ones - it will only amplify the sampling error.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
Side note, if I remember correctly the standard sampling approaching are a little bit off, that is why Efron came up with the 0.632+ bootstrap version.

#### hlsmith

##### Less is more. Stay pure. Stay poor.
This form of the bootstrap is most often used within cross-validation to address that the observations within the folds are not completely independent.