GretaGarbo (04-21-2017)
hi,
below is an experiment I just did testing whether using a bootstrap I could get better results as with simple repetead sampling. Is there any error in the logic/code? It seems that bootstrapping would just amplify the sampling error without adding any value - what do I miss?
This is actually an R Notebook file, I just did not find a better way to include it here.
Bootstrap with t-tests
Being fascniated by the possibilities of simulations in statistics I would like to test some ideas of using the bootstrap with statistical tests. For illustration I picked the simple case of two-sample t-tests. My question is if we can get a better view of the test results by using the bootstrap. The real-life situation would be, to gather one sample, possibly a small one, and try to make the results more conclusive by bootstrapping.
So, first, let us take two samples from two normally distributed populations that differ in the mean only. According to the sample size calculations, we have a power of roughly 80% to detect a difference. Let us repeat the sampling 1000 times and check the p-value distribution.
This is all nicely according to the theory. Now, let us take the case of a test, that accidentally has a higher then 0.05 pvalue even if the population means differ by 1 unit.Code:Len=1000 pval=numeric(Len) for(i in 1:Len){ x=rnorm(17, 1, 1) y=rnorm(17,2,1) pval[i]=t.test(x,y, alternative="two.sided")$p.value } x=hist(pval, breaks=20) print(sprintf("Percent of tests with rejected NULL %.3f", x$counts[1]/Len)) ## [1] "Percent of tests with rejected NULL 0.812"
Now, let us see if bootstrapping this one particular sample will give us an idea that this is nothing but a sampling error?Code:#First let us find a sample like this Len=1000 x=numeric(Len) y=numeric(Len) for(i in 1:Len){ x<<-rnorm(17,1,1) y<<-rnorm(17,2,1) if(t.test(x,y,alternative="two.sided")$p.value>0.1){ break } } print(sprintf("So, the %d th trial resulted in a p-value of %f", i, t.test(x,y,alternative="two.sided")$p.value)) ## [1] "So, the 14 th trial resulted in a p-value of 0.297243"
So, it seems that in case of an moderately low p-value we can get an indication that there is a difference between the two groups, just by using the bootstrap, without the need of taking a new sample.Code:Len=500 pval=numeric(Len) for(i in 1:Len){ x1=sample(x,17, replace=TRUE) y1=sample(y,17,replace=TRUE) pval[i]=t.test(x1,y1,alternative="two.sided")$p.value } x=hist(pval, breaks=20) print(sprintf("Percent of tests with rejected NULL %.3f", x$counts[1]/Len)) ## [1] "Percent of tests with rejected NULL 0.206"
Let us see what hapens if the -value is much larger, though the difference in the populations is there.
Does the bootstrap help?Code:Len=1000 x=numeric(Len) y=numeric(Len) for(i in 1:Len){ x<<-rnorm(17,1,1) y<<-rnorm(17,2,1) if(t.test(x,y,alternative="two.sided")$p.value>0.3){ break } } print(sprintf("So, the %d th trial resulted in a p-value of %f", i, t.test(x,y,alternative="two.sided")$p.value)) ## [1] "So, the 17 th trial resulted in a p-value of 0.429053"
Nope, for extreme samples it will not help. Resampling will be the only remedy here. Interestingly, we got a sample that seems to conclusively demonstrate that there is no difference between the two populations, although there is one. This occurred with a probability of 1% roughly, so actually, I would conclude that small small samples are not really trustworthy, whatever the power calculation.Code:Len=500 pval=numeric(Len) for(i in 1:Len){ x1=sample(x,17, replace=TRUE) y1=sample(y,17,replace=TRUE) pval[i]=t.test(x1,y1,alternative="two.sided")$p.value } x=hist(pval, breaks=20) print(sprintf("Percent of tests with rejected NULL %.3f", x$counts[1]/Len)) ## [1] "Percent of tests with rejected NULL 0.156"
How about false alarms? Let us take two samples that are quite similar, the difference in mean values being 0.2 and repeat the exercise.
Now let us take two samples where the p-value is about 0.1 and see if we can use the bootstrap to determine that the two populations are in fact similar? The expectation would be to get a lot less tests with a p-value of less then 0.05 then in the case of the actually different populations.Code:Len=1000 pval=numeric(Len) for(i in 1:Len){ x=rnorm(17, 1, 1) y=rnorm(17,1.2,1) pval[i]=t.test(x,y, alternative="two.sided")$p.value } x=hist(pval, breaks=20) print(sprintf("Percent of tests with rejected NULL %.3f", x$counts[1]/Len)) ## [1] "Percent of tests with rejected NULL 0.108"
Now let us see the bootstrap results.Code:Len=1000 x=numeric(Len) y=numeric(Len) for(i in 1:Len){ x<<-rnorm(17,1,1) y<<-rnorm(17,2,1) if(t.test(x,y,alternative="two.sided")$p.value>0.1){ break } } print(sprintf("So, the %d th trial resulted in a p-value of %f", i, t.test(x,y,alternative="two.sided")$p.value)) ## [1] "So, the 3 th trial resulted in a p-value of 0.111358"
And no! So the bootstrap will simply see the imbalance in the sampling and is unable to distinguish between samples that come from populations with different means or similar ones - it will only amplify the sampling error.Code:Len=500 pval=numeric(Len) for(i in 1:Len){ x1=sample(x,17, replace=TRUE) y1=sample(y,17,replace=TRUE) pval[i]=t.test(x1,y1,alternative="two.sided")$p.value } x=hist(pval, breaks=20) print(sprintf("Percent of tests with rejected NULL %.3f", x$counts[1]/Len)) ## [1] "Percent of tests with rejected NULL 0.404"
GretaGarbo (04-21-2017)
Side note, if I remember correctly the standard sampling approaching are a little bit off, that is why Efron came up with the 0.632+ bootstrap version.
Stop cowardice, ban guns!
GretaGarbo (04-21-2017), rogojel (04-21-2017)
Hi,
what is that?
Well it is a type of bias correction version. I know it exists, but haven't used it. See this PowerPoint for a little more information on it along with hypothesis testing.
http://www.biometrische-gesellschaft...any_Part_I.pdf
Stop cowardice, ban guns!
This form of the bootstrap is most often used within cross-validation to address that the observations within the folds are not completely independent.
Stop cowardice, ban guns!
Tweet |