CI for p doesn't match hypothesis test.


I knew it was possible, but I had never actually run into it with real data: the confidence interval for p produced by R didn't agree with the result of the hypothesis test. The p-value is below 0.05, yet the 95% interval contains the null value 0.2011. binom.test builds its confidence interval with the Clopper-Pearson method, while its two-sided p-value is computed differently, so the two don't have to agree.

Here is the output for anybody interested. Like I said, I knew it could happen in theory, but I laughed when it finally happened to me, so I thought I would share.

> binom.test(12, 99, .2011)

	Exact binomial test

data:  12 and 99
number of successes = 12, number of trials = 99, p-value = 0.04525
alternative hypothesis: true probability of success is not equal to 0.2011
95 percent confidence interval:
 0.06422779 0.20216260
sample estimates:
probability of success 
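
For anyone who wants to see where the disagreement comes from, here is a minimal sketch in base R. It assumes, as far as I can tell from the source of binom.test, that the two-sided p-value adds up the probabilities of every outcome no more likely than the observed one, while the Clopper-Pearson interval corresponds to doubling the smaller tail. The variable names and the small tolerance factor are my own choices.

```r
x  <- 12      # observed successes
n  <- 99      # trials
p0 <- 0.2011  # null value

# Two-sided p-value by the "no more likely than observed" rule
# (assumed to be what binom.test reports)
probs        <- dbinom(0:n, n, p0)
d_obs        <- dbinom(x, n, p0)
p_likelihood <- sum(probs[probs <= d_obs * (1 + 1e-7)])

# Two-sided p-value that matches the Clopper-Pearson interval:
# twice the smaller of the two tail probabilities
lower_tail   <- pbinom(x, n, p0)                          # P(X <= x)
upper_tail   <- pbinom(x - 1, n, p0, lower.tail = FALSE)  # P(X >= x)
p_twice_tail <- min(1, 2 * min(lower_tail, upper_tail))

# Clopper-Pearson 95% limits from beta quantiles
alpha <- 0.05
cp <- c(qbeta(alpha / 2, x, n - x + 1),
        qbeta(1 - alpha / 2, x + 1, n - x))

print(c(likelihood = p_likelihood, twice_tail = p_twice_tail))
print(cp)
```

If the sketch is right, p_likelihood comes out below 0.05 while p_twice_tail stays above it, which is the same mismatch as in the output above.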
I would have run this classical one (the asymptotic method):

(These are a little off topic, since here the CI and the test agree, but they are all tests about a binomial proportion.)


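library(binom)  # provides binom.confint(); install.packages("binom") if needed

# the "asymptotic" (Wald) method
phat <- 12/99
se   <- sqrt(phat*(1-phat)/99)
lo   <- phat - 1.96*se
hi   <- phat + 1.96*se
c(lo, hi)  # Wald 95% interval

# Wald z statistic for H0: p = 0.2011, using the same se as the interval
(phat - 0.2011)/se

# exact test with the Clopper-Pearson interval
binom.test(12, 99, .2011)

binom.confint(12, 99, conf.level = 0.95, methods = "exact")
binom.confint(12, 99, conf.level = 0.95, methods = "all")

# two-sample comparison of proportions (12/99 vs 5/99)
prop.test(c(12, 5), c(99, 99))

# prop.test mentioned in: Chihara and Hesterberg,
# "Mathematical Statistics with Resampling and R", 2011, page 195

# Brown, Cai and DasGupta, "Interval Estimation for a Binomial Proportion"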
But then I came across the “binom.confint” function.

And the function “prop.test” was mentioned in Chihara and Hesterberg.

Brown et al. had this fun conclusion:

“The standard Wald interval is in nearly universal use. We first show that the performance
of this standard interval is persistently chaotic and unacceptably poor. Indeed its coverage
properties defy all conventional wisdom, much more than is presently widely understood.
The performance is so erratic and the qualifications given in the influential texts are so
defective, that the standard interval should not be used. We provide a fairly comprehensive
evaluation of many natural alternative intervals. Based on this analysis, we recommend the
Wilson or the equal-tailed Jeffrey prior interval for small n (n <= 40), and the Agresti-Coull
interval for n >= 40. Even for small sample sizes the easy to present Agresti-Coull interval
is much preferable to the standard one.”

Lawrence D. Brown, T. Tony Cai and Anirban DasGupta
“Interval Estimation for a Binomial Proportion”
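
As a rough sketch of the two intervals that quote recommends, here are the Wilson and Agresti-Coull intervals for the 12 out of 99 example, written out in base R. The formulas are the standard textbook ones as I understand them, not copied from the paper, and the variable names are just illustrative.

```r
x <- 12
n <- 99
z <- qnorm(0.975)  # two-sided 95% critical value
phat <- x / n

# Wilson (score) interval
center <- (phat + z^2 / (2 * n)) / (1 + z^2 / n)
half   <- z * sqrt(phat * (1 - phat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
wilson <- c(center - half, center + half)

# Agresti-Coull interval: a Wald-type interval around an adjusted estimate
n_tilde <- n + z^2
p_tilde <- (x + z^2 / 2) / n_tilde
half_ac <- z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
agresti_coull <- c(p_tilde - half_ac, p_tilde + half_ac)

print(rbind(wilson, agresti_coull))
```

The Wilson limits should agree with prop.test(12, 99, correct = FALSE)$conf.int. Both intervals share the same center, and the Agresti-Coull one is slightly wider.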