# confidence interval for combined data

#### Trey

##### New Member
Hi everyone,

I'm new here, so forgive me if I don't get something quite right. I have a question about how best to do a confidence analysis of a set of data that is sliced pretty thinly. I've looked around on the web and usenet for a good place to ask, and folks around here seem to know what they are talking about. The situation is sufficiently complicated that I can't even figure out how to go about looking for the right way to analyze it. Any help or pointers would be appreciated.

So, I have a number of documents and a black-box analyzer that assigns scores from 0 to 100 to each document. Each document also has been categorized by a human reader into two categories, which I'll call "good" and "bad". In general, higher scores from the black-box analyzer correlate with a higher rate of "good" categorizations, but of course the correlation isn't perfect.

The data is sliced by analyzer score and the accuracy at each score is calculated. Taking the top three scores, it might be:

(table 1)
score: 100; 250 documents, of which 225 are "good", so the accuracy estimate is 90%
score: 99; 500 documents, of which 400 are "good", so the accuracy estimate is 80%
score: 98; 1000 documents, of which 600 are "good", so the accuracy estimate is 60%
etc...

So, if I want a confidence interval for each, I've been using the confidence interval for binomial probabilities (that is, the "Normal approximation interval" at http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval ). So far, so good, I think (though tell me if I'm wrong, please).
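The normal-approximation ("Wald") interval is easy to compute directly as a sanity check; a minimal Python sketch (the function name is mine), applied to the score-100 row of table 1:

```python
import math

def normal_approx_ci(good, n, z=1.96):
    """Normal-approximation ("Wald") interval for a binomial proportion:
    p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p = good / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# table 1, score 100: 225 "good" out of 250 documents
lo, hi = normal_approx_ci(225, 250)
print(f"point estimate 90%, 95% CI ({lo:.3f}, {hi:.3f})")
```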

I also need to slice the data cumulatively by score, so "99+" means 99 or higher, etc:

(table 2)
score: 100; 250 documents, of which 225 are "good", so the cumulative accuracy estimate is 90%
score: 99+; 750 documents, of which 625 are "good", so the cumulative accuracy estimate is 83.3%
score: 98+; 1750 documents, of which 1225 are "good", so the cumulative accuracy estimate is 70%
etc...

I can certainly plug the values into the confidence-interval calculations, and it even seems like it might be right, but I fear I'm missing something.
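For what it's worth, pooling the raw counts cumulatively and then applying the same interval formula is exactly this mechanical; a small sketch reproducing table 2's cumulative rates from table 1's per-score counts:

```python
from itertools import accumulate

# per-score counts from table 1 (highest score first)
docs = [250, 500, 1000]
good = [225, 400, 600]

cum_docs = list(accumulate(docs))  # [250, 750, 1750]
cum_good = list(accumulate(good))  # [225, 625, 1225]

for score, n, g in zip(["100", "99+", "98+"], cum_docs, cum_good):
    print(f"score {score}: {g}/{n} = {g / n:.1%}")
```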

From there, things get complicated. Sometimes the documents are sampled. Sampling is random, but not at the same rate for all scores (don't ask why.. ugh). So, we might see data like this:

(table 3)
score: 100; 250 documents, of which 100 were reviewed and 90 are "good", so the accuracy estimate is 90%
score: 99; 500 documents, of which 100 were reviewed and 80 are "good", so the accuracy estimate is 80%
score: 98; 1000 documents, of which 150 were reviewed and 90 are "good", so the accuracy estimate is 60%
etc...

Clearly with sampling, the confidence intervals will be larger.

It is straightforward, though tedious, to extrapolate from each row in the table to the estimated number of "good" documents at each score. At score 100, 90% accuracy on 250 documents gives an estimate of 225 "good" documents; at score 99, 80% accuracy on 500 documents gives an estimate of 400 "good" documents; so at a cumulative score of "99+" we have an estimate of 625 "good" documents.

(In this case the numbers are ridiculously consistent, so the cumulative table would be the same as table 2 above.)

The documents with a score of 100 are sampled at a rate of 40%, those with a score of 99 are sampled at a rate of 20%, those with a score of 98 are sampled at a rate of 15%. My question now is how do you give confidence intervals for the cumulative sampled accuracy estimates? Given the need to extrapolate at each score level in order to properly weight the contribution based on the number of documents it represents, I am completely without a clue as to how to proceed with combining the confidence intervals.
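The extrapolate-and-weight procedure described above is the textbook setup for stratified sampling, with each score level as a stratum sampled at its own rate. A sketch of the standard weighted estimate and its variance, assuming simple random sampling within each score level (the function name and the choice to include a finite population correction are mine, not from this thread):

```python
import math

def stratified_ci(strata, z=1.96):
    """CI for the overall proportion of "good" documents when each
    score level (stratum) is sampled at its own rate.

    strata: list of (N_h, n_h, g_h) = (documents at this score,
            documents reviewed, reviewed documents judged "good").
    """
    N = sum(N_h for N_h, _, _ in strata)
    # weighted estimate: each stratum contributes in proportion to its size
    p = sum(N_h * (g_h / n_h) for N_h, n_h, g_h in strata) / N
    var = 0.0
    for N_h, n_h, g_h in strata:
        p_h = g_h / n_h
        fpc = 1 - n_h / N_h  # finite population correction
        var += (N_h / N) ** 2 * fpc * p_h * (1 - p_h) / n_h
    half = z * math.sqrt(var)
    return p, p - half, p + half

# table 3, cumulative "98+": three strata with different sampling rates
p, lo, hi = stratified_ci([(250, 100, 90), (500, 100, 80), (1000, 150, 90)])
print(f"{p:.3f} ({lo:.3f}, {hi:.3f})")
```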

Also, any advice or concerns on how to proceed with small numbers (e.g., there are 100 documents, sampled at a rate of 5%) would be much appreciated. For the non-sampled reports I use the Wilson score interval (see the Wilson score interval section on the Wikipedia page above).
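The Wilson interval is also straightforward to compute directly; a sketch (function name mine), here applied to a deliberately tiny sample of 5 reviewed documents, where the normal approximation would misbehave:

```python
import math

def wilson_ci(good, n, z=1.96):
    """Wilson score interval for a binomial proportion; better behaved
    than the normal approximation when n is small or p is near 0 or 1."""
    p = good / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 5 documents reviewed, 4 judged "good"
lo, hi = wilson_ci(4, 5)
print(f"({lo:.3f}, {hi:.3f})")
```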

Thanks a lot,
-Trey

#### Trey

##### New Member
So, is my question too difficult to answer easily, or too silly to be bothered with?

#### TheEcologist

##### Global Moderator
So, is my question too difficult to answer easily, or too silly to be bothered with?
Okay, so you have 'lots' of data on documents, each with a score 0 - 100 calculated through some unknown algorithm. You then subset these by score (e.g. the 100s, the 99s, etc.) and see whether a human scored them good or bad?

You never really explicitly state what you want to know (that's probably why nobody replied), so I'm guessing you want to know whether your method for confidence intervals is a good one?

Your method rests on the assumption that the binomial distribution describes the variance per subset (and that seems reasonable to me). An alternative would be to bootstrap the confidence intervals. That method is superior in my opinion because you make fewer assumptions (you only assume that the central limit theorem holds, which is quite a safe bet).

See:
http://www.modelselection.org/bootstrap/

The bootstrap is also applicable when documents are sampled.
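A minimal percentile-bootstrap sketch in Python, treating each reviewed document as a 0/1 draw (the function name, seed, and 10,000-resample count are illustrative choices, not anything prescribed in this thread):

```python
import random

def bootstrap_ci(sample, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a proportion: resample the reviewed
    documents with replacement and take empirical quantiles of the
    resampled proportions."""
    rng = random.Random(seed)
    n = len(sample)
    props = sorted(sum(rng.choices(sample, k=n)) / n for _ in range(n_boot))
    lo = props[int(alpha / 2 * n_boot)]
    hi = props[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# table 1, score 100: 225 "good" (1) and 25 "bad" (0) documents
sample = [1] * 225 + [0] * 25
lo, hi = bootstrap_ci(sample)
print(f"({lo:.3f}, {hi:.3f})")
```

The resulting interval should land close to the normal-approximation one for a sample this large.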

#### Trey

##### New Member
You never really explicitly state what you want to know (that's probably why nobody replied), so I'm guessing you want to know whether your method for confidence intervals is a good one?
Thanks for letting me know I need to clarify. My first question was whether my most basic assumption about computing the confidence interval was correct. You seem to agree that it is not unreasonable, so that's good news.

My much more complicated problem is at the end of the original message, where the different subsets are sampled at different rates for categorization by a human judge as "good" or "bad", and I want to generate a confidence interval on the percentage of "good" documents for the cumulative subsets, which combine the stratified subsets that are sampled differently. I'm not sure how to combine the different sample rates to come up with an overall confidence interval.

An alternative method would be to bootstrap confidence intervals. This method is superior in my opinion as you make less assumptions (you only assume that the central limit theory holds true, which is quite a safe bet).

See:
http://www.modelselection.org/bootstrap/
Okay... I had a quick look at the bootstrap methods. The first thing that jumped out at me was the large number of resamples. Somehow I'm guessing that doing 10,000 random samples with replacement over a set of 100 or fewer documents is going to give unreliable results. But I'll keep reading. Thanks!

Any other guidance from anyone else would still be much appreciated. Thanks!


#### TheEcologist

##### Global Moderator
Okay... I had a quick look at the bootstrap methods. The first thing that jumped out at me was the large number of resamples. Somehow I'm guessing that doing 10,000 random samples with replacement over a set of 100 or fewer documents is going to give unreliable results. But I'll keep reading. Thanks!
No, you can still bootstrap with 100 samples (that's fine); you can even bootstrap with 10. It's just that the fewer samples you have, the more doubts you should have about your CI; with a sample of 10 it's nearly meaningless. You see, your sample should be a good representation of the actual population, so thinking of this in terms of 'doing 10,000 random samples with replacement over a set of 100 or fewer' is folly. You should ask, 'is my sample of x representative of the population?' For instance: a sample of 100 from a population of 1,000 is very good and, if sampled randomly and without bias, highly representative. A sample of 10 from a population of 1,000,000 is much more suspect even if sampled randomly and without bias.

Secondly, the binomial variance is a function of sample size (the variance of the count is np(1-p), so the variance of the estimated proportion is p(1-p)/n), so it should account for any uncertainty caused by a low sample size (ergo, large CIs). A nice (free) program to check your CIs is http://www.eco.rug.nl/~knypstra/Pqrs.html from the University of Groningen.
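To illustrate the point, the half-width of the normal-approximation interval, z * sqrt(p(1-p)/n), widens as n drops (p = 0.8 here is just an arbitrary example value):

```python
import math

# half-width of the 95% normal-approximation CI at p = 0.8
# for shrinking sample sizes: the interval widens as n drops
for n in (1000, 100, 10):
    half = 1.96 * math.sqrt(0.8 * 0.2 / n)
    print(f"n = {n:4d}: +/- {half:.3f}")
```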

#### Trey

##### New Member
Thanks for the explanation Ecologist.. the picture becomes clearer, but still far from crystal.

Unfortunately I have no control over the size of the samples available to me. I know full well that the samples are not particularly representative because they are small. I want to be able to calculate a meaningful confidence interval so I can tell the people who might be using these numbers as if they were both meaningful and precise how unmeaningful and imprecise they really are. (Some of them won't just take my word for it.) They are comparing an estimated accuracy of 93.4% to an estimated accuracy of 92.7% and saying things like, "Hey, you promised this would be at least 93%! You failed!" And I'm trying to get them to understand that when the confidence interval for both numbers is on the order of +/- 10%, the two numbers are statistically indistinguishable. They almost get that, but now they need actual confidence intervals.

So, getting away from the psychology of the situation... a simplified example.

If I have set A, with 1,000 members, and subset A', the 100 members of A for which I have human category judgments; and I have set B, with 2,000 members, and subset B', the 800 members for which I have human category judgments, how do I do my random sampling from the union of A' and B'? A' is a 10% sample, B' is a 40% sample. Do I sample A' four times as often? That seems quite wrong, intuitively. (OTOH, I don't expect my intuitions on bootstrapping to be particularly good, so please correct me.)

I know that taking a sub-sub-sample B'' that was 25% of B' (and thus 10% of B) would even things out, and union(A',B'') would be a random 10% sample of union(A,B). But in my case, given that I can't control the rate of human review for the subsets, if I normalized all my samples to the smallest sample rate I'd probably end up with single documents (or fractional documents!) being the entire sample for certain subsets. I also think that in that case I could go back to the binomial CI using the Wilson score interval (which is less computationally expensive).
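For what it's worth, one common variant (an assumption on my part, not something spelled out in this thread) is the stratified bootstrap: resample each reviewed subset with replacement at its own size — no oversampling of A' and no discarding of B' — and recombine the resampled proportions using the population sizes as weights. A sketch, using table 3's strata (function name mine):

```python
import random

def stratified_bootstrap_ci(strata, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the overall "good" proportion when
    strata are sampled at different rates: resample each reviewed
    subset with replacement at its own size, then weight each stratum
    by its population size (not its sampling rate).

    strata: list of (N_h, reviewed) where reviewed is a 0/1 list."""
    rng = random.Random(seed)
    N = sum(N_h for N_h, _ in strata)
    stats = []
    for _ in range(n_boot):
        total_good = 0.0
        for N_h, reviewed in strata:
            boot = rng.choices(reviewed, k=len(reviewed))
            total_good += N_h * sum(boot) / len(boot)
        stats.append(total_good / N)
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# table 3 strata: (population size, 0/1 review outcomes)
strata = [
    (250, [1] * 90 + [0] * 10),    # score 100: 100 reviewed, 90 "good"
    (500, [1] * 80 + [0] * 20),    # score 99:  100 reviewed, 80 "good"
    (1000, [1] * 90 + [0] * 60),   # score 98:  150 reviewed, 90 "good"
]
lo, hi = stratified_bootstrap_ci(strata)
print(f"({lo:.3f}, {hi:.3f})")
```

The interval should bracket the weighted point estimate of 70% for "98+".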

So.. what did I misunderstand this time? There must be something.

And again, any suggestions for an analytical solution to my original conundrum are much appreciated.