I'm new here, so forgive me if I don't get something quite right. I have a question about how best to do a confidence analysis of a set of data that's sliced pretty thinly. I've looked around on the web and Usenet for a good place to ask, and folks around here seem to know what they are talking about. The situation is sufficiently complicated that I can't even figure out how to go about looking for the right way to analyze it. Any help or pointers would be appreciated.

So, I have a number of documents and a black-box analyzer that assigns each document a score from 0 to 100. Each document has also been categorized by a human reader into one of two categories, which I'll call "good" and "bad". In general, higher analyzer scores correlate with the "good" categorization, but of course the correspondence isn't perfect.

The data is sliced by analyzer score and the accuracy at each score is calculated. Taking the top three scores, it might be:

(table 1)

score: 100; 250 documents, of which 225 are "good", so the accuracy estimate is 90%

score: 99; 500 documents, of which 400 are "good", so the accuracy estimate is 80%

score: 98; 1000 documents, of which 600 are "good", so the accuracy estimate is 60%

etc...

So, if I want a confidence interval for each, I've been using the confidence interval for binomial proportions (that is, the "Normal approximation interval" at http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval ). So far, so good, I think (though tell me if I'm wrong, please).
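In case it helps to see exactly what I'm computing, here's a quick Python sketch of that normal approximation interval, applied to the score-100 row from table 1 (the function name is just mine):

```python
import math

def normal_approx_interval(good, n, z=1.96):
    """Normal approximation (Wald) CI for a binomial proportion, 95% by default."""
    p = good / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Row for score 100 in table 1: 225 "good" out of 250 documents.
lo, hi = normal_approx_interval(225, 250)
print(f"accuracy 90%, 95% CI: ({lo:.3f}, {hi:.3f})")  # → (0.863, 0.937)
```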

I also need to slice the data cumulatively by score, so "99+" means a score of 99 or higher, etc.:

(table 2)

score: 100; 250 documents, of which 225 are "good", so the cumulative accuracy estimate is 90%

score: 99+; 750 documents, of which 625 are "good", so the cumulative accuracy estimate is 83.3%

score: 98+; 1750 documents, of which 1225 are "good", so the cumulative accuracy estimate is 70%

etc...
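To be concrete about what I mean by "cumulatively": the cumulative counts are just running sums over the per-score rows, and I've been plugging those totals into the same interval formula. A Python sketch, using the counts from table 1:

```python
# Per-score rows from table 1: (score, documents, "good" documents).
rows = [(100, 250, 225), (99, 500, 400), (98, 1000, 600)]

cum_n = cum_good = 0
for score, n, good in rows:
    cum_n += n
    cum_good += good
    # e.g. "99+: 625/750 = 83.3%", matching table 2
    print(f"{score}+: {cum_good}/{cum_n} = {cum_good / cum_n:.1%}")
```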

I can certainly plug the values into the confidence interval calculations, and it even seems like it might be right, but I fear I'm missing something.

From there, things get complicated. Sometimes the documents are sampled. Sampling is random, but not at the same rate for all scores (don't ask why... ugh). So, we might see data like this:

(table 3)

score: 100; 250 documents, of which 100 were reviewed and 90 are "good", so the accuracy estimate is 90%

score: 99; 500 documents, of which 100 were reviewed and 80 are "good", so the accuracy estimate is 80%

score: 98; 1000 documents, of which 150 were reviewed and 90 are "good", so the accuracy estimate is 60%

etc...

Clearly with sampling, the confidence intervals will be larger.

It is straightforward, though tedious, to extrapolate from each row in the table to the estimated number of "good" documents at each score. At score 100, 90% accuracy on 250 documents gives an estimate of 225 "good" documents; at score 99, 80% accuracy on 500 documents gives an estimate of 400 "good" documents; so at a cumulative score of "99+" we have an estimate of 625 "good" documents.

(In this case the numbers are ridiculously consistent, so the cumulative table would be the same as table 2 above.)
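The extrapolation step, in a Python sketch (sampled counts from table 3; the variable names are just mine):

```python
# Rows from table 3: (score, total_docs, reviewed, reviewed_good).
rows = [(100, 250, 100, 90), (99, 500, 100, 80), (98, 1000, 150, 90)]

cum_total = cum_est_good = 0.0
for score, total, reviewed, good in rows:
    # Extrapolate the sample accuracy to the whole stratum at this score.
    est_good = total * good / reviewed
    cum_total += total
    cum_est_good += est_good
    print(f"{score}+: estimated {cum_est_good:.0f} good of {cum_total:.0f} "
          f"({cum_est_good / cum_total:.1%})")
```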

The documents with a score of 100 are sampled at a rate of 40% (100 of 250), those with a score of 99 at a rate of 20% (100 of 500), and those with a score of 98 at a rate of 15% (150 of 1000). My question now is: how do you give confidence intervals for the cumulative sampled accuracy estimates? Given the need to extrapolate at each score level so that each score's contribution is weighted by the number of documents it represents, I am completely without a clue as to how to combine the confidence intervals.

Also, any advice or concerns about how to proceed with small numbers (e.g., when there are 100 documents sampled at a rate of 5%) would be much appreciated. For the non-sampled reports I use the Wilson score interval (see the Wilson score interval section on the Wikipedia page above).
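For what it's worth, here's the Wilson score interval as I've implemented it (a Python sketch; the 4-of-5 example numbers are hypothetical, just to show the small-sample case), in case I've misread the formula:

```python
import math

def wilson_interval(good, n, z=1.96):
    """Wilson score interval for a binomial proportion, 95% by default."""
    p = good / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

# A tiny sample: 5 reviewed documents, of which 4 are "good".
lo, hi = wilson_interval(4, 5)
print(f"({lo:.3f}, {hi:.3f})")
```

Unlike the normal approximation, this stays inside [0, 1] even for tiny samples, which is why I reach for it here.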

Thanks a lot,

-Trey