Estimating failure rate

Hi all,

I need some assistance in guiding folks on a sampling problem. We have an application that performs transactions, and we're in the process of making modifications. We can't test all the possible permutations of all the variables (we're unwilling to spend the money), so we want to randomly sample some transactions.

Here's my issue. My first reaction was to estimate the failure rate using a binomial distribution. So, if we observe 5 failures in 50,000 randomly selected transactions, we can calculate the observed correctness (.9999 in this case) and estimate the 95% CI as P +/- 1.96 * sqrt((P * (1 - P)) / N).
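For concreteness, here is a minimal sketch of that calculation (the Wald / normal-approximation interval) with the numbers from the post. One caveat worth knowing: with only 5 observed failures the normal approximation is shaky, and Wilson or Clopper-Pearson intervals are usually recommended for rare events.

```python
# Wald (normal-approximation) 95% CI for the failure rate,
# using 5 failures out of 50,000 sampled transactions.
import math

failures = 5
n = 50_000
p_hat = failures / n                      # observed failure rate: 0.0001
se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of p_hat
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"observed correctness: {1 - p_hat:.4f}")
print(f"95% CI for failure rate: ({lo:.6f}, {hi:.6f})")
```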

However, I don't think it's reasonable to assume that failures are completely randomly distributed throughout the code. It's much more likely that the code has hot spots, and that a single code flaw will cause a cluster of transaction failures for some set of variables. Empirically, I created some two-dimensional tables; in half the tests I randomly seeded the table with failures, and in the other half I formed clusters of failures. When I then randomly pulled observations from each table, the true failure rate in the clustered tables was almost twice as likely to fall outside the calculated 95% CI as in the purely random tables.
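One thing worth noting: if each observation is drawn independently and uniformly from the table, the placement of the failures (clustered or not) doesn't change the distribution of the sample count, so the CI coverage should be the same. The effect described above appears when observations are correlated, e.g. drawn in contiguous runs of cells. Here is a sketch reproducing it under that assumption; the grid size, cluster shape, and run-based sampling are illustrative choices, not the original setup.

```python
# Compare Wald-CI coverage when failures are scattered vs. clustered,
# with observations drawn in contiguous runs (correlated samples).
import random

def coverage_miss_rate(clustered, trials=500, seed=0):
    rng = random.Random(seed)
    N = 10_000            # cells in the (flattened) table
    n_fail = 100          # true failure rate: 1%
    batch, n_batches = 20, 50   # 50 runs of 20 adjacent cells = 1000 obs
    p_true = n_fail / N
    misses = 0
    for _ in range(trials):
        cells = [0] * N
        if clustered:
            # 10 non-overlapping clusters of 10 contiguous failing cells
            for s in rng.sample(range(0, N - 10, 50), 10):
                for i in range(10):
                    cells[s + i] = 1
        else:
            # same number of failures, scattered uniformly
            for i in rng.sample(range(N), n_fail):
                cells[i] = 1
        # sample in contiguous runs rather than one cell at a time
        obs = []
        for _ in range(n_batches):
            s = rng.randrange(N - batch)
            obs.extend(cells[s:s + batch])
        n = len(obs)
        p_hat = sum(obs) / n
        se = (p_hat * (1 - p_hat) / n) ** 0.5
        if not (p_hat - 1.96 * se <= p_true <= p_hat + 1.96 * se):
            misses += 1
    return misses / trials

print("random    miss rate:", coverage_miss_rate(clustered=False))
print("clustered miss rate:", coverage_miss_rate(clustered=True))
```

In this setup the clustered tables miss the nominal 95% interval far more often, because a run of samples either hits a hot spot (many failures at once) or misses all of them, inflating the variance of the estimate well beyond the binomial formula.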

I *think* the answer to my question lies in using a negative binomial distribution somehow, but I've struggled to find material which I can interpret effectively (I'm not a stats major).

So, my two questions. First, does it sound like the negative binomial is the right direction? And second, does anyone have a good primer on using it effectively?