Why confidence intervals in e-discovery validation?

A question that often comes up when discussing e-discovery validation protocols is why they should be based on confidence intervals, rather than point estimates. That is, why do we say, for instance, "the production will be accepted if we have 95% confidence that its recall is greater than 60%"? Why not just say "the production will be accepted if its estimated recall is at least 75%"? Indeed, there have been some recent protocols that take the latter, point-estimate approach. (The ESI protocol of Global Aerospace, Inc. v. Landow Aviation, for instance, states simply that 75% recall shall be the "acceptable recall criterion", without specifying anything about confidence levels.) Answering this question requires some reflection on what we are trying to achieve with a validation protocol in the first place.

There are two ways that a validation protocol can fail: it can pass a bad production (a false positive error); and it can reject a good one (a false negative error). We want the probability of each of these to be as low as possible, of course, but there is an unavoidable tradeoff. The stricter we make the threshold, the less likely we are to accept a bad production, but the more likely we are to reject a good one. Conversely, looser thresholds will pass good productions more reliably, but pass bad ones more readily, too. The only way to decrease the likelihood of both failure modes is, not to adjust the threshold, but to increase the sample size. And so ultimately what we want to know is, what sample size is required to get both validation failure probabilities down to acceptable levels?
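The tradeoff can be made concrete with a small calculation. The following is only a sketch under assumptions that are not part of any actual protocol: a "bad" production is taken to have a true recall of 60% and a "good" one 80%, the sample consists of n relevant documents drawn at random, and the production passes if the fraction of sampled documents it contains reaches the threshold. The function names are illustrative.

```python
from math import comb

# Assumed true recall levels, chosen purely for illustration.
BAD_RECALL = 0.60
GOOD_RECALL = 0.80


def binom_pmf(k, n, p):
    """Probability that exactly k of n sampled relevant documents were produced."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def error_rates(n, threshold):
    """Return (P(accept a bad production), P(reject a good production)).

    The production passes when the sample estimate k/n is at least `threshold`.
    """
    accept_bad = sum(
        binom_pmf(k, n, BAD_RECALL) for k in range(n + 1) if k / n >= threshold
    )
    reject_good = sum(
        binom_pmf(k, n, GOOD_RECALL) for k in range(n + 1) if k / n < threshold
    )
    return accept_bad, reject_good


if __name__ == "__main__":
    # Fixed sample size: moving the threshold trades one error for the other.
    for threshold in (0.65, 0.70, 0.75):
        a, r = error_rates(50, threshold)
        print(f"n=50  threshold={threshold:.2f}  accept bad={a:.4f}  reject good={r:.4f}")

    # Fixed threshold: a larger sample shrinks both errors together.
    for n in (50, 200, 800):
        a, r = error_rates(n, 0.70)
        print(f"n={n:<3} threshold=0.70  accept bad={a:.6f}  reject good={r:.6f}")
```

In the first loop, tightening the threshold lowers the chance of accepting the bad production while raising the chance of rejecting the good one; in the second, both probabilities fall as the sample grows.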

The confidence interval approach directly answers this question of failure probability. When we say that "recall is greater than 60% with 95% confidence", we are bounding the probability of a false positive if we accept the production: were the true recall actually 60% or lower, a sample would clear this bar no more than 5% of the time. Here, the 60% recall is the lower limit on an acceptable quality of production. The unstated third term in this equation is the sample size; but it is the sample size that allows us to make these statements about confidence.

Conversely, if the production fails -- if the lower bound is less than our threshold (here, 60%) -- we can look at the upper bound on recall, and check that it is below the threshold of a system that should reliably pass. If, for instance, we had a confidence interval on recall of [55%, 95%], then we have a serious problem: we can't pass the production (lower-bound recall is only 55%), but we can't be confident that the production is not actually good (upper-bound recall is 95%).
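The three possible outcomes (pass, fail, or an interval too wide to say either) can be sketched as follows. This is an illustration rather than a prescribed method: it assumes a two-sided 95% Wilson score interval (one of several reasonable interval methods, and slightly stricter than a one-sided 95% bound), a sample of n relevant documents of which k were produced, and a hypothetical 75% recall as the level at which a production "should reliably pass". The function and parameter names are made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(k, n, confidence=0.95):
    """Two-sided Wilson score interval for the proportion k/n."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = k / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


def assess(k, n, accept_above=0.60, good_at_least=0.75):
    """Classify a validation sample.

    accept_above:  recall the lower bound must exceed (60% in the text).
    good_at_least: assumed recall of a production that should reliably pass.
    """
    lower, upper = wilson_interval(k, n)
    if lower > accept_above:
        return "pass"
    if upper < good_at_least:
        return "fail"
    # Lower bound too low to pass, upper bound too high to rule out a good production.
    return "inconclusive"


if __name__ == "__main__":
    for k, n in ((15, 20), (300, 400), (200, 400)):
        lower, upper = wilson_interval(k, n)
        print(f"{k}/{n}: [{lower:.3f}, {upper:.3f}] -> {assess(k, n)}")
```

A sample of 15 out of 20 and a sample of 300 out of 400 both give a point estimate of 75%, but only the larger one narrows the interval enough to pass; the smaller one lands in the inconclusive case described above.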

We'd prefer not to realize that we've failed a good system after validation is complete, of course. Therefore, we plan the sample size of our validation protocol in advance, to reduce the chance of failing a good production to an acceptable level. (And note that the lower bound of a "good" production must be higher than the upper bound of a "bad" production; otherwise, the probability of rejecting a good system is always one minus the probability of accepting a bad one.) Planning the protocol to pass good productions requires similar reasoning about sample sizes and failure probabilities to that of calculating the confidence interval after the production is sampled.

The problem with using a point estimate alone is that it gives us none of this information about confidence and failure probabilities. It fails to do so because it does not consider the size of the sample we have used. A point estimate of 75% recall means quite a different thing on a sample of 100 documents than it does on a sample of 10,000.
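A rough calculation shows how much the sample size matters here. The sketch below uses the simple normal-approximation (Wald) margin of error at 95% confidence, which is an assumption made for brevity rather than the method any particular protocol specifies, and it treats the sample as n randomly drawn relevant documents.

```python
from math import sqrt

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def margin_of_error(p_hat, n, z=Z_95):
    """Normal-approximation half-width of the interval around an estimate p_hat."""
    return z * sqrt(p_hat * (1 - p_hat) / n)


if __name__ == "__main__":
    for n in (100, 10_000):
        m = margin_of_error(0.75, n)
        print(f"n={n:>6}: 75% recall, plus or minus {100 * m:.1f} percentage points")
```

Under these assumptions, the 100-document sample leaves a margin of roughly 8.5 percentage points either side of the 75% estimate, while the 10,000-document sample leaves less than one point.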

One motive for the preference for point estimates is the perception that confidence intervals make the protocol unnecessarily difficult to satisfy. This objection is poorly founded in theory, in that one can generally state a confidence interval requirement that is as loose or strict as a point estimate one. The objection perhaps has some psychological basis, as it may seem (to the statistically naive) that a 95% confidence threshold of 60% is looser than a point estimate threshold of 75% recall. But that psychological judgment is achieved only by ignoring what the different bounds are telling us.
