Those who are following Ralph Losey's live-blogged production of material on involuntary termination from the EDRM Enron collection will know that he has reached what was to be the quality assurance step (though he has decided to do at least one more iteration of production for the sake of scientific verification). Quality assurance here involves taking a final sample from the part of the collection that is not to be produced -- what Ralph terms the "null set" -- and checking to see if any relevant documents have been missed. The outcome of this QA sample has led to an interesting discussion between Ralph and Gordon Cormack on the use and meaning of confidence intervals, and on how sure we can really be that (almost) no relevant documents have been missed. I've commented on the discussion at Ralph's blog; I thought it would not be amiss to expand upon those comments here.

In his production, Ralph is separately coding relevant and highly relevant documents. Let's look at the case of estimating the number of highly relevant documents that may have been missed by the production; the numbers involved are smaller, and more extreme cases of the sampling and estimation procedure are tested.

Ralph's production of relevant and highly relevant documents can be divided into three main stages:

- An initial simple random sample of 1507 documents from the full collection of 699,082 documents, none of which were found to be highly relevant.
- Several iterations of search, predictive coding, and review, culminating in a production list of 659 relevant documents, of which 18 were highly relevant. All of these 659 documents have been reviewed by Ralph and confirmed relevant. This leaves 699,082 - 659 = 698,423 documents in the null set. Some of the null set have also been reviewed by Ralph and found irrelevant, but that information is ignored for the next step.
- A final random sample of 1065 from the null set, of which 0 were found to be highly relevant. We'll refer to this as the quality assurance (QA) sample.

The question now is, how many highly relevant documents may Ralph have missed in the null set, and thus have failed to produce? Ralph observes that since the QA sample produced no highly relevant documents, our best estimate is that no highly relevant documents have been missed; if we add in the evidence of the initial sample, which also found no highly relevant documents, and of the production process, which only found 18 highly relevant documents, then it seems likely that very few, if any, have been missed.

At this point, though, Gordon observes that if we calculate a confidence interval on the number of highly relevant documents, we get rather a different story. Gordon calculated a two-tailed 95% confidence interval, but since we're looking for a probabilistic upper bound (and the lower bound is evidently going to be 0), I'll use a one-tailed upper confidence interval. Using the exact (Clopper-Pearson) interval on a proportion, the one-tailed 95% confidence upper bound on the proportion of highly relevant documents, based upon finding 0 in a sample of 1065, is 0.281%. Given there are 698,423 documents in the null set, this amounts to 1956 highly relevant documents. Thus, based on this final sample alone, the most we can say with 95% confidence is that we've found 18 / (18 + 1956) = 1% of the highly relevant documents. This is not, on the surface, an encouraging finding.
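The Clopper-Pearson bound takes an especially simple closed form when the sample contains zero positives: the one-tailed 1 − α upper bound is 1 − α^(1/n). A quick sketch in Python (the roughly 1960 documents this form yields is within rounding of the 1956 quoted; small differences come down to the particular interval calculator used):

```python
# One-tailed 95% Clopper-Pearson upper bound on a proportion, for the
# special case of 0 positives in a sample of n. With x = 0 the exact
# interval reduces to the closed form
#     p_upper = 1 - alpha ** (1 / n)
# since P(0 positives | p) = (1 - p) ** n.

alpha = 0.05        # one-tailed 95% confidence
n = 1065            # QA sample size
null_set = 698_423  # documents not to be produced
found = 18          # highly relevant documents actually produced

p_upper = 1 - alpha ** (1 / n)              # bound on the proportion
docs_upper = p_upper * null_set             # bound on missed documents
recall_lower = found / (found + docs_upper) # implied recall lower bound

print(f"upper bound on proportion: {p_upper:.3%}")      # ~0.281%
print(f"upper bound on missed docs: {docs_upper:.0f}")  # ~1962
print(f"lower bound on recall: {recall_lower:.1%}")     # ~0.9%
```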

How do we manage to conjure the ghosts of 1956 highly relevant documents potentially lurking in the null set from a sample containing 0 such documents? We work backwards as follows. Assume that 0.281% of the null set was in fact highly relevant; that is, that 99.719% of the null set is not highly relevant. What is the probability of sampling 1065 documents from a set in which 99.719% are not highly relevant and having no highly relevant documents in the sample? It is 0.99719 ^ 1065 = 5%. Thus, our upper bound on the 95% interval is the proportion relevant for which our observed sample result (or fewer -- but you can't have fewer than 0) has a 5% chance of occurring.
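That back-calculation is easy to verify numerically, as a sanity check on the figures above:

```python
# Sanity check: if 0.281% of the null set really were highly relevant,
# how likely is a random sample of 1065 to contain none of them?
p = 0.00281   # assumed proportion of highly relevant documents
n = 1065      # QA sample size

prob_zero = (1 - p) ** n
print(f"P(no highly relevant in sample) = {prob_zero:.3f}")  # ~0.050
```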

At this stage, however, as Ralph observes, something seems to be off in our reasoning. We did an initial random sample of 1507 documents, of which none were highly relevant. The 95% upper bound on that sample is 0.198%, or 1388 highly relevant documents. We then did a production and found 18 highly relevant documents. Now we're estimating an upper bound of 1956 highly relevant documents. Why have we gone backwards?

The issue is that one cannot in general simply add up the evidence from the two samples, especially when the first sample has been used to help form the production that the second sample is evaluating. Care needs to be taken in assessment protocols to make sure that sample evidence does not become "polluted" by being entangled in the process being measured; this is expressed in the maxim "separate training and testing data".

In fact, for this particular protocol, we *can* more or less add the two samples together, provided we assume that the production has done at least as well as random (and so has avoided "pushing" highly relevant documents out of the production into the null set) -- and, given that 2.7% of the production is highly relevant, this is a reasonable enough assumption. (And the production is too small to have much impact in any case.) So we can add the 1507 to the 1065 sample to derive a sample size of 2572, and a 95% confidence upper bound of 813 highly relevant documents left in the null set. (We're violating the strict assumptions of the exact binomial interval here, but the answer is a very good approximation.) But even with this reduced upper bound on the highly relevant documents missed, we've still only got a 95% lower bound of 18 / (18 + 813) = 2% recall.
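Under that at-least-as-good-as-random assumption, the pooled calculation looks like this (the same zero-positive closed form as before; as noted, this slightly bends the exact interval's assumptions):

```python
# Pool the initial sample (1507) and the QA sample (1065), both of
# which contained zero highly relevant documents, into one sample.
alpha = 0.05
n = 1507 + 1065     # combined sample size, 0 positives overall
null_set = 698_423
found = 18

p_upper = 1 - alpha ** (1 / n)   # one-tailed 95% upper bound
docs_upper = p_upper * null_set
recall_lower = found / (found + docs_upper)

print(f"upper bound on missed docs: {docs_upper:.0f}")  # ~813
print(f"lower bound on recall: {recall_lower:.1%}")     # ~2.2%
```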

What about the fact that an expert searcher (Ralph) and a reputable predictive coding system (Kroll OnTrack's inView) have made several iterations over the document collection, and that they've only been able to find 18 highly relevant documents? Does this influence our confidence? Put another way, had the production been performed instead by randomly picking 659 documents, and a QA sample on the null set produced the same result of no highly relevant documents, would we assign the same probabilistic upper-bound estimate? The answers to the latter two questions are, respectively, "no" and "yes": no, the fact that a careful production has been performed doesn't change our estimate; and yes, we'd give the same upper bound (for the same sample evidence) even if the production had been made at random.

It is not that our beliefs (prior to the sample) about the thoroughness of the production are irrational or have no evidentiary basis. Subjectively, they have a reasonable foundation (though we'd be wise not to place too much trust in our own judgment). Nor is it merely that we don't want to allow our estimate to depend upon assumptions about the quality of the process being estimated (though this is certainly part of the issue). The central problem is that we don't know how to objectively quantify and justify the probabilities associated with our prior beliefs and the evidence they are based on. We can't model the process by which the production has found the highly relevant documents it has, and so we can't say how likely it is that it has missed others. Random sampling, however, follows a simple selection process that we can model and reason about probabilistically. Chance is predictable; choice (by machine or human) is not: that is the foundational insight of modern statistical science.

Nevertheless, the self-imposed amnesia involved in the final QA sample emphasizes that such a sample does not, by itself, constitute an adequate assurance of the quality of a production. (Put another way, the production could be quite inept, and there could still be a good chance that the ineptitude would be missed by the QA sample, if relevant documents are rare enough.) Rather, such QA sampling needs to take place as part of a proven production protocol, one which incorporates various assessment checks, even if their evidence cannot formally be combined into a single confidence interval. The development of a standard for such protocols, as recently proposed by Jason Baron through an ANSI working group, can guide practitioners towards best practice, and relieve them of the responsibility of having to roll their own processes.
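To put a rough number on that risk, suppose (purely hypothetically -- the figure of 500 is mine, not drawn from Ralph's production) that an inept production had left 500 highly relevant documents in the null set. A QA sample of 1065 would still quite often find none of them:

```python
# Probability that a QA sample of 1065 finds zero highly relevant
# documents, if 500 (a hypothetical figure) remain in the null set.
# The binomial form below approximates sampling without replacement;
# that's fine here, as the sample is tiny relative to the null set.
missed = 500        # hypothetical highly relevant docs left behind
null_set = 698_423
n = 1065            # QA sample size

p = missed / null_set
prob_undetected = (1 - p) ** n
print(f"P(QA sample sees none of them) = {prob_undetected:.2f}")  # ~0.47
```

So even several hundred missed documents would escape this particular check nearly half the time, which is why the sample alone can't certify the production.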

William,

Thanks much for emphasizing and elaborating on the distinctness of the two samples for use in gleaning confidence intervals.

One question about your departure from Gordon's first (presumably cursory) approach of using a 2-tailed interval: you elected to use a 1-tailed interval to more accurately characterize the upper bound we are interested in. Can you comment on how this is a valid approach and not just an on-the-fly reaction driven by the observed measurement of zero? My limited understanding is that your approach is driven by the desire to more accurately characterize just the upper bound, but using a 1-tailed calculation wouldn't be appropriate solely because the measurement in a particular sample happened to be zero.

(Specifically, I'm referring to John Pezullo's reported discussion with Karl Schlag at the bottom of his CI-calculations page: http://statpages.org/confint.html.)

Ethan,

Hi! Thanks for a very insightful comment.

You're right that the choice to use a one-tailed rather than a two-tailed interval should be made as part of the protocol itself, before one observes the results of the sample. However, for this sort of quality assurance work, a one-tailed interval is in principle the correct one to use. The reason is that we are trying to bound the probabilistically worst-case behaviour of our production. If we couldn't state confidently that our performance was at least as good as some given level, then that might cause some decision to be made (to continue searching, to renegotiate the terms of the production, etc.). No such decision, however, is provoked by the probabilistically best-case behaviour; we don't say, "oh dear, our production might really be much better than we want; we'd better degrade it somehow".

For the initial sample that is made at the outset of a production, however, a two-tailed confidence interval is what we want, because we do want to predict within which bounds the number of relevant documents lies, and make decisions about our forthcoming production accordingly.

In an ideal world, we would always use a 1-tailed test for e-disco validation. What we really want to know is, "recall is at least X, with Z% confidence" (1-tailed), not "recall is between X and Y, with Z% confidence" (2-tailed). But to do that you must decide a priori that you are planning to use a 1-tailed test.

When I do this for real, I do use 1-tailed tests, but since Ralph used 2-tailed, I did not think it was appropriate to appear to switch horses in mid stream. And, to be honest, I did not want to introduce this factor in a legal blog.

As William points out, you get a slightly higher lower bound if you use a 1-tailed test, but the overall conclusion is the same.
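For the curious, here is the size of that difference on the QA sample's numbers, using the same zero-positive closed form as in the post (for the two-tailed interval, the upper bound puts α/2 in the upper tail; the ~1962-document figure differs slightly from the 1956 quoted above, depending on the calculator used):

```python
# One- vs two-tailed 95% Clopper-Pearson upper bounds, for 0 positives
# in a sample of 1065 drawn from a null set of 698,423 documents.
n = 1065
null_set = 698_423
found = 18

one_tailed = 1 - 0.05 ** (1 / n)    # all of alpha in the upper tail
two_tailed = 1 - 0.025 ** (1 / n)   # alpha/2 in each tail

for label, p in [("one-tailed", one_tailed), ("two-tailed", two_tailed)]:
    docs = p * null_set
    recall = found / (found + docs)
    print(f"{label}: at most {docs:.0f} missed, recall >= {recall:.2%}")
```

The one-tailed bound on missed documents is tighter (~1962 vs ~2415), so the implied lower bound on recall is slightly higher, as Gordon says; either way the overall conclusion stands.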