My last post introduced the idea of the satisfiability of a post-production quality assurance protocol. We said that such a protocol is not satisfiable for a given size of the sample from the unretrieved (or null) set if the protocol were to fail the production even if the sample found no relevant documents. The reason a protocol could fail in such a circumstance is that the upper bound of the confidence interval on the missed relevant documents could still be above our threshold.

Generally in statistics when we think about such tests, we look for something more than satisfiability. Rather, we think about how confident we are that what we consider a sufficiently complete production would pass the protocol -- what is known as the *power* of the test. Power stands in tension with the strictness of the test, that is, the confidence required for passing the threshold. The stricter our test, the less likely bad systems are to pass, but the more likely good systems are to fail.

Let's put this in concrete terms. To make things simpler, we'll work initially in terms of what Herb Roitblat calls elusion: the proportion of non-produced documents that are relevant. (This can be directly mapped to recall, but converting backwards and forwards is going to confuse us.) Our QA protocol might then be stated in such terms:

A production only passes QA if we have 95% confidence that elusion is no more than 1%.

Once we add to this a sample size, the protocol can be stated in terms of a minimum number of relevant documents that will fail the production if they are found in the sample. So, for instance, if we have a sample of 1000 documents, then a production will fail the above protocol if 5 or more relevant documents are found in the sample; the one-sided exact 95% upper confidence bound on elusion with 5 relevant documents out of 1000 sampled is 1.05%, whereas for 4 relevant it is 0.91%. (For those who are savvy with the R statistical language, we calculate the interval using `binom.test(5, 1000, alternative="less")`, and look at the result marked "95% confidence interval". The threshold value -- here 5 -- is calculated using `qbinom(1 - 0.95, 1000, 0.01)`; that is, the 5th percentile of the Binom(1000, 0.01) sampling distribution.)
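For readers without R to hand, the same two quantities can be sketched in plain Python. This is an illustration only: the function names are made up for this post, and the upper bound is found by bisection on the binomial CDF rather than by the closed-form method that `binom.test` uses.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(min(k, n) + 1))

def failure_threshold(n, max_elusion, conf=0.95):
    """Smallest count x with P(X <= x) >= 1 - conf (same definition as R's qbinom)."""
    x = 0
    while binom_cdf(x, n, max_elusion) < 1 - conf:
        x += 1
    return x

def upper_bound(x, n, conf=0.95):
    """One-sided exact upper confidence bound on the proportion, by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        # The CDF falls as p rises; the bound is where it hits 1 - conf.
        if binom_cdf(x, n, mid) > 1 - conf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(failure_threshold(1000, 0.01))        # failing number of relevant documents
print(round(100 * upper_bound(5, 1000), 2)) # upper bound (%) with 5 relevant found
print(round(100 * upper_bound(4, 1000), 2)) # upper bound (%) with 4 relevant found
```

The bisection is slower than a beta quantile, but it needs nothing beyond the standard library and makes the definition of the bound explicit.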

So now we need to decide what actual sample size is appropriate for our protocol. This decision hinges on two questions. The first question is how confident we want to be that a good production would pass our protocol, what is known as the "power" of the test. A power of 80% is a common choice. The second question is what we would consider a "good" production.

It might seem at first blush that the answer to the question "what is a good production?" is "a good production is one with elusion of no more than 1%", which is what our protocol is testing for. But in fact a production that has exactly the threshold elusion of the protocol only has a 5% chance or less of passing that protocol; this is what gives us the 95% confidence that a passing production has the threshold elusion or lower. (If in fact you want to define a production sitting exactly at the threshold as good, then you need to rethink your protocol, and set a looser rejection threshold.)

More generally, for a given sample size, we can state the probability that a production of a given level of performance will pass our protocol. For instance, if our sample size is 1000, and the production has actual elusion 0.5% (half the threshold amount), then it still has a 56% chance of failing our protocol. (In R, `1 - pbinom(5 - 1, 1000, 0.005)`, where 5 is the failure threshold on sample positives calculated earlier.) In fact, the maximum elusion that would have at least an 80% chance of passing our protocol is about 0.31%, under a third of the threshold amount (in R, `1 - qbeta(0.8, 1000 - (5 - 1), (5 - 1) + 1)`, exploiting the relationship between the beta and binomial distributions: a production passes when at most 5 - 1 = 4 relevant documents are sampled, and `pbinom(k, n, p)` equals `pbeta(1 - p, n - k, k + 1)`).
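The pass probability, and the largest elusion that still passes with a chosen probability, can be sketched in Python as follows. The function names are hypothetical, and the second function uses bisection on the binomial CDF in place of the beta quantile; it assumes, as above, that a production passes when fewer than `fail_at` relevant documents turn up in the sample.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(min(k, n) + 1))

def pass_probability(elusion, n=1000, fail_at=5):
    """Chance that fewer than fail_at relevant documents appear in the sample."""
    return binom_cdf(fail_at - 1, n, elusion)

def max_elusion_for_power(power=0.8, n=1000, fail_at=5):
    """Largest true elusion whose pass probability is still at least `power`."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        # Pass probability falls as elusion rises, so move toward the crossing.
        if pass_probability(mid, n, fail_at) >= power:
            lo = mid
        else:
            hi = mid
    return lo

print(round(1 - pass_probability(0.005), 3))    # failure probability at 0.5% elusion
print(round(100 * max_elusion_for_power(), 3))  # max elusion (%) with 80% pass rate
```

Evaluating `pass_probability` over a range of elusion values gives the full power curve of the protocol for a given sample size.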

What does this all mean for recall? Well, given a retrieved and an unretrieved size, and assuming all retrieved documents have been verified relevant (and all known relevant documents are retrieved), then a recall value can be mapped into an elusion value, and used in the above calculations. Let's therefore restate our QA protocol in terms of recall:

A production only passes QA if we have 95% confidence that recall is no less than 50%.
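Under the assumptions just stated (every retrieved document is relevant, and the only unknown relevant documents sit in the unretrieved set), the mapping between recall and elusion is simple arithmetic. A minimal sketch, with hypothetical function names and made-up sizes:

```python
def elusion_from_recall(recall, retrieved, unretrieved):
    """Elusion implied by a recall level, assuming all retrieved docs are relevant."""
    missed = retrieved * (1 - recall) / recall  # relevant docs left unretrieved
    return missed / unretrieved

def recall_from_elusion(elusion, retrieved, unretrieved):
    """Recall implied by an elusion level, under the same assumption."""
    missed = elusion * unretrieved
    return retrieved / (retrieved + missed)

# Illustrative sizes only: 1,000 retrieved, 100,000 unretrieved.
print(elusion_from_recall(0.50, 1000, 100000))
print(elusion_from_recall(0.75, 1000, 100000))
```

Note that at 50% recall the number of missed relevant documents equals the number retrieved, so the threshold elusion is simply the ratio of retrieved to unretrieved documents.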

We also need to specify what constitutes a good production, one that we want to be 80% confident will pass our protocol. Let's specify that a production with actual recall of 75% is good. Then we can ask, what sample size is required to satisfy our evaluation criteria for a given retrieval size? The following graph gives the answer:

... but it's a bit hard to read. Let's show the same thing in a log-log graph (note the axes carefully):

The line is approximately straight in the log-log graph because there's an inverse relationship between retrieval and sample size (doubling the retrieval size roughly halves the sample size required), at least while retrieval size is a small fraction of collection size. In fact, *for this particular collection size, and threshold and "good" recall*, the approximate relationship is captured by the formula:

(Don't pick sample sizes for actual protocols using this approximation, however: our test statistic -- relevant documents sampled -- is discrete, and inference from discrete statistics can be very sensitive to apparently minor differences in sample size.)
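A search for the required sample size can be sketched along the following lines. This is not the code behind the figures: the function names, the retrieved and unretrieved sizes in the example, and the use of a plain binomial model (ignoring that the sample is drawn without replacement from a finite set) are all assumptions made for illustration.

```python
def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    if k < 0:
        return 0.0
    term = (1 - p) ** n            # P(X = 0)
    total = term
    for i in range(k):
        term *= (n - i) / (i + 1) * p / (1 - p)   # P(X = i + 1) from P(X = i)
        total += term
    return total

def elusion_from_recall(recall, retrieved, unretrieved):
    """Elusion implied by a recall level, assuming all retrieved docs are relevant."""
    return retrieved * (1 - recall) / recall / unretrieved

def max_passing_count(n, threshold_elusion, alpha):
    """Largest k with P(X <= k) <= alpha at threshold elusion; -1 if none."""
    k = -1
    while k + 1 <= n and binom_cdf(k + 1, n, threshold_elusion) <= alpha:
        k += 1
    return k

def required_sample_size(retrieved, unretrieved, threshold_recall=0.5,
                         good_recall=0.75, conf=0.95, power=0.8):
    """Smallest n at which a 'good' production passes with probability >= power."""
    p_thr = elusion_from_recall(threshold_recall, retrieved, unretrieved)
    p_good = elusion_from_recall(good_recall, retrieved, unretrieved)
    for n in range(1, unretrieved + 1):
        k = max_passing_count(n, p_thr, 1 - conf)
        if k >= 0 and binom_cdf(k, n, p_good) >= power:
            return n
    return None  # no sample of the unretrieved set satisfies the criteria

# Hypothetical sizes: doubling the retrieval roughly halves the sample needed.
print(required_sample_size(2500, 497500))
print(required_sample_size(5000, 495000))
```

Passing `threshold_recall=0.3` and `conf=0.9` to the same function explores a loosened protocol. Because the search scans every sample size in turn, it respects the discreteness of the test statistic rather than relying on the smooth approximation.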

From the figure we can see that, if we are going to validate our production with a post-production sample of 2,400 from the unretrieved set, then there's no point performing the validation unless our retrieval has found at least 2,700-odd relevant documents, because otherwise even a good production can fail the validation. And, if we set the upper bound on the number of documents we're prepared to look at during validation to 10,000, then productions of fewer than 640 documents essentially cannot be validated, unless we're prepared to loosen our validation requirements. If, for instance, we set the threshold recall at 30%, and require only 90% confidence in beating this threshold (while retaining 75% recall as our definition of a "good" retrieval), then required sample size falls from around 10,000 to below 2,000. But we'd want to be sure that we're willing to accept the reduced confidence and lower threshold.

In summary, the feasibility of using sampling-based methods to validate the recall of an e-discovery production is highly sensitive to production size. For small productions, strict protocols will require an impractically large sample size. Protocol parameters should be determined with an eye to expected collection yield and production size, and where expectation or (production-independent) evidence suggests relevant documents are rare, alternative levels or methods of production validation must be considered.