It is common in e-discovery protocols to see a requirement that the production be certified with a "95% +/- X%" sample (where "X%" takes on values such as "2%" or "5%"), leading to a required sample size being specified up front. (See, for instance, the ESI protocol recently debated in the ongoing Da Silva Moore case.) This approach, however, makes little sense, for two reasons. First, it specifies the accuracy of our measurement, when what we actually want to specify is a minimum level of performance. And second, decisions about sample size and allocation should be delayed until after the (candidate) production is ready, when they can be made much more efficiently and effectively.
Before discussing those two assertions in more detail, let's unpack what is going on with the "95% +/- 2%" specification. What this says is that we will draw a sample large enough to set a confidence interval no wider than 4% end to end (+/- 2%, though in practice the interval is not always symmetric around the point estimate). For simple random samples used to estimate the prevalence (or richness) of relevant documents in a document set, the maximum width of the interval can be determined from the sample size, and occurs when the point estimate is 50%. We are therefore saying that our widest interval will be [48%, 52%]. From this, we can work out the sample size required: for +/- 2% and an exact binomial confidence interval, the sample size is 2,399.
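As a sanity check on that figure, here is a minimal sketch (assuming Python with scipy; the search strategy and rounding conventions are my own, so the exact number it lands on may differ slightly) of finding the smallest sample size whose worst-case exact interval is no wider than +/- 2%:

```python
# Smallest simple random sample size whose worst-case exact
# (Clopper-Pearson) 95% interval on prevalence is no wider than 4%
# end to end. The worst case occurs when the point estimate is 50%.
from scipy.stats import beta

def clopper_pearson(k, n, conf=0.95):
    """Exact binomial confidence interval for k successes out of n."""
    alpha = 1 - conf
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

n = 100
while True:
    lo, hi = clopper_pearson(n // 2, n)  # worst case: estimate near 50%
    if hi - lo <= 0.04:
        break
    n += 1
print(n)  # lands in the neighbourhood of the 2,399 cited above
```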
Prevalence is not what we're really interested in with certification samples, however; rather, we care about recall. For estimating recall from a simple random sample, what matters is not the size of the full sample, but the number of relevant documents that happen to fall in the sample. We therefore need some basis for estimating corpus prevalence, such as the initial control sample (used internally by the producing party to guide their production); alternatively, we can keep sampling until we achieve the desired number of relevant documents. In the latter case, we end up with a slightly biased estimator; in the former, we can only offer a probabilistic guarantee on the maximum width (since our estimate of corpus prevalence might be off).
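To make the dependence on prevalence concrete, a back-of-the-envelope sketch (the target and prevalence figures here are hypothetical, not taken from any protocol):

```python
# If we need a certain number of relevant documents in the sample, the
# total sample size scales inversely with the estimated prevalence.
target_relevant = 400    # relevant documents wanted in the sample
est_prevalence = 0.10    # hypothetical figure from a control sample
total_sample = target_relevant / est_prevalence
print(round(total_sample))  # 4,000 documents to draw and review
```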
What is wrong with this approach? First of all, we are guaranteeing the wrong thing. The accuracy of our measure is only a means to an end: the end is certifying that the production meets some lower bound on recall (with some probabilistic degree of confidence). Undertaking to measure production recall accurately is of little value if we end up with a very accurate certification that we have a very lousy production. And even the accuracy we are measuring is likely to be at the wrong point, since it is unlikely that 50% recall will be our minimum threshold on performance. Instead, we should be certifying something of the form "95% confidence that production recall is at least 65%", and designing our sampling strategy to maximize the accuracy of the measure at the specified recall threshold.
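A sketch of what such a certification might look like in practice (the counts and the 65% threshold are illustrative only, and the one-sided exact bound is one natural choice among several):

```python
# Certify "95% confidence that recall is at least 65%": among the
# relevant documents found in a simple random certification sample,
# count how many fall in the production, and test the one-sided
# exact (Clopper-Pearson) lower bound on recall.
from scipy.stats import beta

def recall_lower_bound(k, n, conf=0.95):
    """One-sided exact lower bound on recall, given that k of the n
    relevant sampled documents were in the production."""
    return beta.ppf(1 - conf, k, n - k + 1) if k > 0 else 0.0

k, n = 312, 400  # hypothetical certification sample counts
print(recall_lower_bound(k, n) >= 0.65)  # True: certification passes
```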
The second problem with the 95% +/- 2% approach is that it commits us to the details of a sample design before the production has been made, and even before review proper has commenced. We are designing blind, and therefore our design will be suboptimal. (In truth, even for the 95% +/- 2% case, we can't make the decision completely blind, since, as mentioned, we need a reasonable estimate of corpus prevalence in order to select the sample size. But even this information is likely to be unreliable before serious review has commenced.) Instead, we should wait until we know what the (candidate) production set will be, and also have a basis for estimating not just overall corpus prevalence, but prevalence in the production and null sets. With this knowledge, we can design a much more efficient certification sampling protocol, one that requires far fewer documents to be sampled and annotated to achieve the same level of reliability. One approach to drawing such a sample is stratified sampling; I hope shortly to give a worked example showing just how dramatic the savings can be.
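Pending that worked example, here is a rough sketch of the flavour of such a design: Neyman allocation of a fixed review budget across the production and null-set strata (the stratum sizes and prevalences below are entirely hypothetical):

```python
# Neyman allocation: sample each stratum in proportion to N_h * sd_h,
# where sd_h = sqrt(p_h * (1 - p_h)) is the per-document standard
# deviation of relevance in stratum h. For a fixed budget, this
# minimizes the variance of the estimated count of relevant documents.
import math

strata = {
    # stratum: (size in documents, assumed prevalence of relevance)
    "production": (100_000, 0.80),
    "null set": (900_000, 0.01),
}
budget = 1_000  # total documents we can afford to review

weights = {h: N * math.sqrt(p * (1 - p)) for h, (N, p) in strata.items()}
total = sum(weights.values())
for h, w in weights.items():
    print(f"{h}: {round(budget * w / total)} documents")
```

Note how the allocation concentrates review effort on the null set, where the uncertainty about missed relevant documents lives; a prevalence-blind design drawn before production cannot do this.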
What this means in practice is that a sample size cannot sensibly be specified in an ESI protocol, at least not in one agreed between the parties prior to review (which, of course, is when it should be agreed). Instead, what should be specified (in combination with other considerations such as cost, proportionality, and the quality of the production process itself) is that a certain level of recall will be achieved, and certified to a certain level of confidence, with the producing side undertaking to propose a certification sampling design to achieve this post-production. If the producing side baulks at this as too strong a commitment, then the next best step is that a fixed sample size be agreed to, but that the sample design itself wait to be agreed between the two parties based upon the evidence of the production. Selecting this sample size will depend on heuristics and experience; but then the 95% +/- 2% approach, for all its seeming exactness, commits the parties to an inefficient certification process, in exchange for the wrong guarantees.
[...] is required—and we should think long and hard before we concede that it is not—then a separate certification sample must be [...]
I think there's a more fundamental problem with the way "plus or minus X%" is commonly misinterpreted in the edisco community. Here's an example taken from the real world, with the numbers changed slightly:
"From a collection of 1 million documents, 100,000 documents were produced. A sample of 395 of the produced documents was taken, which showed that 80,000 of the produced documents were responsive, plus or minus 5%. A sample of 395 of the withheld documents was taken, which showed that 9,900 of the withheld documents were responsive, plus or minus 5%. Ergo, the recall was 80,000/89,900 = 89%, plus or minus 5%."
Correctly calculated (which, as you know, takes some effort), the margin of error on the recall is more like "plus or minus 50%," ten times worse than claimed, rendering the estimate practically useless.
The key problem with the reasoning above is that the "9,900 documents, plus or minus 5%" is a horribly imprecise estimate. This becomes clear when you convert the percentage to a number of documents. What this estimate really says is "9,900 documents, plus or minus 45,000 documents" (5% of the 900,000 withheld documents). The margin of error is more than four times as large as the point estimate!
The moral of the story: if you are counting documents, state your point estimate and margin of error in terms of documents.
Gordon,
Hi! Yes, gosh, that is quite wrong. You can't just propagate the "plus or minus 5%" like that.
Using my offline calculator for recall confidence intervals, the correct interval on recall is [78.9%, 96.2%]; that is, roughly minus 10%, plus 7%.
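For readers without such a calculator, a crude Monte Carlo propagation gets you into the same neighbourhood. The per-stratum counts below are my reconstruction from the rounded percentages in the example, not figures from the original matter:

```python
# Rough Monte Carlo check of the recall interval. The sample counts
# (316/395 responsive among produced, 4/395 among withheld) are
# reconstructed from the rounded percentages above, so the result
# will only approximate the [78.9%, 96.2%] from the exact calculator.
import numpy as np

rng = np.random.default_rng(0)
N_prod, N_null = 100_000, 900_000
k_prod, n_prod = 316, 395
k_null, n_null = 4, 395

# Posterior draws for the two stratum prevalences (uniform priors).
p_prod = rng.beta(k_prod + 1, n_prod - k_prod + 1, 100_000)
p_null = rng.beta(k_null + 1, n_null - k_null + 1, 100_000)

recall = p_prod * N_prod / (p_prod * N_prod + p_null * N_null)
print(np.percentile(recall, [2.5, 97.5]))  # roughly [0.77, 0.96]
```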
Effectiveness estimation and reporting are rife with errors in e-discovery at the moment, including in recent and ongoing precedent-setting cases (not to name names). Both vendors and lawyers need to seek professional advice on quantitative matters in order to represent their clients defensibly. Given the ubiquity of statistical howlers, the side with such professional advice will enjoy an enormous advantage over its opponents.
William
"(not to name names)"
Please, let's do!