Meaningful QA sample size in e-discovery

In my last post, I examined the live-blog e-discovery production being performed by Ralph Losey, and asked what lower limit we could place on the recall of highly relevant documents, with 95% confidence, based on the final quality assurance (QA) sample. The QA sample drew 1065 documents from the null set (that is, the set of documents that were not slated for production). Although none of these documents was highly relevant, this only allows us to say with 95% confidence that no more than 0.281% of the null documents are highly relevant. Since there are 698,423 null documents, this represents an upper bound of 1962 highly relevant documents that have been missed. As only 18 highly relevant documents were found in the production, Ralph's lower-bound recall is about 1% (18 out of a possible 18 + 1962). To get this lower-bound recall up to 50%, you'd need to sample well over 100,000 documents (roughly 116,000) from the null set without finding any that are highly relevant.
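
For readers who want to check this arithmetic, here is a minimal Python sketch. It assumes the QA sample is a simple random draw and uses the binomial (sampling with replacement) approximation, which is reasonable when the sample is small relative to the null set. The function names are illustrative and not taken from any library.

```python
import math


def zero_hit_upper_bound(sample_size, confidence=0.95):
    """Upper confidence bound on a proportion when a random sample of
    `sample_size` items contains no positives (binomial approximation)."""
    alpha = 1.0 - confidence
    # Solve (1 - p) ** sample_size = alpha for p.
    return 1.0 - alpha ** (1.0 / sample_size)


def recall_lower_bound(found, null_size, sample_size, confidence=0.95):
    """Lower bound on recall given `found` relevant documents in the
    production and an all-negative sample drawn from the null set."""
    missed_max = zero_hit_upper_bound(sample_size, confidence) * null_size
    return found / (found + missed_max)


# Figures quoted in the post.
p_max = zero_hit_upper_bound(1065)
missed_max = p_max * 698_423
recall_min = recall_lower_bound(18, 698_423, 1065)

print(f"upper bound on proportion: {p_max:.3%}")    # 0.281%
print(f"upper bound on missed:     {missed_max:.0f}")  # 1962
print(f"lower bound on recall:     {recall_min:.1%}")  # 0.9%
```

An exact treatment of sampling without replacement would use the hypergeometric distribution instead; with a sample this small relative to the null set, the difference should be negligible.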

Such figures might cause us to question the usefulness of quality assurance sampling. How reasonable is it to include assurance requirements that force us to review tens of thousands of documents to verify that we've only found a highly relevant handful because highly relevant documents are rare? One way to address the question of effort required is to set up a simple protocol, and see how big a sample is necessary for this protocol to be satisfiable. Let's say the requirement is that we have 95% confidence in a lower bound of 50% on recall. And, to keep things simple, let's say that we'll select a QA sample just large enough to meet this requirement, provided none of the documents in it prove relevant (at whatever threshold of relevance we're testing). (This is a primitive form of [edit: power] analysis, adequate only for the sake of our current discussion. To do things properly, we need also to consider the likelihood of such a sample appearing for a hypothesized level of "acceptable" system performance. Don't choose sample sizes for QA of actual productions based only on the discussion here!)

With these assumptions, we quickly realize that there are two factors that determine the required size of the QA sample: the size of the null set, and the size of the production (assuming that the production consists of all and only manually-verified relevant documents). Let r be the size of the production, and N the size of the null set. The sample size n required for an all-negative sample to satisfy our protocol is:


n = \log_{1 - r/N}0.05
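
The formula follows from a simple requirement. Recall falls to 50% when the null set holds as many relevant documents as the production, that is, a proportion r/N of the null set. We want an all-negative sample of size n to be improbable in that case: (1 - r/N)^n must be no more than 0.05. A minimal Python sketch of the calculation, under the same binomial approximation as before; the function name `required_sample_size` is an illustrative choice, not part of any library.

```python
import math


def required_sample_size(production_size, null_size, confidence=0.95):
    """Smallest all-negative sample from the null set that gives the stated
    confidence that recall is at least 50%, assuming the production holds
    `production_size` verified relevant documents."""
    if not 0 < production_size < null_size:
        raise ValueError("need 0 < production_size < null_size")
    alpha = 1.0 - confidence
    # n = log(alpha) / log(1 - r/N); log1p keeps precision when r/N is tiny.
    n = math.log(alpha) / math.log1p(-production_size / null_size)
    return math.ceil(n)


# Ralph's production: 18 highly relevant documents, 698,423 null documents.
print(required_sample_size(18, 698_423))  # roughly 116,000
```

When r is much smaller than N, log(1 - r/N) is close to -r/N, and since -ln 0.05 is about 3, the formula reduces to the rule of thumb n ≈ 3N/r.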

Setting the null set size at 700,000, as for Ralph's production, gives the following figure:

[Figure: Sample size for production size]

Required sample size drops precipitously with increased production size; so precipitously that it is difficult to read off corresponding values. Plotting on a log-log graph makes the correspondence clearer; the relationship becomes a straight line:

[Figure: Sample size for production size (log-log)]

While a production of 20 documents requires a sample size of around 100,000 for our QA protocol to be satisfiable, a sample size of 1000 documents is sufficient once the production exceeds about 2100 documents. The size of the null set also affects the minimum protocol-satisfiable sample size; where r \ll N, doubling the null set size roughly doubles the sample size required. Nevertheless, if the production is large enough, then our QA protocol is satisfiable with a feasibly sized sample.
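
These figures can be reproduced, and the question turned around (how large must the production be before a given sample size suffices?), with a short sketch. As before, it assumes the binomial approximation and a 95% confidence level, and the function names are illustrative only.

```python
import math


def required_sample_size(production_size, null_size, confidence=0.95):
    """All-negative sample size needed for a 50% lower bound on recall."""
    alpha = 1.0 - confidence
    return math.ceil(math.log(alpha) / math.log1p(-production_size / null_size))


def min_production_size(sample_size, null_size, confidence=0.95):
    """Smallest production for which an all-negative sample of
    `sample_size` documents satisfies the protocol."""
    alpha = 1.0 - confidence
    # Invert (1 - r/N) ** n = alpha to get r.
    return math.ceil(null_size * (1.0 - alpha ** (1.0 / sample_size)))


# Sample size against production size, for the original and a doubled null set.
print(f"{'production':>10} {'N=700,000':>12} {'N=1,400,000':>12}")
for r in (20, 200, 2000, 20_000):
    n_small = required_sample_size(r, 700_000)
    n_large = required_sample_size(r, 1_400_000)
    print(f"{r:>10} {n_small:>12} {n_large:>12}")

# Production size at which a 1000-document sample becomes sufficient.
print(min_production_size(1000, 700_000))  # roughly 2100
```

Each tenfold increase in production size cuts the required sample roughly tenfold, which is why the log-log plot is close to a straight line.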

In short, the informativeness of a QA sample depends upon the characteristics of topic and collection. If relevant documents are rare, then even an effective production will be small; meanwhile, sampling can set only so low an upper bound on the proportion of relevant documents in the null set; and if the collection itself is large, then the plausible number of missed relevant documents will overwhelm the number actually found. If relevant documents are less rare, though, and if the collection size has been kept under control, then productions will be larger, plausibly missed documents will be fewer, and QA sampling is able to confirm a satisfactory level of recall.
