There has been a flurry of interest over the past couple of days in Judge Miller's order in re Biomet. In their e-discovery process, the defendants employed a keyword filter to reduce the size of the collection, and fed only the post-filtering documents into their vendor's predictive coding system, which seems to be a frequent practice at the current stage of adoption of predictive coding technology. The plaintiffs demanded that the defendants instead apply predictive coding to the full collection. Judge Miller found in favour of the defendants.
In support of their position, the defendants prepared a detailed affidavit setting out their sample-based evaluation (as well as providing costings and other interesting details concerning their computer-assisted review). This affidavit is somewhat confusing in that it expresses the inclusiveness of the keyword filter in terms of prevalence rates in different parts of the collection (2% in the collection as a whole; 1% in the null set; 16% in the filtered-in set). What we really care about is recall; that is, the proportion of responsive documents that are included in the production---or rather, here, the upper bound on recall, since the filtered-in set is the input to, not the output from, review.
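To be concrete about the quantity at stake (my formulation, not the affidavit's):

$$\text{recall} = \frac{\text{number of responsive documents in the production}}{\text{number of responsive documents in the collection}}$$

A keyword filter caps the numerator at the number of responsive documents that survive filtering, which is why the filtered-in set yields only an upper bound on recall.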
The sampling procedure used to estimate the completeness of the filtering step is described in some detail in Paragraphs 11 through 13 of Exhibit C of the defendants' affidavit (physical page 86 of the PDF). The full collection is made up of 19.5 million documents (I'll round figures for simplicity). The keyword filter extracted 3.9 million of them into the document review system; de-duplication reduced this further to 2.5 million unique documents. The question, then, is what proportion of the responsive documents made it into the filtered-in set; or, conversely, what proportion were left behind in the 15.6 million excluded documents?
The defendants drew three separate and independent (and partially redundant) samples to answer this question: one of the full collection, one of the filtered-in set, and one of the filtered-out set. The defendants' affidavit provides confidence intervals, but to begin with, let's just work with point estimates of the quantities of interest. The sample results were as follows:
| Segment | Population | Sample | # Responsive | % Responsive | Yield estimate |
|---|---|---|---|---|---|
| Full collection | 19.5 million | | | ~1.9% | 370,000 |
| Filtered-out | 15.6 million | | | ~0.95% | 148,000 |
| Filtered-in (dedup'ed) | 2.5 million | 1,689 | 273 | 16.2% | 405,000 |
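For readers who want to check the arithmetic, here is a minimal sketch of how a yield estimate falls out of a sample. It assumes simple random sampling and a normal approximation to the binomial; the affidavit does not state which interval method was actually used, so the bounds will not match its figures exactly:

```python
import math

def yield_estimate(population, sample_size, responsive, z=1.96):
    """Point estimate and normal-approximation confidence interval for the
    number of responsive documents in a population, given a simple random
    sample. z=1.96 gives a 95% interval; z=2.576 gives 99%."""
    p = responsive / sample_size                # sample proportion responsive
    se = math.sqrt(p * (1 - p) / sample_size)   # standard error of the proportion
    return (p * population,                     # point estimate of yield
            (p - z * se) * population,          # lower bound
            (p + z * se) * population)          # upper bound

# Filtered-in (dedup'ed) segment: 273 responsive out of 1,689 sampled.
est, lo, hi = yield_estimate(2_500_000, 1_689, 273)
print(f"{est:,.0f} [{lo:,.0f}; {hi:,.0f}]")     # ~404,000 [360,200; 448,000]
```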
Immediately, you'll see we have a problem. The yield estimate for the filtered-in set exceeds that for the entire collection, let alone that for the collection minus the filtered-out set. In fact, the latter (222,000) is well outside the 99% confidence interval on the filtered-in set's yield, [348,126; 464,887]. And that's understating matters: the filtered-out and collection estimates are undeduplicated, whereas the filtered-in estimate is deduplicated (de-duplication reduces the filtered-in set to around 65% of its undeduplicated size).
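To spell the comparison out with the rounded point estimates (the interval bound is the figure quoted above):

```python
# Yield implied for the filtered-in set by the other two estimates:
collection, filtered_out = 370_000, 148_000
print(collection - filtered_out)   # 222,000: well below the 99% CI lower bound of 348,126
```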
And indeed the working of the affidavit here is incorrect. I quote verbatim from Paragraph 11 of Exhibit C:
> A random sample with a confidence level of 95% and estimation interval of 2.377% consisting of 1,689 documents was drawn from the 2.5+ million documents published to Axcelerate. This sample was reviewed to obtain a baseline relevance rate for the document population created by keyword culling. 273 documents were identified as relevant in the sample, indicating with 95% confidence that the percentage of relevant documents in the population is between 184,268 and 229,162 or stated in percentages, between 14.41% and 17.91%.
The interval of "between 184,268 and 229,162" is consistent with the other estimates in the above table (the collection-minus-filtered-out figure of 222,000 falls within it), but it is _not_ "between 14.41% and 17.91%" of "2.5+ million documents", as a few seconds with a calculator will confirm (rather, it is between 7.37% and 9.17%). The 14.41% to 17.91% interval is correct given the stated sample counts (273 responsive out of 1,689), but it is inconsistent with the other estimates.
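The "few seconds with a calculator", for the record:

```python
# The affidavit's absolute interval as a share of the 2.5M-document population:
lo, hi = 184_268 / 2_500_000, 229_162 / 2_500_000
print(f"{lo:.2%} to {hi:.2%}")   # 7.37% to 9.17%, not 14.41% to 17.91%
```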
I'm not sure what mistake has been made here (perhaps a confusion between pre-deduplication and post-deduplication figures?), but let's set the sample of the filtered-in set aside and look just at the other two samples in the above table. The point estimates are around 370,000 responsive documents in the full collection, and 148,000 responsive documents in the filtered-out set. This means that roughly 40% of the responsive documents are excluded from the set sent to document review; in other words, even a flawless review process can achieve a maximum recall of only 60%. (And that assumes, further, that the collection itself has not excluded responsive material.)
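In code, the back-of-the-envelope version (point estimates only; a fuller treatment would propagate the confidence intervals through the subtraction):

```python
# Ceiling on recall implied by the two remaining point estimates:
collection_yield = 370_000   # estimated responsive documents, full collection
excluded_yield = 148_000     # estimated responsive documents, filtered-out set
print(f"max recall = {1 - excluded_yield / collection_yield:.0%}")   # max recall = 60%
```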
It's not my intention to comment on whether the defendants' use of keyword pre-filtering is appropriate and proportionate for the particular circumstances of this case. But these figures do illustrate the likelihood that keyword pre-filtering will exclude a large volume of responsive data, and often (where producing parties are not as thorough as the current defendants have been in their sampling and validation protocol) without anyone being aware of it. Setting aside the problems artificially imposed by vendors charging by volume, I venture to suggest that keyword pre-filtering does not represent best production practice.
And finally, to return to an earlier point, it's really time that practitioners and counsel stopped producing the plethora of different estimates that we find in this case, and started sampling for, estimating, and reporting the prime quantity of interest: namely, recall.