What is the maximum recall in re Biomet?

There has been a flurry of interest over the past couple of days in Judge Miller's order in re Biomet. In their e-discovery process, the defendants employed a keyword filter to reduce the size of the collection, and fed only the post-filter documents into their vendor's predictive coding system, which seems to be a frequent practice at the current stage of adoption of predictive coding technology. The plaintiffs demanded that the defendants instead apply predictive coding to the full collection. Judge Miller found in favour of the defendants.

In support of their position, the defendants prepared a detailed affidavit setting out their sample-based evaluation (as well as providing costings and other interesting details concerning their computer-assisted review). This affidavit is somewhat confusing in that it expresses the inclusiveness of the keyword filter in terms of prevalence rates in different parts of the collection (2% in the collection as a whole; 1% in the null set; 16% in the filtered-in set). What we really care about is recall; that is, the proportion of responsive documents that are included in the production---or rather, here, the upper bound on recall, since the filtered-in set is the input to, not the output from, review.
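To make that upper bound explicit (my notation, not the affidavit's): if $R$ is the number of responsive documents in the full collection and $R_{in}$ the number of responsive documents that survive the keyword filter, then however good the subsequent review is,

$$\mathrm{recall} = \frac{\text{responsive documents produced}}{R} \;\le\; \frac{R_{in}}{R}.$$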

The sampling procedure used to estimate the completeness of the filtering step is described in some detail in Paragraphs 11 through 13 of Exhibit C of the defendants' affidavit (physical page 86 of the PDF). The full collection is made up of 19.5 million documents (I'll round figures for simplicity). The keyword filter extracted 3.9 million of them into the document review system; de-duplication reduced this further to 2.5 million unique documents. The question then is, what proportion of the responsive documents made it into the filtered-in set; or, conversely, what proportion were left among the 15.6 million excluded documents?

The defendants drew three separate and independent (and partially redundant) samples to answer this question: one from the full collection, one from the filtered-in set, and one from the filtered-out set. The defendants' affidavit provides confidence intervals, but to begin with, let's just work with point estimates of the quantities of interest. The sample results were as follows:

| Segment | Population | Sample size | Responsive in sample | % responsive | Yield estimate |
|---|---|---|---|---|---|
| Collection | 19.5 million | 4,146 | 80 | 1.9% | 370,000 |
| Filtered-out | 15.6 million | 4,146 | 39 | 0.95% | 148,000 |
| Filtered-in (dedup'ed) | 2.5 million | 1,689 | 273 | 16.2% | 405,000 |
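For anyone who wants to re-run the arithmetic, here's a minimal Python sketch of how the yield point estimates are formed (population size times sample proportion). I'm using the rounded figures from the table, so the outputs land near, but not exactly on, the table's yield estimates, which derive from the affidavit's exact counts and reported rates.

```python
# Point estimate of responsive-document yield per segment:
# yield = population size * (responsive in sample / sample size).
# Figures are the rounded ones from the table above, not the affidavit's exact counts.
segments = {
    # name: (population, sample size, responsive documents in sample)
    "collection":   (19_500_000, 4_146, 80),
    "filtered-out": (15_600_000, 4_146, 39),
    "filtered-in":  (2_500_000, 1_689, 273),
}

for name, (population, sample, responsive) in segments.items():
    rate = responsive / sample
    print(f"{name:12s}: {rate:6.2%} responsive, ~{population * rate:,.0f} responsive documents")
```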

Immediately, you'll see we have a problem. The yield estimate for the filtered-in set exceeds that for the entire collection, let alone for the collection minus the filtered-out set. In fact, the latter (222,000) is well outside the 99% confidence interval on the filtered-in set yield of [348,126; 464,887]. And that's understating matters: the filtered-out and collection estimates are undeduplicated, whereas the filtered-in estimate is deduplicated (deduplication reduces the filtered-in set to around 65% of its undeduplicated size).
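As a rough cross-check on that interval (my own normal-approximation calculation on the rounded figures, not the affidavit's exact method, so the bounds differ slightly from [348,126; 464,887]):

```python
import math

# Approximate 99% (Wald) confidence interval on the filtered-in yield,
# then compare it against the collection-minus-filtered-out point estimate.
population = 2_500_000           # filtered-in, deduplicated (rounded)
sample, responsive = 1_689, 273
z99 = 2.576                      # two-sided 99% normal critical value

p_hat = responsive / sample
half_width = z99 * math.sqrt(p_hat * (1 - p_hat) / sample)
low, high = (p_hat - half_width) * population, (p_hat + half_width) * population
print(f"99% CI on filtered-in yield: [{low:,.0f}; {high:,.0f}]")   # roughly [346,000; 462,000]
print(f"Collection minus filtered-out: {370_000 - 148_000:,}")     # 222,000, far below the interval
```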

And indeed the working of the affidavit here is incorrect. I cite verbatim from Paragraph 11 of Exhibit C:

A random sample with a confidence level of 95% and estimation interval of 2.377% consisting of 1,689 documents was drawn from the 2.5+ million documents published to Axcelerate. This sample was reviewed to obtain a baseline relevance rate for the document population created by keyword culling. 273 documents were identified as relevant in the sample, indicating with 95% confidence that the percentage of relevant documents in the population is between 184,268 and 229,162 or stated in percentages, between 14.41% and 17.91%.

The interval of "between 184,268 and 229,162" is coherent with the other estimates in the above table, but it is _not_ "between 14.41% and 17.91%" of "2.5+ million documents", as a few seconds with a calculator will confirm (rather, it is between 7.37% and 9.17%). The 14.41% to 17.91% interval is correct given the stated sample output, but doesn't work with the other estimates.
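The calculator check, in code (with the population rounded to 2.5 million):

```python
# The document-count interval and the percentage interval quoted in Paragraph 11
# cannot both be fractions of the same 2.5 million-document population.
population = 2_500_000
print(184_268 / population, 229_162 / population)  # 0.0737, 0.0917 -> 7.37% and 9.17%
print(0.1441 * population, 0.1791 * population)    # 360,250 and 447,750, not 184,268-229,162
```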

I'm not sure what mistake has been made here (perhaps a confusion between pre-deduplication and post-deduplication figures?), but let's set aside the sample from the filtered-in set, and look just at the other two samples in the above table. The point estimates are around 370,000 responsive documents in the full collection, and 148,000 responsive documents in the filtered-out set. This means that roughly 40% of the responsive documents are excluded from the set sent to document review, or in other words that even a flawless review process can achieve a maximum recall of 60%. (And that's further assuming that the collection itself has not excluded responsive material.)
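Spelled out, the maximum-recall arithmetic on those two point estimates (ignoring sampling error and deduplication effects):

```python
# Upper bound on recall implied by the keyword filter, using only the
# collection and filtered-out point estimates (rounded figures from the post).
collection_yield = 370_000   # estimated responsive documents in the full collection
excluded_yield = 148_000     # estimated responsive documents in the filtered-out set

excluded_fraction = excluded_yield / collection_yield
max_recall = 1 - excluded_fraction
print(f"excluded: {excluded_fraction:.0%}, maximum achievable recall: {max_recall:.0%}")
# -> excluded: 40%, maximum achievable recall: 60%
```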

It's not my intention to comment on whether the defendants' use of keyword pre-filtering is appropriate and proportionate to the particular circumstances of this case. But these figures do illustrate how likely it is that keyword pre-filtering will exclude a large volume of responsive material, and often (where defendants are not as thorough as the current ones have been in their sampling and validation protocol) without anyone being aware of it. The problems artificially imposed by vendors charging by volume aside, I venture to suggest that keyword pre-filtering does not represent best production practice.

And finally, to return to an earlier point, it's really time that practitioners and counsel stopped producing the plethora of different estimates that we find in this case, and started sampling for, estimating, and reporting the prime quantity of interest: namely, recall.

4 Responses to “What is the maximum recall in re Biomet?”

  1. Greg A. says:

    Putting aside the issue of the incorrect math, I think you rightly note that this case hinges on the efficacy of the keyword searches. I think it is a bit too broad a brush to say that search terms will always exclude an (unreasonably) large number of relevant documents.

    There are widely known methodologies to test and improve search term recall and precision. Used correctly, I think that search terms are still an efficient and defensible means to cull the review universe prior to the application of predictive coding. I don't think you will hear that very much from the vendor side as it doesn't sell per gigabyte predictive coding charges.

    In this case, what is not clear is whether any of these techniques were used, and there is no analysis showing that the search terms' recall was reasonable in proportion to the additional cost necessary to improve it.

    I'll let Losey speak as to the lower bounds of acceptable recall for either search terms or predictive coding, but my two cents is that the test should be on the basis of proportionality.

  2. James Keuning says:

    Thanks for writing this. I am still chewing on the numbers. I have asked a few questions on some other blogs and haven't gotten a response (http://bit.ly/15mqa5f).

    One question about your numbers - you say that the 16.6 million were sampled. My understanding is that only the 15.5 million null set was sampled - the 1.4 million documents which were duplicated out were not sampled. Thus we do not know the responsiveness of the duped-out docs. Perhaps they were 100% responsive, perhaps zero!

    "The remaining 15,576,529 documents not selected by keywords (referred to as the “null set”)..."

    "To obtain the relevance rate of the null set, Biomet reviewed a random sample of 4,146 documents..."

  3. william says:

    James,

    I'd committed a calculation error myself -- the null set was 15.6 million documents, not 16.6 million (corrected in post). This would be undeduplicated, if I read the affidavit correctly. That still means that around 40% of the responsive material is excluded by the keyword filter.

  4. […] Is it OK to use keyword search to cull down the document population before applying predictive coding?  Must be careful not to cull away too many of the relevant documents (e.g., the Biomet case). […]
