Comments on: What is the maximum recall in re Biomet? William Webber's E-Discovery Consulting Blog Tue, 23 Sep 2014 06:24:50 +0000

By: Highlights from the East Coast eDiscovery & IG Retreat 2014 | Clustify Blog – eDiscovery, Document Clustering, Predictive Coding, Information Retrieval, and Software Development Tue, 23 Sep 2014 06:24:50 +0000 […] Is it OK to use keyword search to cull down the document population before applying predictive coding? Must be careful not to cull away too many of the relevant documents (e.g., the Biomet case). […]

By: william Mon, 06 May 2013 14:16:16 +0000 James,

I'd committed a calculation error myself -- the null set was 15.6 million, not 16.6 million, documents (corrected in post). This would be the undeduplicated count, if I read the affidavit correctly. That still means that around 40% of the responsive material is excluded by the keyword filter.
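
For anyone who wants to redo this arithmetic, here is a minimal sketch of how an "excluded by the keyword filter" share can be computed. The function name and every input value in the usage lines are hypothetical placeholders, not figures from the affidavit; only the structure of the calculation is the point.

```python
def fraction_excluded(null_size, null_rate, hit_size, hit_rate):
    """Share of all responsive documents that sit in the null set.

    null_size, hit_size: document counts for the null set and the
    keyword-selected set. null_rate, hit_rate: estimated proportion
    of responsive documents in each (values between 0 and 1).
    """
    responsive_null = null_size * null_rate   # responsive docs the filter dropped
    responsive_hit = hit_size * hit_rate      # responsive docs the filter kept
    total = responsive_null + responsive_hit
    if total == 0:
        raise ValueError("no responsive documents in either set")
    return responsive_null / total


# Hypothetical inputs, for illustration only.
share = fraction_excluded(null_size=1_000_000, null_rate=0.01,
                          hit_size=100_000, hit_rate=0.15)
print(f"{share:.1%} of responsive material is in the null set")
```

Because the null set is so much larger than the keyword-selected set, even a low responsiveness rate in the null set can account for a large share of the responsive material.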

By: James Keuning Mon, 06 May 2013 00:16:23 +0000 Thanks for writing this. I am still chewing on the numbers. I have asked a few questions on some other blogs and haven't gotten a response.

One question about your numbers: you say that the 16.6 million were sampled. My understanding is that only the 15.6 million-document null set was sampled; the 1.4 million documents that were deduplicated out were not sampled. Thus we do not know the responsiveness of the deduplicated documents. Perhaps they were 100% responsive, perhaps zero!
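
To make the uncertainty concrete, here is a rough sketch of the bounds that follow when one group of documents is sampled and another is not. The function name is made up, and the rate in the usage lines is a hypothetical placeholder; the two sizes are the rounded figures discussed here.

```python
def responsive_bounds(sampled_size, sampled_rate, unsampled_size):
    """Lower and upper bound on responsive documents outside the keyword hits.

    sampled_size: size of the set whose rate was estimated by sampling.
    sampled_rate: estimated proportion responsive in that set (0 to 1).
    unsampled_size: size of the set nobody sampled; anywhere from none
    to all of it could be responsive.
    """
    estimated = sampled_size * sampled_rate
    low = estimated                     # unsampled docs are 0% responsive
    high = estimated + unsampled_size   # unsampled docs are 100% responsive
    return low, high


# Sizes are rounded; the 1% rate is a hypothetical placeholder.
low, high = responsive_bounds(sampled_size=15_600_000,
                              sampled_rate=0.01,
                              unsampled_size=1_400_000)
print(f"between {low:,.0f} and {high:,.0f} responsive documents")
```

The gap between the two bounds is the full size of the unsampled set, which is why the responsiveness of the deduplicated documents matters so much.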

"The remaining 15,576,529 documents not selected by keywords (referred to as the “null set”)..."

"To obtain the relevance rate of the null set, Biomet reviewed a random sample of 4,146 documents..."

By: Greg A. Wed, 01 May 2013 20:58:18 +0000 Putting aside the issue of the incorrect math, I think you rightly note that this case hinges on the efficacy of the keyword searches. But I think it paints with too broad a brush to say that search terms will always exclude an unreasonably large number of relevant documents.

There are widely known methodologies to test and improve search term recall and precision. Used correctly, I think that search terms are still an efficient and defensible means to cull the review universe prior to the application of predictive coding. I don't think you will hear that very much from the vendor side, as it doesn't sell per-gigabyte predictive coding charges.

In this case, what is not clear is whether any of these techniques were used, and there is no analysis showing that the search terms' recall was reasonable in proportion to the additional cost of improving it.

I'll let Losey speak as to the lower bounds of acceptable recall for either search terms or predictive coding, but my two cents is that the test should be on the basis of proportionality.