What is the maximum recall in re Biomet?

April 24th, 2013

There has been a flurry of interest the past couple of days over Judge Miller's order in re Biomet. In their e-discovery process, the defendants employed a keyword filter to reduce the size of the collection, and input only the post-filtering documents to their vendor's predictive coding system, which seems to be a frequent practice at the current stage of adoption of predictive coding technology. The plaintiffs demanded that instead the defendants apply predictive coding to the full collection. Judge Miller found in favour of the defendants.
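The question in the title has a simple arithmetic component: a keyword filter puts a ceiling on the recall that any downstream predictive coding can achieve, since documents the filter discards can never be produced. A minimal sketch of that ceiling, with made-up figures (not the Biomet numbers):

```python
# Hypothetical figures, for illustration only -- not the Biomet numbers.
keyword_recall = 0.65   # fraction of relevant documents that survive the keyword filter
pc_recall      = 0.80   # recall of predictive coding over the filtered set

# Overall recall is the product of the two stages, and can never
# exceed the recall of the keyword filter itself.
overall_recall = keyword_recall * pc_recall
print(f"overall recall = {overall_recall:.2f} (maximum possible = {keyword_recall:.2f})")
```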
Read the rest of this entry »

Stratified sampling in e-discovery evaluation

April 18th, 2013

Point estimates of, and lower confidence bounds on, the completeness (or recall) of an e-discovery production are calculated by sampling documents from both the production and the remainder of the collection (the null set). The most straightforward way to draw this sample is as a simple random sample (SRS) across the whole collection, produced and unproduced. However, the same level of accuracy can be achieved for a fraction of the review cost by using stratified sampling instead. In this post, I introduce the use of stratified sampling in the evaluation of e-discovery productions. In a later post, I will provide worked examples, illustrating the saving in review cost that can be achieved.
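As a taste of the idea ahead of the promised worked examples, here is a minimal sketch of a stratified point estimate of recall, treating the production and the null set as the two strata and weighting each stratum's sampled prevalence by its size; all figures below are invented:

```python
# A minimal sketch of stratified estimation of recall, with the production
# and the null set as the two strata. All figures are invented.
strata = {
    # name: (stratum size, sample size, relevant documents found in sample)
    "production": (50_000, 400, 300),
    "null_set":   (450_000, 600, 12),
}

# Estimated relevant documents per stratum = stratum size * sampled prevalence.
est_relevant = {
    name: size * (rel / sample)
    for name, (size, sample, rel) in strata.items()
}

# Point estimate of recall: estimated relevant in the production
# over estimated relevant in the whole collection.
recall = est_relevant["production"] / sum(est_relevant.values())
print(f"estimated recall = {recall:.3f}")
```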
Read the rest of this entry »

Does automatic text classification work for low-prevalence topics?

January 26th, 2013

Readers of Ralph Losey's blog will know that he is an advocate of what he calls "multimodal" search in e-discovery; that is, a search methodology that involves a mix of human-directed keyword search, human-machine blended concept search, and machine-directed text classification (or predictive coding, in the e-discovery jargon). Meanwhile, he deprecates the alternative model of machine-driven review, in which only text classification technology is employed, and the human's sole function is to code machine-selected documents as responsive or non-responsive. The difference between the two modes can be seen most clearly in the creation of the seed set (that is, the initial training set created to bootstrap the text classification process). In multimodal review, the seed set might be taken from (a sample of) the results of active search by the human reviewer; in machine-driven review, the seed set is formed by a random sample from the collection.
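The contrast in how the seed set is formed can be sketched in a few lines. Neither function below is taken from Ralph's workflow or from any particular product; they are crude illustrations, with the keyword search reduced to substring matching:

```python
import random

def random_seed_set(collection, k):
    """Machine-driven style: seed the classifier with a simple random
    sample of k documents drawn from the whole collection."""
    return random.sample(collection, k)

def keyword_seed_set(collection, queries, k):
    """Multimodal style (crudely): seed with up to k pooled hits of
    reviewer-chosen keyword searches, reduced here to substring matching."""
    hits = [(doc_id, text) for doc_id, text in collection
            if any(q in text.lower() for q in queries)]
    return hits[:k]

# Both functions expect the collection as a list of (doc_id, text) pairs.
```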
Read the rest of this entry »

Why confidence intervals in e-discovery validation?

December 9th, 2012

A question that often comes up when discussing e-discovery validation protocols is, why should they be based on confidence intervals, rather than point estimates? That is, why do we say, for instance, "the production will be accepted if we have 95% confidence that its recall is greater than 60%"? Why not just say "the production will be accepted if its estimated recall is 75%"? Indeed, there have been some recent protocols that take the latter, point-estimate approach. (The ESI protocol of Global Aerospace, Inc. v. Landow Aviation, for instance, states simply that 75% recall shall be the "acceptable recall criterion", without specifying anything about confidence levels.) Answering this question requires some reflection on what we are trying to achieve with a validation protocol in the first place.
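The practical difference is easy to see in a toy calculation. The sketch below assumes, purely for simplicity, that recall is estimated from a simple random sample of relevant documents, each checked for whether it was produced; the figures are invented:

```python
from scipy.stats import beta

# Invented validation sample: of 20 sampled relevant documents, 15 were produced.
n, k = 20, 15

point_estimate = k / n                          # 0.75: meets a "75% estimated recall" criterion
lower_bound_95 = beta.ppf(0.05, k, n - k + 1)   # exact (Clopper-Pearson) one-sided lower bound

print(f"point estimate of recall:   {point_estimate:.2f}")
print(f"95% lower confidence bound: {lower_bound_95:.2f}")
```

Here the point estimate just clears a 75% bar, yet the sample is too small to support a claim, at 95% confidence, that recall exceeds 60%.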
Read the rest of this entry »

The environmental consequences of SIGIR

December 4th, 2012

As it is becoming apparent that, without drastic immediate action, we are going to significantly overshoot greenhouse gas emission targets and warm the planet by an environmentally disastrous 4 to 5 degrees centigrade by the end of the century, I thought I should fulfil my long-standing promise to myself and calculate the carbon emissions generated by the annual SIGIR conference. I'm only going to consider here the emissions caused by air travel, though air travel is likely to account for the overwhelming majority of total emissions.
Read the rest of this entry »

Statistical power of E-discovery validation

September 5th, 2012

My last post introduced the idea of the satisfiability of a post-production quality assurance protocol. We said that such a protocol is not satisfiable for a given size of sample from the unretrieved (or null) set if the protocol would fail the production even when the sample turned up no relevant documents. The reason a protocol could fail in such a circumstance is that the upper bound of the confidence interval on the number of missed relevant documents could still be above our threshold.
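In code, the satisfiability check amounts to asking: if the null-set sample comes back entirely clean, does the resulting lower bound on recall reach the acceptance threshold? A sketch under simple assumptions (exact binomial upper bound for a zero-positive sample; all parameter values below are placeholders, not drawn from any real matter):

```python
def satisfiable(sample_size, null_size, produced_relevant,
                recall_threshold, confidence=0.95):
    """Can the protocol possibly pass, even with a perfectly clean sample?

    Uses the exact binomial upper bound on null-set prevalence when a sample
    of `sample_size` documents contains zero relevant documents.
    """
    # Upper bound p_u satisfies (1 - p_u)^n = 1 - confidence.
    p_upper = 1 - (1 - confidence) ** (1 / sample_size)
    max_missed = p_upper * null_size
    lower_bound_recall = produced_relevant / (produced_relevant + max_missed)
    return lower_bound_recall >= recall_threshold

# Placeholder figures: even a clean sample of 1,500 documents cannot certify
# 75% recall here, so the protocol is not satisfiable at that sample size.
print(satisfiable(sample_size=1_500, null_size=500_000,
                  produced_relevant=1_000, recall_threshold=0.75))
```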
Read the rest of this entry »

Meaningful QA sample size in e-discovery

August 13th, 2012

In my last post, I examined the live-blogged e-discovery production being performed by Ralph Losey, and asked what lower limit we could place on the recall of highly relevant documents with 95% confidence, based on the final, quality assurance sample. The QA sample drew 1065 documents from the null set (that is, the set of documents that were not slated for production). Although none of these documents were highly relevant, this still only allows us to say with 95% confidence that no more than 0.281% of the null documents are highly relevant. Since there are 698,423 null documents, this represents an upper bound of 1962 highly relevant documents that have been missed. As only 18 highly relevant documents were found in the production, Ralph's lower-bound recall is below 1%. To get this lower-bound recall up to 50%, you'd need to sample over 100,000 documents from the null set without finding any highly relevant documents.
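The figures above can be reproduced with a few lines, using the exact binomial upper bound for a sample that contains zero positives; only the inputs from the post are used, and the sample-size calculation at the end simply inverts the same bound:

```python
import math

null_size = 698_423        # documents not slated for production
sample_size = 1_065        # QA sample from the null set; zero highly relevant found
produced_relevant = 18     # highly relevant documents found in the production
confidence = 0.95

# Exact binomial upper bound on null-set prevalence for a clean sample:
# (1 - p)^n = 1 - confidence  =>  p = 1 - (1 - confidence)**(1/n).
p_upper = 1 - (1 - confidence) ** (1 / sample_size)                  # ~0.281%
max_missed = p_upper * null_size                                     # ~1,962 documents
recall_lower = produced_relevant / (produced_relevant + max_missed)  # ~0.9%

# Clean-sample size needed before the lower bound reaches 50%, i.e. before
# max_missed falls to no more than produced_relevant:
target_p = produced_relevant / null_size
needed = math.ceil(math.log(1 - confidence) / math.log(1 - target_p))
print(p_upper, max_missed, recall_lower, needed)
```

Running this reproduces the 0.281% upper bound on null-set prevalence, the 1,962-document ceiling on missed highly relevant documents, and a lower-bound recall just under 1%; the final figure is the clean-sample size required before the lower bound reaches 50%.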
Read the rest of this entry »

Quality assurance samples and prior beliefs

August 8th, 2012

Those who are following Ralph Losey's live-blogged production of material on involuntary termination from the EDRM Enron collection will know that he has reached what was to be the quality assurance step (though he has decided to do at least one more iteration of production for the sake of scientific verification). Quality assurance here involves taking a final sample from the part of the collection that is not to be produced -- what Ralph terms the "null set" -- and checking to see if any relevant documents have been missed. The outcome of this QA sample has led to an interesting discussion between Ralph and Gord Cormack on the use and meaning of confidence intervals, and how sure we can really be that (almost) no relevant documents have been missed. I've commented on the discussion at Ralph's blog; I thought it would not be amiss to expand upon those comments here.
Read the rest of this entry »

Tutorial on confidence intervals in e-discovery

August 2nd, 2012

Ever since Judge Grimm opined that random sampling constituted a prudent method for checking the reliability of a production (Victor Stanley v. Creative Pipe, 269 F.R.D. 497), there has been strong interest in the topic of sampling within e-discovery, including from lawyers themselves. Ralph Losey, for instance, has devoted a post on his blog to the topic of sampling, and his recent blog posts narrating an example predictive coding exercise have contained much sampling-related material.

I've written some research work on more advanced topics in confidence intervals, but I thought it might be useful to write some more introductory material as well. I originally intended to write a series of blog posts giving a brief tutorial on sampling and estimation, but the brief tutorial worked out to be around 5,000 words, so I've made it into a separate document: A tutorial on interval estimation for a proportion, with particular reference to e-discovery. The tutorial aims to give an understanding of the workings behind confidence intervals, while avoiding as much math as possible. (If you want an even more high-level discussion of sampling, estimation, and intervals, then I recommend Venkat Rangan's post on Predictive Coding -- Measurement Challenges.) The tutorial is marked as Version 0.1; I'd be very grateful for any corrections, comments, or suggestions for improvement, and will work them into later versions.
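For readers who would rather see an interval computed than derived, here is a minimal example using the Wilson score interval, one of the standard methods for a binomial proportion (the tutorial may well favour a different method; the sample figures here are invented):

```python
import math

def wilson_interval(k, n, z=1.96):
    """Two-sided 95% Wilson score interval for a binomial proportion,
    given k successes observed in a sample of size n."""
    phat = k / n
    denom = 1 + z**2 / n
    centre = (phat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Invented example: 7 relevant documents found in a sample of 400.
print(wilson_interval(7, 400))
```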

Do document reviewers need legal training?

July 15th, 2012

In my last post, I discussed an experiment in which we had two assessors re-assess TREC Legal documents under both less and more detailed guidelines, and found that the more detailed guidelines did not make the assessors more reliable. Another natural question to ask of these results, though not one the experiment was directly designed to answer, is how well our assessors compared with the first-pass assessors employed for TREC, who for this particular topic (Topic 204 from the 2009 Interactive task) happened to be a review team from a vendor of professional legal review services. How well do our non-professional assessors compare to the professionals?
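One simple way to quantify such a comparison, though not necessarily the analysis used in the post, is raw agreement and Cohen's kappa over the documents judged by both parties. A sketch with invented judgements:

```python
from collections import Counter

def agreement_and_kappa(labels_a, labels_b):
    """Raw agreement and Cohen's kappa for two sets of binary judgements."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent judgements with the same marginals.
    pa, pb = Counter(labels_a), Counter(labels_b)
    expected = sum((pa[c] / n) * (pb[c] / n) for c in set(labels_a) | set(labels_b))
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Invented judgements: 1 = responsive, 0 = non-responsive.
ours = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
trec = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]
print(agreement_and_kappa(ours, trec))
```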
Read the rest of this entry »