Measuring incremental cost-to-production in predictive coding

August 14th, 2013

I had the opportunity on Monday to give a talk on processes for predictive coding in e-discovery to the Victorian Society for Computers and the Law. The key novel suggestion of my talk was that the effectiveness of the iteratively-trained classifier should be measured not (only) by abstract metrics of effectiveness such as F score, but (also) directly by the cost / benefit tradeoff facing the production manager. In particular, I advocated a new ranking metric, depth for recall.
Read the rest of this entry »

Change of career, change of name

August 13th, 2013

This blog has followed my own research interests in becoming increasingly focused upon evaluation and technology questions in e-discovery, rather than in information retrieval more generally. Now my own career has followed my interests out the ivy-clad gates of academia and into private consulting in e-discovery. In recognition of these changes, I've also changed the name of my blog, from "IREvalEtAl" to "Evaluating E-Discovery". There will be some other cosmetic changes to follow, but (for now at least) we're at the same URL and on the same RSS feed.

The bias of sequential testing in predictive coding

June 25th, 2013

Text classifiers (or predictive coders) are in general trained iteratively, with training data added until acceptable effectiveness is achieved. Some method of measuring or estimating effectiveness is required (or, more precisely, of predicting effectiveness on the remainder of the collection). The simplest way of measuring classifier effectiveness is with a random sample of documents, from which a point estimate and confidence bounds on the selected effectiveness metric (such as recall) are calculated. It is essential, by the way, to make sure that these test documents are not also used to train the classifier, as testing on the training examples greatly exaggerates classifier effectiveness. (Techniques such as cross-validation simulate testing from training examples, but we'll leave them for a later discussion.)
Read the rest of this entry »

Non-authoritative relevance coding degrades classifier accuracy

June 21st, 2013

There has been considerable attention paid to the high level of disagreement between assessors on the relevance of documents, not least on this blog. This level of disagreement has been cited to argue in favour of the use of automated text analytics (or predictive coding) in e-discovery: not only do humans make mistakes, but they may make as many as or more than automated systems. But automated systems are only as good as the data used to train them, and production managers have an important choice to make in generating this training data. Should training annotations be performed by an expert, but expensive, senior attorney? Or can they be farmed out to the less expensive, but possibly less reliable, contract attorneys typically used for manual review? This choice comes down to a trade-off between cost and reliability, though ultimately reliability itself can be (at least partly) reduced to cost, too. The cost question still needs to be addressed; but Jeremy Pickens (of Catalyst) and I have made a start on the question of reliability in our recent SIGIR paper, Assessor Disagreement and Text Classifier Accuracy.
Read the rest of this entry »

Why 95% +/- 2% makes little sense for e-discovery certification

May 25th, 2013

It is common in e-discovery protocols to see a requirement that the production be certified with a "95% +/- X%" sample (where "X%" takes on values such as "2%" or "5%"), leading to a required sample size being specified up front. (See, for instance, the ESI protocol that was recently debated in the ongoing Da Silva Moore case.) This approach, however, makes little sense, for two reasons. First, it specifies an accuracy in our measure, when what we want to specify is some minimal level of performance. And second, decisions about sample size and allocation should be delayed until after the (candidate) production is ready, when they can be made much more efficiently and effectively.
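To make the link between a "95% +/- X%" requirement and an up-front sample size concrete, here is a minimal sketch. It assumes the requirement is read as the usual normal-approximation margin of error for a proportion, with the worst-case prevalence of 0.5 and no finite-population correction; the function name `sample_size` is purely illustrative and does not come from any protocol.

```python
from math import ceil
from statistics import NormalDist


def sample_size(margin, confidence=0.95, proportion=0.5):
    """Sample size for estimating a proportion to within +/- margin.

    Assumes the normal approximation n = z^2 * p * (1 - p) / margin^2.
    proportion=0.5 is the worst case (it maximises p * (1 - p)).
    """
    # Two-sided critical value, e.g. roughly 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * proportion * (1 - proportion) / margin ** 2)


if __name__ == "__main__":
    for margin in (0.02, 0.05):
        print(f"95% +/- {margin:.0%}: sample size {sample_size(margin)}")
```

Note that nothing in this calculation refers to the level of performance the production actually achieves; the sample size depends only on the desired width of the interval.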
Read the rest of this entry »

What is the maximum recall in re Biomet?

April 24th, 2013

There has been a flurry of interest over the past couple of days in Judge Miller's order in re Biomet. In their e-discovery process, the defendants employed a keyword filter to reduce the size of the collection, and input only the post-filtering documents to their vendor's predictive coding system, which seems to be a frequent practice at the current stage of adoption of predictive coding technology. The plaintiffs demanded that instead the defendants apply predictive coding to the full collection. Judge Miller found in favour of the defendants.
Read the rest of this entry »

Stratified sampling in e-discovery evaluation

April 18th, 2013

Point- and lower-bound confidence estimates on the completeness (or recall) of an e-discovery production are calculated by sampling documents, from both the production and the remainder of the collection (the null set). The most straightforward way to draw this sample is as a simple random sample (SRS) across the whole collection, produced and unproduced. However, the same level of accuracy can be achieved for a fraction of the review cost by using stratified sampling instead. In this post, I introduce the use of stratified sampling in the evaluation of e-discovery productions. In a later post, I will provide worked examples, illustrating the saving in review cost that can be achieved.
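As a rough illustration of the idea, the sketch below computes a point estimate of recall from two strata, the production and the null set, each sampled separately. It is one straightforward way to combine stratum samples, not necessarily the estimator developed in the full post; the function name and all figures are hypothetical, and the lower-bound calculation is left out.

```python
def stratified_recall(prod_size, prod_sampled, prod_relevant,
                      null_size, null_sampled, null_relevant):
    """Point estimate of recall from separate samples of two strata.

    Each stratum's relevant-document count is estimated by scaling the
    sample proportion up to the stratum size. Recall is the estimated
    number of relevant documents produced, divided by the estimated
    number of relevant documents in the whole collection.
    """
    est_prod_relevant = prod_size * prod_relevant / prod_sampled
    est_null_relevant = null_size * null_relevant / null_sampled
    total = est_prod_relevant + est_null_relevant
    if total == 0:
        raise ValueError("no relevant documents found in either sample")
    return est_prod_relevant / total


if __name__ == "__main__":
    # Hypothetical figures: a 10,000-document production and a
    # 90,000-document null set, sampled at different rates.
    recall = stratified_recall(
        prod_size=10_000, prod_sampled=200, prod_relevant=150,
        null_size=90_000, null_sampled=400, null_relevant=8,
    )
    print(f"Estimated recall: {recall:.3f}")
```

The sampling rates of the two strata need not be equal, which is where the opportunity to reduce review cost arises.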
Read the rest of this entry »

Does automatic text classification work for low-prevalence topics?

January 26th, 2013

Readers of Ralph Losey's blog will know that he is an advocate of what he calls "multimodal" search in e-discovery; that is, a search methodology that involves a mix of human-directed keyword search, human-machine blended concept search, and machine-directed text classification (or predictive coding, in the e-discovery jargon). Meanwhile, he deprecates the alternative model of machine-driven review, in which only text classification technology is employed, and the human's sole function is to code machine-selected documents as responsive or non-responsive. The difference between the two modes can be seen most clearly in the creation of the seed set (that is, the initial training set created to bootstrap the text classification process). In multimodal review, the seed set might be taken from (a sample of) the results of active search by the human reviewer; in machine-driven review, the seed set is formed by a random sample from the collection.
Read the rest of this entry »

Why confidence intervals in e-discovery validation?

December 9th, 2012

A question that often comes up when discussing e-discovery validation protocols is, why should they be based on confidence intervals, rather than point estimates? That is, why do we say, for instance, "the production will be accepted if we have 95% confidence that its recall is greater than 60%"? Why not just say "the production will be accepted if its estimated recall is 75%"? Indeed, there have been some recent protocols that take the latter, point-estimate approach. (The ESI protocol of Global Aerospace, Inc. v. Landow Aviation, for instance, states simply that 75% recall shall be the "acceptable recall criterion", without specifying anything about confidence levels.) Answering this question requires some reflection on what we are trying to achieve with a validation protocol in the first place.
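The difference between the two kinds of rule can be sketched in code. The example below assumes a simplified setting in which we hold a simple random sample of n relevant documents and k of them were produced, and it uses an exact (Clopper-Pearson) one-sided lower bound found by bisection. The function names and the sample figures are illustrative assumptions, not taken from any protocol.

```python
from math import comb


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def recall_lower_bound(k, n, confidence=0.95):
    """Exact one-sided lower confidence bound on recall.

    Finds the smallest recall p under which observing k or more
    produced documents among n sampled relevant ones is still
    plausible, i.e. P(X >= k | p) = 1 - confidence.
    """
    if k == 0:
        return 0.0
    alpha = 1 - confidence
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection; the tail probability rises with p
        mid = (lo + hi) / 2
        if upper_tail(k, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo


def accept(k, n, threshold=0.60, confidence=0.95):
    """Accept if we are `confidence` sure that recall exceeds `threshold`."""
    return recall_lower_bound(k, n, confidence) > threshold


if __name__ == "__main__":
    # Both samples have the same 75% point estimate of recall.
    for k, n in ((15, 20), (75, 100)):
        print(f"{k}/{n}: point estimate {k / n:.2f}, "
              f"lower bound {recall_lower_bound(k, n):.3f}, "
              f"accepted: {accept(k, n)}")
```

A point-estimate rule treats the two samples identically, whereas the confidence-based rule also takes into account how much evidence stands behind the estimate.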
Read the rest of this entry »

The environmental consequences of SIGIR

December 4th, 2012

As it is becoming apparent that, without drastic immediate action, we are going to significantly overshoot greenhouse gas emission targets and warm the planet by an environmentally disastrous 4 to 5 degrees centigrade by the end of the century, I thought I should fulfil my long-standing promise to myself and calculate the carbon emissions generated by the annual SIGIR conference. I'm only going to consider here the emissions caused by air travel; but this is likely to be the overwhelming majority of total emissions.
Read the rest of this entry »