Archive for the ‘Uncategorized’ Category

Annotator error and predictive reliability

Friday, December 27th, 2013

There has been some interesting recent research on the effect of using unreliable annotators to train a text classification or predictive coding system. Why would you want to do such a thing? Well, the unreliable annotators may be much cheaper than a reliable expert, and by paying for a few more annotations, you might be able to achieve equivalent effectiveness and still come out ahead, budget-wise. Moreover, even the experts are not entirely consistent, and we'd like to know what the effect of these inconsistencies might be.

Repeated testing does not necessarily invalidate stopping decision

Tuesday, November 19th, 2013

Thinking recently about the question of sequential testing bias in e-discovery, I've realized an important qualification to my previous post on the topic. While repeatedly testing an iteratively trained classifier against a target threshold will lead to optimistic bias in the final estimate of effectiveness, it does not necessarily lead to an optimistic bias in the stopping decision.

Sample-based estimation of depth for recall

Wednesday, November 6th, 2013

In my previous post, I advocated the use of depth for recall as a classifier effectiveness metric in e-discovery, as it directly measures the review cost of proceeding to production with the current classifier. If we know where all the responsive documents are in the ranking, then calculating depth for Z recall is straightforward: it is simply the position of the Z'th responsive document in the responsive ranking. In practice, however, we don't know the responsive documents in advance (if we did, there'd be no need for the predictive review). Instead, depth for recall must be estimated.
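As a rough illustration of the known-labels case described above, here is a minimal Python sketch. It assumes Z is given as a recall fraction (so the document sought is the ceil(Z x R)'th responsive one, where R is the total number of responsive documents); the function name `depth_for_recall` and the input format are my own, for illustration only.

```python
import math


def depth_for_recall(ranking_labels, target_recall):
    """Return the 1-based rank at which target_recall is first reached.

    ranking_labels: list of booleans in ranked order (True = responsive).
    target_recall:  recall fraction in (0, 1] (an assumed reading of "Z").
    """
    if not 0 < target_recall <= 1:
        raise ValueError("target_recall must be in (0, 1]")
    total_responsive = sum(ranking_labels)
    if total_responsive == 0:
        raise ValueError("ranking contains no responsive documents")
    # Number of responsive documents that must be seen to reach the target.
    # The small epsilon guards against floating-point overshoot before ceil.
    needed = math.ceil(target_recall * total_responsive - 1e-9)
    found = 0
    for depth, is_responsive in enumerate(ranking_labels, start=1):
        found += bool(is_responsive)
        if found >= needed:
            return depth


# Toy ranking: responsive documents sit at ranks 1, 3, 4 and 7.
labels = [True, False, True, True, False, False, True, False]
depth_at_half = depth_for_recall(labels, 0.5)
depth_at_full = depth_for_recall(labels, 1.0)
```

In this toy ranking, reaching full recall means reviewing down to the last responsive document, which is the review cost the metric is meant to expose.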

Total annotation cost should guide automated review

Monday, October 28th, 2013

One of the most difficult challenges for the manager of an automated e-discovery review is knowing when enough is enough; when it is time to stop training the classifier, and start reviewing the documents it predicts to be responsive.

Unfortunately, the guidance the review manager receives from their system providers is not always as helpful as it could be. After each iteration of training, the manager may be shown a graph of effectiveness, like so:

Relevance density affects assessor judgment

Wednesday, September 11th, 2013

It is somewhat surprising to me that, having gone to the University of Maryland with the intention of working primarily on the question of assessor variability in relevance judgment, I did in fact end up working (or at least publishing) primarily on the question of assessor variability in relevance judgment. The last of these publications, "The Effect of Threshold Priming and Need for Cognition" (Scholer, Kelly, Wu, Lee, and Webber, SIGIR 2013), was in some ways the most satisfying, for the opportunity to collaborate with Falk Scholer and Diane Kelly (both luminaries in this field), and for the careful experimental design and analysis involved.

Measuring incremental cost-to-production in predictive coding

Wednesday, August 14th, 2013

I had the opportunity on Monday of giving a talk on processes for predictive coding in e-discovery to the Victorian Society for Computers and the Law. The key novel suggestion of my talk was that the effectiveness of the iteratively-trained classifier should be measured not (only) by abstract metrics of effectiveness such as F score, but (also) directly by the cost / benefit tradeoff facing the production manager. In particular, I advocated a new ranking metric, depth for recall.

Change of career, change of name

Tuesday, August 13th, 2013

This blog has followed my own research interests in becoming increasingly focused upon evaluation and technology questions in e-discovery, rather than in information retrieval more generally. Now my own career has followed my interests out the ivy-clad gates of academia and into private consulting in e-discovery. In recognition of these changes, I've also changed the name of my blog, from "IREvalEtAl" to "Evaluating E-Discovery". There will be some other cosmetic changes to follow, but (for now at least) we're at the same URL and on the same RSS feed.

The bias of sequential testing in predictive coding

Tuesday, June 25th, 2013

Text classifiers (or predictive coders) are in general trained iteratively, with training data added until acceptable effectiveness is achieved. Some method of measuring or estimating effectiveness is required, or, more precisely, of predicting effectiveness on the remainder of the collection. The simplest way of measuring classifier effectiveness is with a random sample of documents, from which a point estimate and confidence bounds on the selected effectiveness metric (such as recall) are calculated. It is essential, by the way, to make sure that these test documents are not also used to train the classifier, as testing on the training examples greatly exaggerates classifier effectiveness. (Techniques such as cross-validation simulate testing from training examples, but we'll leave them for a later discussion.)
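To make the "point estimate and confidence bounds" step concrete, here is a minimal Python sketch for recall. It assumes a simple random test sample in which `responsive_in_sample` documents were judged responsive and `retrieved_responsive` of those were also predicted responsive by the classifier. The Wilson score interval is my choice for the illustration, not necessarily the interval the post goes on to use, and all names are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% two-sided interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    ) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Illustrative numbers only: 100 responsive documents in the test sample,
# 80 of which the classifier also predicted to be responsive.
responsive_in_sample = 100
retrieved_responsive = 80

recall_estimate = retrieved_responsive / responsive_in_sample
recall_low, recall_high = wilson_interval(retrieved_responsive, responsive_in_sample)
```

Note that the interval is computed over the responsive documents in the sample only, so its width depends on how many responsive documents the sample happens to contain, not on the overall sample size.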

Non-authoritative relevance coding degrades classifier accuracy

Friday, June 21st, 2013

There has been considerable attention paid to the high level of disagreement between assessors on the relevance of documents, not least on this blog. This level of disagreement has been cited to argue in favour of the use of automated text analytics (or predictive coding) in e-discovery: not only do humans make mistakes, but they may make as many as or more than automated systems. But automated systems are only as good as the data used to train them, and production managers have an important choice to make in generating this training data. Should training annotations be performed by an expert, but expensive, senior attorney? Or can they be farmed out to the less expensive, but possibly less reliable, contract attorneys typically used for manual review? This choice comes down to a trade-off between cost and reliability, though ultimately reliability itself can be (at least partly) reduced to cost, too. The cost question still needs to be addressed; but Jeremy Pickens (of Catalyst) and I have made a start on the question of reliability in our recent SIGIR paper, Assessor Disagreement and Text Classifier Accuracy.

Why 95% +/- 2% makes little sense for e-discovery certification

Saturday, May 25th, 2013

It is common in e-discovery protocols to see a requirement that the production be certified with a "95% +/- X%" sample (where "X%" takes on values such as "2%" or "5%"), leading to a required sample size being specified up front. (See, for instance, the ESI protocol that was recently debated in the ongoing Da Silva Moore case.) This approach, however, makes little sense, for two reasons. First, it specifies an accuracy in our measure, when what we want to specify is some minimal level of performance. And second, decisions about sample size and allocation should be delayed until after the (candidate) production is ready, when they can be made much more efficiently and effectively.
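For context, the up-front sample size implied by a "95% +/- X%" requirement is usually derived from the normal approximation to the binomial at the worst-case proportion of 0.5. The Python sketch below shows that standard calculation only; the function name is hypothetical, and this is the fixed-size approach the post argues against, not a recommended procedure.

```python
import math


def fixed_margin_sample_size(margin, z=1.96, proportion=0.5):
    """Sample size for a +/- margin interval under the normal approximation.

    z=1.96 gives roughly 95% confidence; proportion=0.5 is the worst case
    (it maximises the variance term and hence the required sample size).
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be in (0, 1)")
    raw = z * z * proportion * (1 - proportion) / (margin * margin)
    # Subtract a tiny epsilon so floating-point noise does not round up
    # a value that is mathematically a whole number.
    return math.ceil(raw - 1e-9)


n_two_percent = fixed_margin_sample_size(0.02)   # 95% +/- 2%
n_five_percent = fixed_margin_sample_size(0.05)  # 95% +/- 5%
```

The calculation takes only the confidence level and the margin as inputs; nothing about the candidate production or the required level of performance enters into it.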