Random vs active selection of training examples in e-discovery

July 17th, 2014

The problem with agreeing to teach is that you have less time for blogging, and the problem with a hiatus in blogging is that the topic you were in the middle of discussing gets overtaken by questions of more immediate interest. I hope to return to the question of simulating assessor error in a later post, but first I want to talk about an issue that is attracting attention at the moment: how to select documents for training a predictive coding system.
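
To make the two selection strategies concrete, here is a minimal sketch. It assumes the predictive coding system exposes a probability of responsiveness for each unreviewed document (the `scores` dictionary and document ids are hypothetical). Random selection ignores the scores. The two "active" variants shown, uncertainty sampling and picking the highest-scoring documents, are common textbook forms of active selection, not necessarily what any particular product does.

```python
import random

def select_random(doc_ids, k, seed=0):
    """Random selection: k training documents drawn uniformly, without replacement."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, k)

def select_uncertain(scores, k):
    """Active selection by uncertainty sampling: the k documents whose
    predicted probability of responsiveness is closest to 0.5."""
    return sorted(scores, key=lambda doc: abs(scores[doc] - 0.5))[:k]

def select_most_likely(scores, k):
    """Active selection by relevance: the k highest-scoring documents."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical classifier scores for five unreviewed documents.
scores = {"d1": 0.95, "d2": 0.52, "d3": 0.10, "d4": 0.45, "d5": 0.70}
print(select_random(list(scores), 2))
print(select_uncertain(scores, 2))    # ['d2', 'd4']
print(select_most_likely(scores, 2))  # ['d1', 'd5']
```

Uncertainty sampling spends annotation effort where the current model is least sure; relevance selection front-loads likely responsive documents. Random selection, by contrast, yields a training set whose composition mirrors the collection.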

Can you train a useful model with incorrect labels?

February 25th, 2014

On this blog, we are in the middle of a series of simulation experiments on the effect of assessor error on text classifier reliability. There's still some way to go with these experiments, but in the meantime the topic has attracted some attention in the blogosphere. Ralph Losey has forcefully reiterated his characterization of using non-experts to train a predictive coding system as "garbage in, garbage out", a position with which he regards Jeremy Pickens and me as disagreeing. Jeremy Pickens, meanwhile, has responded by citing Catalyst experiments on TREC data that show (remarkably) that a model trained even entirely with incorrect labels can be almost as useful as one trained by an expert.

Assessor error and term model weights

January 3rd, 2014

In my last post, we saw that randomly swapping training labels, in a (simplistic) simulation of the effect of assessor error, leads as expected to a decline in classifier accuracy, with the decline being greater for lower prevalence topics (in part, we surmised, because of the primitive way we were simulating assessor errors). In this post, I thought it would be interesting to look inside the machine learner, and try to understand in more detail what effect the erroneous training data has. As we'll see, we learn something about how the classifier works by doing so, but end up with some initially surprising findings about the effect of assessor error on the classifier's model.
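
A minimal sketch of the kind of simplistic error model described above, assuming each binary training label is swapped independently with a fixed probability regardless of its true value (the function names are hypothetical). The second function gives one way to see why uniform swapping is harsh on low-prevalence topics: it computes the expected fraction of documents labelled responsive after swapping that really are responsive.

```python
import random

def flip_labels(labels, error_rate, seed=0):
    """Simplistic assessor-error model: swap each binary label independently
    with probability error_rate, regardless of its true value."""
    rng = random.Random(seed)
    return [1 - y if rng.random() < error_rate else y for y in labels]

def positive_label_purity(prevalence, error_rate):
    """Expected fraction of documents labelled responsive after flipping that
    are truly responsive (prevalence and error_rate in [0, 1])."""
    true_pos = prevalence * (1 - error_rate)
    false_pos = (1 - prevalence) * error_rate
    return true_pos / (true_pos + false_pos)

print(round(positive_label_purity(0.50, 0.10), 3))  # 0.9
print(round(positive_label_purity(0.02, 0.10), 3))  # 0.155
```

With a 10% swap rate, a topic at 50% prevalence keeps 90% of its positive training labels correct, but at 2% prevalence the swapped-in negatives outnumber the genuine positives by more than five to one.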

Annotator error and predictive reliability

December 27th, 2013

There has been some interesting recent research on the effect of using unreliable annotators to train a text classification or predictive coding system. Why would you want to do such a thing? Well, the unreliable annotators may be much cheaper than a reliable expert, and by paying for a few more annotations, you might be able to achieve equivalent effectiveness and still come out ahead, budget-wise. Moreover, even the experts are not entirely consistent, and we'd like to know what the effect of these inconsistencies might be.
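
One way extra annotations can buy back reliability is redundant labelling with a majority vote. The sketch below assumes annotators err independently and each is correct with the same probability `p`. Both assumptions are simplifications (real annotator errors are often correlated), and majority voting is just one illustrative strategy, not necessarily the one the research above studies.

```python
from math import comb

def majority_vote_accuracy(p, k):
    """Probability that a strict majority of k independent annotators, each
    correct with probability p, assigns the correct label."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number to avoid ties")
    return sum(comb(k, j) * p**j * (1 - p) ** (k - j)
               for j in range(k // 2 + 1, k + 1))

for k in (1, 3, 5):
    print(k, round(majority_vote_accuracy(0.7, k), 3))
# 1 0.7
# 3 0.784
# 5 0.837
```

Whether three cheap labels beat one expert label then depends on the relative prices and on how accurate the expert actually is.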

Repeated testing does not necessarily invalidate stopping decision

November 19th, 2013

Thinking recently about the question of sequential testing bias in e-discovery, I've realized an important qualification to my previous post on the topic. While repeatedly testing an iteratively trained classifier against a target threshold will lead to optimistic bias in the final estimate of effectiveness, it does not necessarily lead to an optimistic bias in the stopping decision.
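
The optimistic bias in the final estimate can be illustrated with a small simulation. All the parameters are hypothetical assumptions: true recall improves by a fixed step each training iteration, each iteration's recall is estimated from a fresh independent sample of responsive documents, and training stops at the first iteration whose estimate meets the threshold.

```python
import random

def sequential_testing_bias(trials=2000, sample_size=100, threshold=0.75,
                            start=0.55, step=0.02, cap=0.95, max_iters=100, seed=1):
    """Average (estimated - true) recall at the iteration where a fresh
    sample-based estimate first meets the threshold."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        for i in range(max_iters):
            true_recall = min(start + step * i, cap)  # classifier improves each iteration
            # Each sampled responsive document is found with probability true_recall.
            hits = sum(rng.random() < true_recall for _ in range(sample_size))
            estimate = hits / sample_size
            if estimate >= threshold:  # stop at the first passing test
                gaps.append(estimate - true_recall)
                break
    return sum(gaps) / len(gaps)

print(sequential_testing_bias())  # positive: the reported estimate is optimistic
```

The bias arises because stopping selects for samples that happened to overshoot. This measures only the bias in the reported estimate; it does not by itself show that the stopping decision is biased, which is the distinction drawn above.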

Sample-based estimation of depth for recall

November 6th, 2013

In my previous post, I advocated the use of depth for recall as a classifier effectiveness metric in e-discovery, as it directly measures the review cost of proceeding to production with the current classifier. If we know where all the responsive documents are in the ranking, then calculating depth for Z recall is straightforward: it is simply the position in the ranking of the responsive document at which a proportion Z of all responsive documents has been seen. In practice, however, we don't know the responsive documents in advance (if we did, there'd be no need for predictive review). Instead, depth for recall must be estimated.
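
A minimal sketch of both cases, with hypothetical function names. The first function computes depth for recall exactly when every document's label is known. The second is a naive plug-in point estimate from a simple random sample of (rank, label) pairs. Because depth for recall Z is the Z-quantile of the responsive documents' rank positions, it takes the same quantile among the sampled responsive documents. It gives no confidence interval and is not necessarily the estimator the full post develops.

```python
import math

def depth_for_recall(labels, z):
    """Exact depth for recall z from responsiveness labels (1/0) listed in rank
    order: the rank at which a proportion z of all responsive documents is found."""
    if not 0 < z <= 1:
        raise ValueError("z must be in (0, 1]")
    total = sum(labels)
    if total == 0:
        raise ValueError("ranking contains no responsive documents")
    needed = max(1, math.ceil(round(z * total, 9)))  # round() absorbs float noise, e.g. 0.7 * 10
    seen = 0
    for rank, label in enumerate(labels, start=1):
        seen += label
        if seen >= needed:
            return rank

def estimate_depth_for_recall(sample, z):
    """Naive plug-in estimate from a simple random sample of (rank, label) pairs:
    the z-quantile of the ranks of the sampled responsive documents."""
    if not 0 < z <= 1:
        raise ValueError("z must be in (0, 1]")
    ranks = sorted(rank for rank, label in sample if label == 1)
    if not ranks:
        raise ValueError("sample contains no responsive documents")
    k = max(1, math.ceil(round(z * len(ranks), 9)))
    return ranks[k - 1]

print(depth_for_recall([1, 0, 1, 0, 0, 1, 0, 1], 0.75))  # 6
sample = [(10, 1), (50, 1), (120, 0), (300, 1), (600, 0), (900, 1)]
print(estimate_depth_for_recall(sample, 0.75))  # 300
```

In the exact example, four documents are responsive, so 75% recall needs the third of them, which sits at rank 6.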

Total annotation cost should guide automated review

October 28th, 2013

One of the most difficult challenges for the manager of an automated e-discovery review is knowing when enough is enough: when it is time to stop training the classifier and start reviewing the documents it predicts to be responsive.
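
As an illustration of letting total annotation cost guide that decision, here is a minimal sketch. It assumes that at each training iteration we know the cumulative number of training annotations and an estimate of the review depth needed to reach the recall target, and that both kinds of annotation have a known per-document cost. The function name, weights, and figures are all hypothetical.

```python
def best_stopping_iteration(training_counts, review_depths,
                            training_cost=1.0, review_cost=1.0):
    """Pick the iteration minimising total annotation cost: documents annotated
    for training so far, plus documents that would have to be reviewed (the
    depth for the target recall) if training stopped at that iteration."""
    costs = [training_cost * t + review_cost * d
             for t, d in zip(training_counts, review_depths)]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs[best]

# Made-up figures: cumulative training documents and estimated review depth.
training = [500, 1000, 1500, 2000]
depths = [40000, 30000, 28000, 27800]
print(best_stopping_iteration(training, depths))  # (2, 29500.0)
```

Effectiveness still improves at the last iteration, but it saves only 200 documents of review at the price of 500 more training annotations, so stopping at index 2 is cheaper overall.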

Unfortunately, the guidance the review manager receives from their system providers is not always as helpful as it could be. After each iteration of training, the manager may be shown a graph of effectiveness, like so:

Relevance density affects assessor judgment

September 11th, 2013

It is somewhat surprising to me that, having gone to the University of Maryland with the intention of working primarily on the question of assessor variability in relevance judgment, I did in fact end up working (or at least publishing) primarily on the question of assessor variability in relevance judgment. The last of these publications, "The Effect of Threshold Priming and Need for Cognition" (Scholer, Kelly, Wu, Lee, and Webber, SIGIR 2013), was in some ways the most satisfying, for the opportunity to collaborate with Falk Scholer and Diane Kelly (both luminaries in this field), and for the careful experimental design and analysis involved.

Measuring incremental cost-to-production in predictive coding

August 14th, 2013

I had the opportunity on Monday to give a talk on processes for predictive coding in e-discovery to the Victorian Society for Computers and the Law. The key novel suggestion of my talk was that the effectiveness of the iteratively trained classifier should be measured not (only) by abstract metrics such as F score, but (also) directly by the cost/benefit tradeoff facing the production manager. In particular, I advocated a new ranking metric, depth for recall.

Change of career, change of name

August 13th, 2013

This blog has followed my own research interests in becoming increasingly focused on evaluation and technology questions in e-discovery, rather than on information retrieval more generally. Now my career has followed my interests out of the ivy-clad gates of academia and into private consulting in e-discovery. In recognition of these changes, I've also changed the name of my blog from "IREvalEtAl" to "Evaluating E-Discovery". There will be some other cosmetic changes to follow, but (for now at least) we're at the same URL and on the same RSS feed.