Total annotation cost should guide automated review

One of the most difficult challenges for the manager of an automated e-discovery review is knowing when enough is enough: when it is time to stop training the classifier and start reviewing the documents it predicts to be responsive.

Unfortunately, the guidance the review manager receives from their system providers is not always as helpful as it could be. After each iteration of training, the manager may be shown a graph of effectiveness, like so:

F1 learning curve, for 200,000 documents sampled from RCV1-v2 topic GCRIM. Classifier is Vowpal Wabbit 6.1, with logistic loss function and 10 learning passes. Training documents selected by simple random sampling.

and advised to keep training until the classifier "stops improving".
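For concreteness, this is roughly how such a curve is generated: train on progressively larger random samples of annotated documents, and measure F1 against a held-out set after each step. A minimal sketch follows, with scikit-learn's logistic regression standing in for Vowpal Wabbit; the data, feature matrices, and sample sizes are all assumed, not taken from the experiment in the figure.

```python
# Sketch of an F1 learning curve over increasingly large random training samples.
# scikit-learn stands in for Vowpal Wabbit; inputs are assumed/illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def f1_learning_curve(X_train, y_train, X_test, y_test, train_sizes, seed=0):
    """F1 on a held-out set after training on increasingly large random samples."""
    rng = np.random.default_rng(seed)
    scores = []
    for n in train_sizes:
        idx = rng.choice(len(y_train), size=n, replace=False)  # simple random sample
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train[idx], y_train[idx])
        scores.append(f1_score(y_test, clf.predict(X_test)))
    return scores
```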

A general measure of classifier effectiveness like the F measure, however, does not answer the key question the review manager has: If I stop training now, how much review effort will be required to achieve a reasonably and proportionately complete production?

A measure of classifier effectiveness that directly answers this question is how much of the predictive ranking must be reviewed to achieve a certain recall level, Z. I refer to this metric as depth for Z recall.
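As a sketch of what this metric computes, the following function (the name and inputs are illustrative, not a standard API) takes relevance labels in the order of the classifier's predictive ranking and returns the shallowest review depth that achieves recall Z:

```python
# Sketch of "depth for Z recall": how deep must review go down the predictive
# ranking (most likely responsive first) before recall reaches Z?
import numpy as np

def depth_for_recall(ranked_labels, z=0.8):
    """Shallowest review depth achieving recall >= z over the ranked collection."""
    labels = np.asarray(ranked_labels, dtype=bool)
    total_relevant = labels.sum()
    if total_relevant == 0:
        return 0
    cumulative_recall = np.cumsum(labels) / total_relevant
    return int(np.argmax(cumulative_recall >= z)) + 1  # 1-based depth
```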

For the same run as shown in the F measure curve above, the depth for 80% recall curve looks like so:

Depth for 80% recall. Other details are as for previous figure.

So, after 1,000 training documents have been annotated, the top 43,000 documents in the 200,000-document predictive ranking must be reviewed to achieve 80% recall; going to 4,000 training documents reduces the review task to 17,000 documents; while training on the next 6,000 documents only reduces the review task by 4,500 documents.[*]

Exactly what the most cost-effective stopping point is depends upon the relative cost of training (frequently done by expensive subject matter experts) and review (generally performed by cheaper contract attorneys). The more expensive training is relative to review, the sooner the review manager will want to switch from one to the other.

We can measure this cost equation directly[+]. Say that training costs $10 per document (an SME charging $600 an hour and reviewing a document a minute), while review costs $1 per document (a contractor charging $60 an hour and working at the same rate). Then the total annotation cost (training plus review) for our example training run looks like this:

Total cost, assuming $10 training and $1 review per document. Other details are as for previous figure.

Total annotation cost is minimized at around $55,000 if training is stopped somewhere from 1,000 to 4,000 documents. Going beyond 4,000 training documents starts to push the total annotation cost back up, even though the F curve shows that the classifier is still improving.
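The cost calculation itself is trivial; here is a sketch, using the per-document rates assumed above and the depths quoted from the earlier figure:

```python
# Total annotation cost = training annotations + review to the required depth,
# under the assumed rates of $10 per training document and $1 per reviewed document.
def total_annotation_cost(n_training_docs, review_depth,
                          training_cost=10.0, review_cost=1.0):
    """Cost of annotating the training set plus reviewing to the required depth."""
    return n_training_docs * training_cost + review_depth * review_cost

# Using the depths quoted above: 1,000 training documents with a review depth
# of 43,000, versus 4,000 training documents with a depth of 17,000.
print(total_annotation_cost(1_000, 43_000))   # 53000.0
print(total_annotation_cost(4_000, 17_000))   # 57000.0
```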

Of course, the review manager only sees the above curve for the training iterations performed to date:

Animated version of previous figure. 200 training documents per step up to 5,400, then every 1,000 from 6,000 to 10,000.

And from this cost curve, seen only up to the current training iteration, the manager must answer another question, on which the stopping decision fundamentally rests: will further training reduce overall cost? Giving a mathematically precise answer to this question is difficult, but a simple visual rule of thumb (informed by the system trainer's experience and other evidence from the review process) is that, if total cost seems to be consistently climbing, it is time to stop.
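One illustrative way to encode that rule of thumb is sketched below; the window of three consecutive increases is an arbitrary assumption on my part, not a recommendation, and in practice the choice would be informed by the trainer's experience and the other evidence mentioned above.

```python
# Illustrative "consistently climbing" check: stop once total annotation cost
# has risen over each of the last k training iterations. The default k=3 is
# an assumed value for the sketch, not a recommended setting.
def cost_is_consistently_climbing(total_costs, k=3):
    """True if total cost has increased over each of the last k iterations."""
    if len(total_costs) < k + 1:
        return False
    recent = total_costs[-(k + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```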

A trickier issue is that in practice we don't know which are the responsive documents in advance (if we did, we wouldn't be paying for a review), and so we don't know the true shape of any of the above learning or cost curves. The best we can do is to estimate these curves, based upon a sample. How we can do that will be the subject of my next post.


[*] To keep the discussion simple, we're eliding the fact that responsive documents added to the training set also contribute to recall. We're also eliding other complications of a real production, such as rolling collections, interleaving of training and test, assessor inconsistencies, and so forth.

[+] The idea of minimizing total review costs comes from joint work with Dave Lewis, Doug Oard, and Mossaab Bagdouri. Exploiting the differential costs of expert and non-expert review originates from Jeremy Pickens of Catalyst, and is introduced in our SIGIR 2013 paper, Assessor Disagreement and Text Classifier Accuracy; this paper also marks my first use of depth for recall as a metric of user effort. A slightly different approach to total annotation cost, based on training to a target F score and then allocating further annotations to a validation test set, is presented in the forthcoming CIKM paper by Mossaab Bagdouri, myself, Dave Lewis, and Doug Oard, entitled Towards Minimizing the Annotation Cost of Certified Text Classification.
