Why training and review (partly) break control sets

October 20th, 2014

A technology-assisted review (TAR) process frequently begins with the creation of a control set---a set of documents randomly sampled from the collection, and coded by a human expert for relevance. The control set can then be used to estimate the richness (proportion relevant) of the collection, and also to gauge the effectiveness of a predictive coding (PC) system as training is undertaken. We might also want to use the control set to estimate the completeness of the TAR process as a whole. However, we may run into problems if we attempt to do so.
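
To make the first of these uses concrete, here is a minimal sketch (in Python, with invented counts) of estimating collection richness from a coded control set, using a simple normal-approximation confidence interval:

```python
# Richness estimate from a control set: a minimal sketch with invented counts.
import math

def richness_estimate(n_sampled, n_relevant, z=1.96):
    """Point estimate and approximate 95% confidence interval for richness."""
    p = n_relevant / n_sampled
    se = math.sqrt(p * (1 - p) / n_sampled)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# e.g. 38 relevant documents found in a 2,000-document control set:
# point estimate 1.9%, interval roughly 1.3% to 2.5%
print(richness_estimate(2_000, 38))
```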

The reason the control set can be used to estimate the effectiveness of the PC system on the collection is that it is a random sample of that collection. As training proceeds, however, the relevance of some of the documents in the collection will become known through human assessment---even more so if review begins before training is complete (as is often the case). Direct measures of process effectiveness on the control set will fail to take account of the relevant and irrelevant documents already found through human assessment.

A naïve solution to this problem is to exclude the already-reviewed documents from the collection; to use the control set to estimate effectiveness only on the remaining documents (the remnant); and then to combine estimated remnant effectiveness with what has been found by manual means. This approach, however, is incorrect: as documents are non-randomly removed from the collection, the control set ceases to be randomly representative of the remnant. In particular, if training (through active learning) or review is prioritized towards easily-found relevant documents, then easily-found relevant documents will become rare in the remnant; the control set will overstate effectiveness on the remnant, and hence will overstate the recall of the TAR process overall. (If training has been performed purely by random sampling, though, and no other review has been undertaken, then this approach is roughly accurate.)
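
The effect is easy to see in simulation. The sketch below (using numpy, with invented collection parameters and score distributions, not drawn from any real matter) draws a control set before any review, removes the top-scoring documents as "reviewed", and then compares the full control set's estimate of recall at a candidate cutoff against the true recall on the remnant; in this toy setup, the control-set figure comes out noticeably higher than the truth.

```python
# A minimal simulation sketch of how a control set drawn before prioritised
# review overstates effectiveness on the remnant.  All figures are invented.
import numpy as np

rng = np.random.default_rng(0)

N, richness = 100_000, 0.05                    # collection size, true richness
rel = rng.random(N) < richness                 # true relevance labels
# Classifier scores: relevant documents tend to score higher than irrelevant ones.
score = np.where(rel, rng.normal(1.0, 1.0, N), rng.normal(0.0, 1.0, N))

# Control set: a simple random sample, drawn and coded before any review.
ctrl = rng.choice(N, size=2_000, replace=False)

# Prioritised training/review removes the 10,000 top-scoring documents.
in_remnant = np.ones(N, dtype=bool)
in_remnant[np.argsort(-score)[:10_000]] = False

cutoff = 0.5                                   # candidate production cutoff score

# True recall on the remnant at this cutoff.
remnant_rel = rel & in_remnant
true_recall = (remnant_rel & (score >= cutoff)).sum() / remnant_rel.sum()

# Naive estimate: treat the full control set as if it still represented the remnant.
ctrl_rel, ctrl_score = rel[ctrl], score[ctrl]
est_recall = (ctrl_rel & (ctrl_score >= cutoff)).sum() / ctrl_rel.sum()

print(f"control-set estimate of recall at cutoff: {est_recall:.2f}")
print(f"true recall on the remnant at cutoff:     {true_recall:.2f}")
```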

To restore the representativeness of the control set, we need to remove from it documents equivalent to those that have been removed from the collection by review. There is, however, no general way of doing this: we can't in general say which of the control set documents would have been reviewed, had they been part of the collection. There are, it is true, particular cases in which this removal rule can be determined. For instance, if a set of documents were selected for training or review by a keyword query, then the same query should be used to exclude documents from the control set. Even if all document selections were made by such explicit rules, however, keeping track of all these rules over the course of a TAR process quickly becomes infeasible.

One case in which control set exclusion can be performed is the prospective one, based upon a review cutoff decision. If we undertake to review all documents with a certain relevance score or higher (and if all the documents already reviewed are above that relevance score), then the part of the control set with relevance scores below this threshold could be used to estimate the number of relevant documents that would be excluded from the review. We cannot, however, form a statistically valid estimate of the number of relevant documents above this threshold, unless we ignore the evidence of the above-cutoff documents that have already been reviewed. (To see this, consider that the estimated number of relevant documents above the threshold might be less than the number we had already seen in review.) We also have no principled way of handling any reviewed documents that happen to fall below the threshold, except to make worst-case assumptions (in particular, to exclude such documents in estimating recall).
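
As a concrete (and entirely invented) illustration of that below-cutoff estimate: each control set document stands in for collection-size divided by control-size documents, so the relevant control documents scoring under the proposed cutoff can simply be scaled up.

```python
# Estimating relevant documents left below a score cutoff: a sketch with invented figures.
def relevant_below_cutoff(control, collection_size, cutoff):
    """control: list of (score, is_relevant) pairs for the coded control set."""
    rel_below = sum(1 for score, is_rel in control if is_rel and score < cutoff)
    scale = collection_size / len(control)   # documents each control item stands for
    return rel_below * scale

# e.g. a 2,000-document control set from a 500,000-document collection, with
# 7 relevant control documents scoring below the proposed cutoff of 0.4:
control = ([(0.35, True)] * 7 + [(0.80, True)] * 30 +
           [(0.20, False)] * 1500 + [(0.90, False)] * 463)
print(relevant_below_cutoff(control, 500_000, cutoff=0.4))   # 7 * 250 = 1750 documents
```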

The above points do not mean that control sets are not useful in guiding decisions about a TAR process, including the important decisions of when to stop training and how much further review might be required. However, care must be taken in deriving guidance from control sets in the presence of training and review, and even then the resulting estimates should be recognized as not being statistically valid ones (that is, not strictly justified by sampling theory). In particular, practitioners should be wary of using control sets to certify the completeness of a production, not least because of the sequential testing bias inherent in repeatedly testing against the one control set, and because control set relevance judgments are made in the relative ignorance that prevails at the beginning of the TAR process. A separate certification sample should be preferred for making final assessments of production completeness.

Total assessment cost with different cost models

October 16th, 2014

In my previous post, I found that relevance and uncertainty selection needed similar numbers of document relevance assessments to achieve a given level of recall. I summarized this by saying the two methods had similar cost. The number of documents assessed, however, is only a very approximate measure of the cost of a review process, and richer cost models might lead to a different conclusion.

One distinction that is sometimes made is between the cost of training a document, and the cost of reviewing it. It is often assumed that training is performed by a subject-matter expert, whereas review is done by more junior reviewers. The subject-matter expert costs more than the junior reviewers---let's say, five times as much. Therefore, assessing a document for relevance during training will cost more than doing so during review.

With this differentiated cost model, we get the following average assessment costs (see my previous post for the interpretation of this table):

Richness           Random  Uncertainty   Relevance   Continuous
< 0.05%          19636.52     19010.29    19015.27     58589.09
0.05% -- 0.5%      211.56        50.72       52.14        61.69
0.5% -- 2%          23.65         7.34        9.64        11.68
2% -- 5%             5.04         3.76        6.55         8.50
5% -- 15%            2.22         1.88        4.34         6.54
> 15%                1.09         1.07        1.30         5.26

As one might expect, relevance selection (which aims to do much or all of the review effort during training) becomes more expensive than uncertainty selection. Moreover, whereas with a uniform cost model relevance selection was almost always best done in continuous mode (that is, all review is done by the trainer), it is frequently better with differentiated costs to stop relevance selection early and leave the tail of the documents for review.

This differentiated cost model assumes that (junior) reviewer decisions are final. In practice, however, that seems unlikely: if we trusted the cheaper junior reviewers so much, why not have them do the training as well? That suggests that a third, quality control (QC) step needs to be applied prior to production, giving us a three-step process of training, review, and production (or T-R-P). A simple form of pre-production QC would be for documents marked relevant by the junior reviewer to be sent for second-pass review by a more senior reviewer, perhaps the SME himself. In that case, we have three costs (sketched in code after the list):

  1. Training documents: viewed once by SME, 5x cost
  2. Irrelevant reviewed documents: viewed once by junior reviewer, 1x cost
  3. Relevant reviewed documents: viewed first by junior reviewer, then by SME, 6x cost
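
A minimal sketch of this accounting, with invented per-document costs and document counts (the simpler differentiated model behind the previous table is the same calculation without the second-pass term):

```python
# Total assessment cost under the three-step T-R-P protocol: a sketch with
# invented costs and document counts.
def trp_cost(n_train, n_review_irrel, n_review_rel,
             sme_cost=5.0, reviewer_cost=1.0):
    """Training by the SME; first-pass review by a junior reviewer;
    second-pass (pre-production QC) of judged-relevant documents by the SME."""
    training = n_train * sme_cost
    first_pass = (n_review_irrel + n_review_rel) * reviewer_cost
    second_pass = n_review_rel * sme_cost          # relevant reviewed docs cost 6x in total
    return training + first_pass + second_pass

# e.g. 2,000 training documents, then 30,000 first-pass reviewed documents,
# of which 12,000 are judged relevant and so are seen a second time by the SME:
print(trp_cost(2_000, 18_000, 12_000))   # 10,000 + 30,000 + 60,000 = 100,000
```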

Under this protocol, it is cheaper to handle a relevant document in training than in review (assuming that judgments made during training do not have to be checked during production). With this cost model, the total review costs of the different selection methods look like so:

Richness           Random  Uncertainty   Relevance   Continuous
< 0.05%          19641.38     19012.72    19018.25     58588.96
0.05% -- 0.5%      216.33        51.97       52.52        61.64
0.5% -- 2%          28.54        10.47       10.35        11.63
2% -- 5%             9.99         7.95        7.65         8.44
5% -- 15%            7.19         6.66        6.16         6.48
> 15%                6.08         6.05        5.17         5.18

Now, relevance selection is on average slightly cheaper than uncertainty selection, because there is less double-handling of responsive documents.

In the three-step T-R-P protocol, the goal of the R step (first-pass review) is to filter out non-responsive documents; that is, to increase precision. For the experiments described above, however, responsive documents make up the bulk of the assessment load: for topics with richness above 0.5%, precision averaged across the assessment process is around 65%. That seems to be substantially higher than what many vendors report for practical e-discovery cases; perhaps the RCV1v2 collection is unrepresentatively "easy" as a text classification target. Under these conditions, the R step has reduced value, as there are few irrelevant documents to filter out. If precision were lower (that is, if more non-responsive documents were seen in assessment), then the relative costs would be likely to change.

As noted above, however, the junior reviewers used in the R step will make errors. (The SME will make errors, too, but hopefully at a lower rate, and with more authority.) In particular, they will make errors of two types: false positives (judging irrelevant documents as relevant); and false negatives (judging relevant documents as irrelevant). The P-step approach described above, of only considering judged-relevant documents in second-pass review, will catch the false positives, but will miss the false negatives, lowering true recall. In practice, one would want to perform some form of QC of the judged-irrelevant documents, too; conversely, some proportion of the judged-relevant documents might be produced without QC.

The cost models presented in this post are simplistic, and other protocols are possible. Nevertheless, these results emphasize two points. First, conclusions about total assessment cost depend upon your cost model. But second, your cost model depends upon the details of the protocol you use (and other statistics of the assessment effort). Modeller beware!

Thanks to Rachi Messing, Jeremy Pickens, Gord Cormack, and Maura Grossman for discussions that helped inform this post. Naturally, the opinions expressed here are mine, not theirs.

Total review cost of training selection methods

September 27th, 2014

My previous post described in some detail the conditions of finite population annotation that apply to e-discovery. To summarize, what we care about (or at least should care about) is not maximizing classifier accuracy in itself, but minimizing the total cost of achieving a target level of recall. The predominant cost in the review stage is that of having human experts train the classifier, and of having human reviewers review the documents that the classifier predicts as responsive. Each relevant document found in training is one fewer that must be looked at in review. Therefore, training example selection methods such as relevance selection that prioritize relevant documents are likely to have a lower total cost than the abstract measure of classifier effectiveness might suggest.
Read the rest of this entry »

Finite population protocols and selection training methods

September 15th, 2014

In a previous post, I compared three methods of selecting training examples for predictive coding—random, uncertainty and relevance. The methods were compared on their efficiency in improving the accuracy of a text classifier; that is, the number of training documents required to achieve a certain level of accuracy (or, conversely, the level of accuracy achieved for a given number of training documents). The study found that uncertainty selection was consistently the most efficient, though there was no great difference between it and relevance selection on very low richness topics. Random sampling, in contrast, performed very poorly on low richness topics.

In e-discovery, however, classifier accuracy is not an end in itself (though many widely-used protocols treat it as such). What we care about, rather, is the total amount of effort required to achieve an acceptable level of recall; that is, to find some proportion of the relevant documents in the collection. (We also care about determining to our satisfaction, and demonstrating to others, that that level of recall has been achieved—but that is beyond the scope of the current post.) A more accurate classifier means a higher precision in the candidate production for a given level of recall (or, equivalently, a lesser cutoff depth in the predictive ranking), which in turn saves cost in post-predictive first-pass review. But training the classifier itself takes effort, and after some point, the incremental saving in review effort may be outweighed by the incremental cost of training.
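
To put rough numbers on that trade-off, here is a minimal sketch (invented figures; it ignores, among other simplifications, that relevant documents found during training themselves count towards recall): review depth at the target recall falls as precision rises, but every additional training document is itself an assessment.

```python
# Training versus review effort: a toy sketch with invented figures.
def total_assessments(n_train, n_relevant, target_recall, precision_at_recall):
    """Training assessments plus the review depth needed to reach the target recall."""
    review_depth = (target_recall * n_relevant) / precision_at_recall
    return n_train + review_depth

# e.g. 10,000 relevant documents in the collection, 80% target recall:
print(total_assessments(1_000, 10_000, 0.8, 0.40))   # 1,000 + 20,000 = 21,000
print(total_assessments(3_000, 10_000, 0.8, 0.65))   # 3,000 + ~12,300 = ~15,300
print(total_assessments(6_000, 10_000, 0.8, 0.68))   # 6,000 + ~11,800 = ~17,800
```

In this toy example, the middle setting is the cheapest: training beyond that point costs more in additional training assessments than it saves in reduced review depth.
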
Read the rest of this entry »

Research topics in e-discovery

August 8th, 2014

Dr. Dave Lewis is visiting us in Melbourne on a short sabbatical, and yesterday he gave an interesting talk at RMIT University on research topics in e-discovery. We also had Dr. Paul Hunter, Principal Research Scientist at FTI Consulting, in the audience, as well as research academics from RMIT and the University of Melbourne, including Professor Mark Sanderson and Professor Tim Baldwin. The discussion amongst attendees was almost as interesting as the talk itself, and a number of suggestions for fruitful research were raised, many with fairly direct relevance to application development. I thought I'd capture some of these topics here.
Read the rest of this entry »

Random vs active selection of training examples in e-discovery

July 17th, 2014

The problem with agreeing to teach is that you have less time for blogging, and the problem with a hiatus in blogging is that the topic you were in the middle of discussing gets overtaken by questions of more immediate interest. I hope to return to the question of simulating assessor error in a later post, but first I want to talk about an issue that is attracting attention at the moment: how to select documents for training a predictive coding system.
Read the rest of this entry »

Can you train a useful model with incorrect labels?

February 25th, 2014

On this blog, we are in the middle of a series of simulation experiments on the effect of assessor error on text classifier reliability. There's still some way to go with these experiments, but in the meantime the topic has attracted some attention in the blogosphere. Ralph Losey has forcefully reiterated his characterization of using non-experts to train a predictive coding system as garbage in, garbage out, a position with which he regards Jeremy Pickens and myself as disagreeing. Jeremy Pickens, meanwhile, has responded by citing Catalyst experiments on TREC data which show (remarkably) that even a model trained entirely with incorrect labels can be almost as useful as one trained by an expert.
Read the rest of this entry »

Assessor error and term model weights

January 3rd, 2014

In my last post, we saw that randomly swapping training labels, in a (simplistic) simulation of the effect of assessor error, leads as expected to a decline in classifier accuracy, with the decline being greater for lower prevalence topics (in part, we surmised, because of the primitive way we were simulating assessor errors). In this post, I thought it would be interesting to look inside the machine learner, and try to understand in more detail what effect the erroneous training data has. As we'll see, we learn something about how the classifier works by doing so, but end up with some initially surprising findings about the effect of assessor error on the classifier's model.
Read the rest of this entry »

Annotator error and predictive reliability

December 27th, 2013

There has been some interesting recent research on the effect of using unreliable annotators to train a text classification or predictive coding system. Why would you want to do such a thing? Well, the unreliable annotators may be much cheaper than a reliable expert, and by paying for a few more annotations, you might be able to achieve equivalent effectiveness and still come out ahead, budget-wise. Moreover, even the experts are not entirely consistent, and we'd like to know what the effect of these inconsistencies might be.
Read the rest of this entry »

Repeated testing does not necessarily invalidate stopping decision

November 19th, 2013

Thinking recently about the question of sequential testing bias in e-discovery, I've realized an important qualification to my previous post on the topic. While repeatedly testing an iteratively trained classifier against a target threshold will lead to optimistic bias in the final estimate of effectiveness, it does not necessarily lead to an optimistic bias in the stopping decision.
Read the rest of this entry »