My previous post described in some detail the conditions of finite population annotation that apply to e-discovery. To summarize, what we care about (or at least should care about) is not maximizing classifier accuracy in itself, but minimizing the total cost of achieving a target level of recall. The predominant cost in the review stage is that of having human experts train the classifier, and of having human reviewers review the documents that the classifier predicts as responsive. Each relevant document found in training is one fewer that must be looked at in review. Therefore, training example selection methods such as relevance selection that prioritize relevant documents are likely to have a lower total cost than the abstract measure of classifier effectiveness might suggest.
In this post, I am going to evaluate three selection methods—random, uncertainty, and relevance—by the total review cost required to achieve a target recall. For simplicity, I'll assume that every document costs the same amount to label, whether in training or review. I'll set the target recall level at 80%, and refer to the cost metric as cost for recall 80 (cfr80). (Cormack and Grossman refer to the same measure as "80% recall effort".) A run consists of a certain amount of training, followed by review to the depth that achieves 80% recall. The total cost of a run is the training set size plus the review depth. A sequence of (potential) training iterations defines a set of runs, each of which switches from training to review at a given iteration, and each of which has its own cost. The run with the lowest cost is the one with the optimal point of switching from training to review (that is, the optimal training set size). The costs of a set of runs can be plotted as a curve against training set size; the minimum cost is then the lowest point on this curve. One could also continue training until target recall was achieved in the training set alone, as in continuous training; this strategy is particularly attractive for the "relevance" selection method. An interesting question, then, is whether the cfr80 achieved by relevance selection under continuous training is the minimum cfr80 for its run set.
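To make the cost model concrete, here is a minimal sketch in Python (a hypothetical helper of my own, not the actual experiment code), assuming unit labelling cost and perfect knowledge of which documents are relevant:

```python
import math

def cost_for_recall(ranking, relevant, train_set, target_recall=0.8):
    """Total cost of one run: training set size plus the review depth
    needed to reach the target recall over the whole collection.

    ranking   -- document ids ranked by predicted responsiveness,
                 training documents excluded
    relevant  -- set of all relevant document ids
    train_set -- set of document ids labelled during training
    """
    target = math.ceil(target_recall * len(relevant))
    found = len(relevant & train_set)  # relevant docs already found in training
    depth = 0
    for doc in ranking:
        if found >= target:
            break
        depth += 1
        if doc in relevant:
            found += 1
    return len(train_set) + depth
```

The cfr80 of a run set is then simply the minimum of this cost over all candidate switch points; under continuous training, review depth is zero and the cost is the final training set size.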
Each training sequence begins with a randomly-sampled seed set. For the uncertainty and relevance selection methods, random sampling continues until at least one relevant document enters the training set, at which point we switch to the appropriate selection method (for random selection, of course, random sampling continues throughout). As with previous experiments, the RCV1v2 dataset with its 103 topics is used again. As implied by the finite population scenario, training and target sets are not held distinct; instead, I take 700,000 of the 804,414 RCV1v2 documents as the single, undifferentiated collection. (I'm holding the remaining 104,414 documents in reserve for experiments on sampling and estimation.) The classifier is Vowpal Wabbit 7.4, with logistic loss, a learning rate of 2.0, and a single pass. Training proceeds in steps of 100 documents up to a training set size of 2,000; then in steps of 200 up to 5,000; in steps of 500 up to 20,000; and finally in steps of 2,000 to the end of the collection.
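For reference, that iteration schedule can be written out as a small helper (my own illustrative code, not the harness actually used):

```python
def training_schedule(collection_size):
    """Training set sizes at which the classifier is retrained:
    steps of 100 up to 2,000; of 200 up to 5,000; of 500 up to
    20,000; and of 2,000 up to the end of the collection."""
    sizes = []
    for step, limit in [(100, 2_000), (200, 5_000),
                        (500, 20_000), (2_000, collection_size)]:
        start = sizes[-1] + step if sizes else step
        sizes.extend(range(start, limit + 1, step))
    return sizes
```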
Let's begin with the cost curves for a particular topic, Topic E41.
There's a fair bit of information in this figure, so we'll take things bit by bit. The title provides the richness of the topic, along with the number of relevant documents that need to be located to achieve our target of 80% recall. (This target number is richness * 700,000 * 0.8, with 700,000 being the collection size.) The x axis states the training set size as a multiple of the target size, to facilitate comparison between topics of different richnesses. The y axis gives the cost to achieve 80% recall, again as a multiple of target size. (Note the log scale on the y axis, giving greater clarity to the curve minima.) The absolute minimum of relative cost for recall is 1.0, which would occur if all the documents seen, both in training and review, were relevant.
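The conversion to these relative units is straightforward; here is a sketch (illustrative code of my own, using the collection size from this experiment):

```python
COLLECTION_SIZE = 700_000

def target_size(richness, target_recall=0.8):
    """Number of relevant documents that must be found to hit target recall."""
    return round(richness * COLLECTION_SIZE * target_recall)

def as_relative(documents, richness):
    """Express a training set size or a total cost as a multiple of target size."""
    return documents / target_size(richness)
```

For example, a topic with 5% richness has a target size of 0.05 * 700,000 * 0.8 = 28,000 documents, so a total cost of 42,000 documents reviewed corresponds to a relative cost of 1.5.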
The green, blue, and red curves give the relative cost for recall of the uncertainty, random, and relevance selection methods respectively, as a function of the training set size on the x axis. Lower values on these curves are better (indicating less total review cost). Observe that the (green) curve for the uncertainty method falls sharply over the initial training runs, indicating that the quality of the predictive ranking rapidly improves, and so review depth quickly decreases. The improvement slows over time, however, and after a relative training set size of around 0.5 (that is, a training set of around 11,863 * 0.5 ≈ 6,000 documents), the total cost curve begins to rise. This does not (necessarily) mean that the quality of the predictive ranking is actually getting worse; rather, it indicates that the added cost in training exceeds the saving in review. In contrast, the cost curve for the relevance selection method decreases much more slowly; until the relative training set size is almost 1.0 (≈ 11,000 documents), relevance selection does not consistently beat random selection. After that, though, it drops quickly, falling below the uncertainty selection curve.
The trajectories of the cost curves illustrate how the different methods behave; what we really care about, however, is the minimum point of each curve, as this gives the actual total review cost that would be achieved (assuming, as we do throughout this post, perfect knowledge). These minimum costs are marked as annotations along the y axis. Here, we can see that the minimum cfr80 values of the uncertainty and relevance methods are quite similar (1.36 and 1.41 respectively, in relative terms), and certainly both are much lower than the minimum for random selection (4.08). The protocol that achieves this minimum for uncertainty selection is to spend around 40% of the effort in training and 60% in review; for relevance selection, it is to spend most or all of the effort in training; but the total cost is much the same either way.
A question posed in my previous post was whether the relevance selection method does indeed achieve its optimal cost by following a continuous training protocol (that is, continuing to train until target recall is achieved in the training set, without a separate review step). This question can be answered (somewhat indirectly) by considering the line along which training set size and cfr80 are equal (shown as the dotted grey curve in the figure; it appears curved because of the log y axis). When the cost curve of a method touches or crosses this line, the training set by itself achieves target recall. Here, the relevance cost curve touches the training==cost line at or near the cost curve's minimum, indicating that continuous training achieves (near-)minimum cost on this set of runs.
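The continuous training stopping rule itself is simple to state in code (again a hypothetical sketch of my own, with perfect knowledge of relevance):

```python
import math

def continuous_training_cost(train_sets, relevant, target_recall=0.8):
    """Cost under continuous training: keep training until the training
    set alone contains the target number of relevant documents; total
    cost is then just the training set size at that point.

    train_sets -- successive training sets, one per iteration (each a
                  set of document ids, growing across iterations)
    relevant   -- set of all relevant document ids
    """
    target = math.ceil(target_recall * len(relevant))
    for train_set in train_sets:
        if len(relevant & train_set) >= target:
            return len(train_set)
    return None  # target recall never reached within training alone
```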
Having understood the figure format for an example topic, let's look at a larger set of topics.
The above figure shows the cost curves for 14 randomly selected topics. (Selection was as follows: 3 randomly selected from topics with richness < 0.5% [top row]; 3 from topics with richness between 0.5% and 2% [second row]; 3 between 2% and 5% [third row]; 3 between 5% and 15% [fourth row]; and 2 from topics above 15% richness [bottom row].) Cost curves for all 103 topics can be found here.
Considering the cost curves themselves, the general pattern is for uncertainty and relevance selection to track each other very closely for topics with richness below 1%. For richness above 1%, the uncertainty cost curve tends to fall faster and earlier than the relevance curve, with this tendency becoming more pronounced as richness increases; indeed, for topics with richness above 5%, relevance selection frequently leads to initially rising cost curves.
Considering minimum review cost, however, the difference in effort between relevance and uncertainty selection is only slight; only 3 topics show a difference of more than 13% one way or the other. That is, relevance and uncertainty selection have different trajectories, increasingly so for higher richness, calling for different protocols (for uncertainty, limited training followed by review; for relevance, continuous training throughout), but end up with very similar total costs.
Additionally, there are only three topics, all low richness ones (G159 [0.00% rich], E61 [0.05% rich], and C16 [0.24% rich]), for which the optimal stopping point for relevance selection comes well before (> 10%) the stopping point for continuous training (that is, the point at which target recall is achieved by the training set alone). That is, continuous training is indeed the correct protocol to use for relevance selection. (Moreover, particularly for higher prevalence topics, using relevance selection without following continuous training through to the end can lead to very poor results.) For around a third of the topics, continuous training also gives a total cost within 10% of optimal for uncertainty selection, but these are almost exclusively low richness topics (richness < 1%), where, as previously observed, the uncertainty and relevance cost curves are very similar. (It may well be that the two methods in fact uncover very much the same relevant documents during training, though I have not yet verified this.)
Meanwhile, it is clear from these graphs that random selection is a very expensive method for topics with low richness. Only for topics with richness above 2.5% can one be reasonably confident that random selection will not lead to more than twice the total review cost of uncertainty selection, and only for topics with richness above 10% is the additional cost consistently less than 20%. Random selection does have benefits beyond total cost—its simplicity, its lack of bias, and the ability (with care) to use training data for estimation and testing purposes—but I would suggest that once richness goes substantially below 5%, these benefits are outweighed by the additional cost.
Finally, let's summarize these minimum total cost results, grouped by topic richness. Again, total cost is expressed as a multiple of the number of relevant documents needed to achieve 80% recall:
| Richness | Random | Uncertainty | Relevance |
|----------|--------|-------------|-----------|
| 0.5% -- 2% | 10.26 | 2.24 | 2.26 |
| 2% -- 5% | 3.14 | 1.64 | 1.66 |
| 5% -- 15% | 1.71 | 1.26 | 1.28 |
Clearly, all methods struggle with extremely low richness topics. (For such topics, we really require a better seed set selection method than random sampling, a question I hope to return to in a later post.) There is no discernible difference in total cost between relevance and uncertainty selection. And, encouragingly, the total review cost for these methods is only a little above twice the number of relevant documents for low richness topics, and less for higher ones (albeit on what is, in e-discovery terms, a very clean collection with relatively straightforward topics). Random selection does almost as well as the active methods for high richness topics, but becomes increasingly inefficient as topic richness falls.
Although there is little difference in minimum total cost between uncertainty and relevance selection, the trajectories by which the two methods reach these similar costs are quite different. As mentioned in my previous post, the continuous training protocol that relevance selection supports has a simpler stopping decision than uncertainty selection (one-dimensional rather than two-dimensional). On the other hand, the generally flat minimum of uncertainty selection, combined with the gentle rise in total cost once this minimum is passed, suggests that the first stopping decision in uncertainty selection (that is, when to stop training) may not be too sensitive, and that uncertainty selection may be more flexible in allowing the separation of training and review. The cost curves also suggest the attractiveness of a composite strategy; for instance, uncertainty selection at the start, moving to relevance selection later on (perhaps joined with some random selection for high-richness topics).
It is important to note, however, that the above discussion paints an overly clean picture, because of the assumption of perfect knowledge of cost for recall at any point, and of the trajectory of that cost with further training. In reality, we lack such perfect knowledge, and have to rely on estimation methods, rules of thumb, and experience. Actual decisions about strategy and stopping points are made in a fog of uncertainty; and this fog becomes thicker as richness drops and sampling methods lose traction. This, too, will be a subject of future posts.
In the above table, showing average costs of the different selection methods for topics in different richness bands, the random method appears cheaper on average than the other two methods for topics with richness below 0.5%. In fact, random is considerably more expensive on almost all such topics. On the two topics with the lowest richness, however, random is slightly cheaper; and the costs of these two topics are so high that they outweigh all the others in the average. These two topics have 5 and 31 relevant documents respectively out of the population of 700,000; that is, they have richness below 0.01%. Why the random method does better on these two topics is unclear; it may simply be due to chance. In any case, such extremely low richness topics are doubtful candidates for predictive coding altogether, and certainly very poor choices for random seed sets. If we place topics with richness lower than 0.05% (1 document in 2,000) in a separate category, we get the following table:
| Richness | Random | Uncertainty | Relevance |
|----------|--------|-------------|-----------|
| 0.05% -- 0.5% | 78.62 | 11.41 | 11.32 |
| 0.5% -- 2% | 10.26 | 2.24 | 2.26 |
| 2% -- 5% | 3.14 | 1.64 | 1.66 |
| 5% -- 15% | 1.71 | 1.26 | 1.28 |
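This anomaly is a general hazard of averaging relative costs within a band: a couple of extreme topics can swamp the mean. A toy illustration (the numbers here are invented, not the actual per-topic costs):

```python
# Invented relative costs for five hypothetical low-richness topics.
random_costs = [400.0, 250.0, 12.0, 11.0, 10.0]
uncert_costs = [450.0, 300.0, 2.2, 2.1, 2.0]

def mean(xs):
    return sum(xs) / len(xs)

# Uncertainty is far cheaper on three of the five topics, yet the two
# extreme topics dominate the mean and make random look cheaper overall.
cheaper_per_topic = sum(u < r for u, r in zip(uncert_costs, random_costs))
```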
Thanks to Ralph Losey for pointing out this anomaly (and urging me to do something about it!).