In a previous post, I compared three methods of selecting training examples for predictive coding—random, uncertainty and relevance. The methods were compared on their efficiency in improving the accuracy of a text classifier; that is, the number of training documents required to achieve a certain level of accuracy (or, conversely, the level of accuracy achieved for a given number of training documents). The study found that uncertainty selection was consistently the most efficient, though there was no great difference between it and relevance selection on very low richness topics. Random sampling, in contrast, performed very poorly on low richness topics.
In e-discovery, however, classifier accuracy is not an end in itself (though many widely-used protocols treat it as such). What we care about, rather, is the total amount of effort required to achieve an acceptable level of recall; that is, to find some proportion of the relevant documents in the collection. (We also care about determining to our satisfaction, and demonstrating to others, that that level of recall has been achieved—but that is beyond the scope of the current post.) A more accurate classifier means a higher precision in the candidate production for a given level of recall (or, equivalently, a shallower cutoff depth in the predictive ranking), which in turn saves cost in post-predictive first-pass review. But training the classifier itself takes effort, and after some point, the incremental saving in review effort may be outweighed by the incremental cost in training.
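This tradeoff can be made concrete with a toy cost model. The per-document costs and figures below are hypothetical illustrations of my own, not numbers from any study; the point is only that more training can buy a shallower review cutoff, up to the point where it no longer pays for itself:

```python
def total_cost(n_train, review_depth, cost_per_train=5.0, cost_per_review=1.0):
    """Total effort: training-stage review plus first-pass review.

    Hypothetical per-document costs: training (often done by senior
    attorneys) is assumed more expensive per document than first-pass
    contract review.
    """
    return n_train * cost_per_train + review_depth * cost_per_review

# More training shrinks the cutoff depth needed for target recall,
# so heavier training can still come out cheaper overall...
print(total_cost(1000, 50_000))  # modest training, deep review: 55000.0
print(total_cost(5000, 20_000))  # heavy training, shallower review: 45000.0
# ...but only while each extra training document saves more than it costs.
```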
In fact, the real goal is not to find a certain proportion of the collection's relevant documents in (post-predictive) review, but to find that proportion by whatever means available. Put another way, the collection is of finite size, and so our recall target amounts to finding a finite number of relevant documents (though determining what this number is is a tricky question in practice); a task Dave Lewis refers to as finite population annotation. Each relevant document found prior to review—in particular, in training—is one fewer that has to be found in the review stage. (Of course, the precise cost saving depends upon the protocol employed. If we have expensive experts train, and cheap reviewers do the first-pass review, then the saving is less than 1:1. If, though, every relevant document found in first-pass review is then checked by an expert reviewer in second-pass review, the saving may be greater than 1:1.)
Going further, the (near-)equivalence of finding relevant documents in training or review can be taken to its logical conclusion by entirely removing the separate review stage, and continuing the training until all relevant documents are found. Such a process is referred to as "continuous ranking" by Catalyst, or "continuous (active) learning" by Cormack and Grossman. Other vendors refer to it as "prioritization" (though frequently the prioritization stage follows a more regular training stage, and may be performed as first+second pass review rather than single-reviewer training). Relevance selection is primarily motivated by this mode of processing, though it might be that mixing in other selection types would give better efficiency (this certainly is Catalyst's approach), or conversely that periods of relevance selection would be optimal in a process with separate training and review steps. (Another way of looking at this process, when used with relevance selection, is that we skip training and start review immediately, but then feed back the reviewed documents to a predictive coding system to improve the review order.)
With the above discussion in mind, we can identify three possible (not necessarily mutually exclusive) protocols to achieve a target level of recall:
- Train until classifier accuracy "plateaus" (that is, stops improving noticeably); then review the subsequent predictive ranking to the depth required to achieve recall. I'll refer to this as "train–plateau–review", or TPR for short.
- After each round of training, look at the current predictive ranking, and try to determine whether it would be more efficient to do more training to improve this ranking, or to review the ranking as it is. I'll refer to this as "train–rank–review", or TRR for short.
- Continue "training" until you've achieved target recall in the positive training examples alone. I'll refer to this protocol as "continuous training", or CT for short.
(A fourth—or perhaps we should say zeroth—protocol is widely used, which is to train to the plateau, then review all and only the documents to which the classifier gives a positive prediction of relevance. This protocol, however, does not guarantee that target recall is achieved; one has to be satisfied with whatever level of recall is embodied in the classifier's set of positive predictions. The other protocols give us the option of sacrificing precision—that is, reviewing further down the predictive ranking—in order to achieve target recall.)
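The stopping conditions of the three protocols can be sketched as simple predicates. This is a hedged illustration only—the function names, thresholds, and cost inputs are my own inventions, and in practice each quantity would itself be an uncertain estimate (e.g. from a control set):

```python
def ct_stop(relevant_found_in_training, estimated_total_relevant,
            target_recall=0.75):
    """CT: stop once the positive training examples alone reach target recall."""
    return relevant_found_in_training >= target_recall * estimated_total_relevant

def tpr_plateau(accuracy_history, window=3, epsilon=0.005):
    """TPR, step 1: the classifier has 'plateaued' if accuracy gains over
    the last `window` iterations fall below `epsilon`."""
    if len(accuracy_history) <= window:
        return False
    return accuracy_history[-1] - accuracy_history[-1 - window] < epsilon

def trr_stop(review_cost_if_stop_now, projected_cost_with_more_training):
    """TRR: stop when further training is not expected to pay for itself,
    comparing cost-to-completion now against cost after another round."""
    return projected_cost_with_more_training >= review_cost_if_stop_now
```

Note that only `ct_stop` is one-dimensional in Cormack's sense; `trr_stop` presupposes that we can also project the review depth (and hence cost) the current ranking implies.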
Note that these three protocols involve stopping decisions of differing complexities. The CT protocol has a single-dimensional stopping decision: have we achieved recall in the training set? The TRR protocol requires a two-dimensional stopping decision: how far down the current ranking do we have to go to achieve recall; and would additional training reduce this depth by more than the cost of the training itself? (I owe this distinction and its visualization in terms of dimensions to Gord Cormack.) The complexity of the TPR protocol is intermediate, and can be likened to two sequential one-dimensional decisions: first, has the classifier stopped improving?; and second (and only then), what cutoff in the resultant ranking achieves recall?
The significance of TRR's greater stopping-rule complexity over CT depends on the sensitivity of the overall cost to an inaccurate choice in TRR's first dimension; that is, to training moderately too little or moderately too much. If the sensitivity is not great, then TRR's additional complexity is not of great concern; we might, for instance, look at total cost estimates across training iterations to date, and stop training when these estimates start trending up (an approach which would also assume that the historical cost curve is relatively smooth; that is, without sharp local minima that are distant from the global minimum). Furthermore, if it turns out that the optimal amount of training for TRR generally occurs near the point where classifier effectiveness plateaus, then the TPR and TRR protocols differ little in their practical results.
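The trend-watching idea can be sketched as follows. This is a hypothetical illustration, not an established rule: we track a per-iteration estimate of total cost (training so far plus projected review to target recall) and stop once it has risen for several consecutive iterations, implicitly assuming the reasonably smooth cost curve mentioned above:

```python
def stop_on_upward_trend(cost_estimates, patience=2):
    """Return the training iteration at which to stop, given total-cost estimates.

    cost_estimates[i] is the estimated total cost (training effort so far
    plus projected review effort to target recall) if training stops after
    iteration i. We stop once costs have risen for `patience` consecutive
    iterations; `patience` guards against reacting to noise or a shallow
    local minimum in the cost curve.
    """
    rising = 0
    for i in range(1, len(cost_estimates)):
        if cost_estimates[i] > cost_estimates[i - 1]:
            rising += 1
            if rising >= patience:
                # Step back to the iteration before the upward run began.
                return i - patience
        else:
            rising = 0
    return len(cost_estimates) - 1  # no sustained upturn observed yet

costs = [100, 90, 85, 83, 84, 87, 91]
print(stop_on_upward_trend(costs))  # → 3, the iteration with minimum estimated cost
```

In practice the cost estimates themselves would come from control-set sampling and so carry their own error bars, which is where the "fog of uncertainty" below comes in.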
(Of course, the complexity also depends upon the method we use for estimating target recall. I'm avoiding tackling that problem directly here, though the general idea I have in mind is that we're using control-set sampling methods for estimation. In practice, all of these decisions will be made in a fog of uncertainty induced by our only seeing what happens on the sample control set, not on the full population. As a recent line of discussion has emphasized, this fog becomes increasingly thick as collection richness becomes increasingly low. Grossman and Cormack suggest in their recent law journal article that observed collection richness is dropping so low that reliable sampling becomes too expensive and we are forced to incorporate other sources of evidence to determine a defensible stopping point; if so, the question of stopping rule complexity becomes even more vexed.)
Hopefully, the above discussion has not been too abstract to follow. We'll be asking the above questions of real experimental data in subsequent posts, which should make the issues more concrete. This post, however, should not be regarded as a mere preliminary to the experimental work. Understanding the differences between these protocols, how they interact with selection methods, and their implications for practice is certainly a prerequisite to, and perhaps in itself more important than, the particular answers that experimental results produce. If we ask the wrong questions, we'll only end up with unhelpful answers.
Thanks to (in reverse alphabetical order) Jeremy Pickens, Dave Lewis, Paul Hunter, Maura Grossman, and Gord Cormack for their helpful comments on this post and general discussion of these issues. Of course, this post represents my opinions, not necessarily theirs.