In a previous post, I compared three methods of selecting training examples for predictive coding: random, uncertainty and relevance. The methods were compared on their efficiency in improving the accuracy of a text classifier; that is, the number of training documents required to achieve a certain level of accuracy (or, conversely, the level of accuracy achieved for a given number of training documents). The study found that uncertainty selection was consistently the most efficient, though there was no great difference between it and relevance selection on very low-richness topics. Random sampling, in contrast, performed very poorly on low-richness topics.
In e-discovery, however, classifier accuracy is not an end in itself (though many widely-used protocols treat it as such). What we care about, rather, is the total amount of effort required to achieve an acceptable level of recall; that is, to find some proportion of the relevant documents in the collection. (We also care about determining to our satisfaction, and demonstrating to others, that that level of recall has been achieved, but that is beyond the scope of the current post.) A more accurate classifier means a higher precision in the candidate production for a given level of recall (or, equivalently, a shallower cutoff depth in the predictive ranking), which in turn saves cost in post-predictive first-pass review. But training the classifier itself takes effort, and after some point, the incremental saving in review effort may be outweighed by the incremental cost in training.
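To make the trade-off concrete, here is a minimal sketch in Python. It rests on simplifying assumptions of my own rather than on any particular protocol: every document costs one unit of effort whether it is reviewed for training or in first-pass review, the training documents are not part of the ranked list, and relevant documents found during training are not counted towards recall. The function names and the two toy rankings are hypothetical, invented purely for illustration.

```python
import math


def cutoff_depth(ranked_labels, target_recall):
    """Smallest depth in the ranking at which the target recall is reached.

    ranked_labels holds 1 (relevant) or 0 (irrelevant) for each document,
    in the order the classifier ranks them, most likely relevant first.
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        raise ValueError("no relevant documents in the ranking")
    needed = math.ceil(target_recall * total_relevant)
    found = 0
    for depth, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            return depth
    return len(ranked_labels)


def precision_at(ranked_labels, depth):
    """Proportion of the top `depth` documents that are relevant."""
    return sum(ranked_labels[:depth]) / depth


def total_effort(n_training, ranked_labels, target_recall):
    """Training documents plus documents reviewed down to the cutoff.

    Assumes (for illustration only) one unit of effort per document.
    """
    return n_training + cutoff_depth(ranked_labels, target_recall)


# Hypothetical rankings of the same ten documents (four relevant):
# one from a classifier trained on 2 documents, one trained on 5.
lightly_trained = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
heavily_trained = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

for name, n_train, ranking in [
    ("lightly trained", 2, lightly_trained),
    ("heavily trained", 5, heavily_trained),
]:
    depth = cutoff_depth(ranking, 0.75)
    print(
        name,
        "cutoff depth:", depth,
        "precision:", precision_at(ranking, depth),
        "total effort:", total_effort(n_train, ranking, 0.75),
    )
```

In this toy case the more heavily trained classifier reaches 75% recall at a shallower cutoff and with higher precision, yet its total effort is greater, because the three extra training documents save only two documents of review.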