Text classifiers (or predictive coders) are in general trained iteratively, with training data added until acceptable effectiveness is achieved. Some method of measuring or estimating effectiveness is required, or, more precisely, of predicting effectiveness on the remainder of the collection. The simplest way of measuring classifier effectiveness is with a random sample of documents, from which a point estimate and confidence bounds on the selected effectiveness metric (such as recall) are calculated. It is essential, by the way, to make sure that these test documents are not also used to train the classifier, as testing on the training examples greatly exaggerates classifier effectiveness. (Techniques such as cross-validation simulate testing from training examples, but we'll leave them for a later discussion.)
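As a rough illustration of the point estimate and confidence bounds mentioned above, here is a minimal sketch for recall. It assumes a simple random sample in which every relevant document has been checked against the classifier's output, and it uses a Wilson score interval conditioned on the number of relevant documents found in the sample. That choice of interval, and the names `wilson_interval` and `recall_estimate`, are my own assumptions for the example, not a method prescribed here.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("need at least one observation")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return centre - half, centre + half


def recall_estimate(true_positives, false_negatives, z=1.96):
    """Point estimate and approximate interval for recall from a test sample.

    true_positives:  relevant sample documents the classifier retrieved
    false_negatives: relevant sample documents the classifier missed
    Assumption: the interval treats the number of relevant documents in the
    sample as fixed, which is a simplification.
    """
    n_relevant = true_positives + false_negatives
    low, high = wilson_interval(true_positives, n_relevant, z)
    return true_positives / n_relevant, low, high


if __name__ == "__main__":
    # Hypothetical sample: 40 relevant documents, 30 of them retrieved.
    point, low, high = recall_estimate(30, 10)
    print(f"recall = {point:.2f}, approx. 95% interval = [{low:.2f}, {high:.2f}]")
```

Note how wide the interval is when the sample holds only a few dozen relevant documents; that width is the random error that matters in what follows.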
The test sample for a predictive coding run is commonly drawn once at the outset. Such a sample is referred to as a control sample. Once drawn, the control sample is held fixed, and the iteratively trained classifier is tested against it. A variant is to grow both the training and the test sets simultaneously, perhaps to optimize the measurement–performance tradeoff, or to capture in the test set any evolution in the conception of relevance that occurs during training. A third combination is to hold the training set fixed and increase only the test set, intermixed with repeated testing. This is not a common setup for a control sample, but it is a possible approach for certifying an e-discovery production, achieving acceptable measurement precision with minimal effort.
Unfortunately, as Mossaab Bagdouri, Dave Lewis, Doug Oard, and I show in our recent paper, "Sequential Testing in Classifier Evaluation Yields Biased Estimates of Effectiveness", all of these methods produce biased estimates, tending to exaggerate the effectiveness of the classifier being evaluated. The bias arises because the methods all involve sequential testing, in which one tests repeatedly until some goal is achieved. Though the individual tests may be unbiased (that is, the average estimate they give is accurate), they are subject to random error; and the combination of random error with repeated testing against a threshold results in a biased estimate.
Sequential testing bias is easiest to observe in a fourth testing regime, where a new test sample is drawn at each iteration (easiest, because sampling error is independent between tests). Such a regime occurs where each iteration's test examples become the next iteration's training. Let's take our stopping criterion as being that estimated recall (at some cutoff) reaches a threshold of 75%. To keep the example simple, we'll make the assumption that true classifier recall improves linearly with training examples (in reality, the learning curve is more shoulder-shaped, and has considerable local variability). Then our notional learning curve might look like so:
Here, true effectiveness reaches the threshold after 1500 training examples. But in practice we don't see the actual learning curve, only sample-based estimates of it, which are sometimes higher, sometimes lower than true recall (as we randomly over- or under-sample examples the classifier labels correctly). For simplicity, assume that the likely degree of error is the same at every sampling point (to be precise, assume normally-distributed error with a 0.1 standard deviation). Training and testing in iterations of 50 documents, a production run might look like so:
The production manager sees only the recall estimates (the x's above), and stops the run at the first estimate above the 75% threshold. Each estimate is as likely to overstate as it is to understate recall; but the estimate which first falls above the threshold is more likely to be an overestimate. For the run depicted, the first above-threshold estimate occurs with 1250 training documents; we stop with estimated recall of 77%, but actual recall of 62%.
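The reason the first above-threshold estimate tends to be an overestimate can be written out for this simplified regime. The notation is introduced only for this sketch and is not taken from the paper: r_k is true recall at iteration k, \hat{r}_k the sample estimate, \tau the threshold, \sigma the standard deviation of the (independent, normally distributed) error, and \Phi and \varphi the standard normal distribution and density functions.

```latex
\begin{align*}
\hat{r}_k &= r_k + \varepsilon_k, \qquad \varepsilon_k \sim \mathcal{N}(0, \sigma^2)
  \text{ independently across iterations} \\
p_k &= \Pr(\hat{r}_k \ge \tau) = 1 - \Phi(a_k), \qquad a_k = \frac{\tau - r_k}{\sigma} \\
\Pr(\text{stop at } k) &= p_k \prod_{j < k} (1 - p_j) \\
\mathbb{E}\left[\varepsilon_k \mid \text{stop at } k\right]
  &= \mathbb{E}\left[\varepsilon_k \mid \varepsilon_k \ge \tau - r_k\right]
   = \sigma \, \frac{\varphi(a_k)}{1 - \Phi(a_k)} > 0
\end{align*}
```

The last line is the mean of a truncated normal: conditional on stopping, the error is positive on average, and the further true recall is below the threshold (the larger a_k), the larger the expected overstatement.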
Repeating the above experiment many times, the probability of stopping at a given actual recall is distributed as follows:
The mean true stopping recall is 68%, 7 percentage points below the threshold of 75%; the probability that we stop with an actual recall at or above the threshold is only 20%.
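The repeated experiment can be sketched in a few lines of Python. This is my own minimal reconstruction under the stated simplifications (linear learning curve reaching 75% at 1500 training examples, iterations of 50 documents, independent normal error with standard deviation 0.1), plus one assumption the text does not spell out: that the line starts from zero recall with no training examples. The function names are illustrative only.

```python
import random

THRESHOLD = 0.75          # stop when estimated recall reaches this
DOCS_AT_THRESHOLD = 1500  # training examples at which true recall hits 75%
BATCH = 50                # documents added per iteration
NOISE_SD = 0.1            # standard deviation of the sampling error


def true_recall(n_train):
    """Assumed learning curve: a straight line through the origin, capped at 1."""
    return min(1.0, THRESHOLD * n_train / DOCS_AT_THRESHOLD)


def run_once(rng):
    """Train in batches; stop at the first estimate at or above the threshold."""
    n_train = 0
    while True:
        n_train += BATCH
        actual = true_recall(n_train)
        estimate = rng.gauss(actual, NOISE_SD)  # fresh, independent test sample
        if estimate >= THRESHOLD:
            return n_train, estimate, actual


def simulate(n_runs=20000, seed=0):
    """Return mean true recall at stopping, mean estimate at stopping,
    and the fraction of runs that stop at or beyond the true threshold point."""
    rng = random.Random(seed)
    runs = [run_once(rng) for _ in range(n_runs)]
    mean_actual = sum(actual for _, _, actual in runs) / n_runs
    mean_estimate = sum(est for _, est, _ in runs) / n_runs
    frac_ok = sum(n >= DOCS_AT_THRESHOLD for n, _, _ in runs) / n_runs
    return mean_actual, mean_estimate, frac_ok


if __name__ == "__main__":
    mean_actual, mean_estimate, frac_ok = simulate()
    print(f"mean true recall at stopping:      {mean_actual:.3f}")
    print(f"mean estimated recall at stopping: {mean_estimate:.3f}")
    print(f"fraction stopping at >= 75% true:  {frac_ok:.3f}")
```

Under these assumptions the simulation should land close to the figures quoted above, though the exact values depend on the seed, the number of runs, and where the learning curve is assumed to start. The estimate at stopping is above 75% in every run by construction, which is exactly what makes it a poor guide to the true value.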
All three of the sequential testing methods we examine in our paper suffer from a similar, optimistic bias. The bias is no surprise for fixed training, growing test, as this is the classic scenario for sequential methods (though, regrettably, the standard treatments do not trivially extend to composite measures like recall). The cause of bias with fixed test and growing training is less obvious; it lies not in varying random error in the test set, but (what is more difficult to quantify) random variability in the interaction between the fixed test and expanding training sets, as we by chance select training examples that make the classifier perform better or worse on the test set than on the collection as a whole. For the setup explored in our paper, a confidence interval that nominally had a 95% chance of covering the true score value (for us, F1), in fact only had 92% coverage under sequential testing; but the precise degree of optimism will vary depending upon collection, task, frequency of testing, and other factors.
What are the implications of this finding for e-discovery practice? First, and most importantly, a control sample used to guide the producing party's process cannot also be used to provide a statistically valid estimate of that process's result. So if the producing party says "we kept training until our control sample estimated we had 75% recall", you should expect the production actually to have less than 75% recall. The producing party can make various heuristic adjustments for their own internal purposes, and we provide some empirical adjustments in our paper. But if a statistically valid estimate of true production effectiveness is required—and we should think long and hard before we concede that it is not—then a separate certification sample must be made.