The bias of sequential testing in predictive coding

Text classifiers (or predictive coders) are in general trained iteratively, with training data added until acceptable effectiveness is achieved. Some method of measuring or estimating effectiveness is required (or, more precisely, of predicting effectiveness on the remainder of the collection). The simplest way of measuring classifier effectiveness is with a random sample of documents, from which a point estimate and confidence bounds on the selected effectiveness metric (such as recall) are calculated. It is essential, by the way, to make sure that these test documents are not also used to train the classifier, as testing on the training examples greatly exaggerates classifier effectiveness. (Techniques such as cross-validation simulate testing from training examples, but we'll leave them for a later discussion.)

The test sample for a predictive coding run is commonly drawn once at the outset. Such a sample is referred to as a control sample. Once drawn, the control sample is held fixed, and the iteratively-trained classifier tested against it. A variant is to grow both the training and the test sets simultaneously, perhaps to optimize the measurement–performance tradeoff, or to capture in the test set any evolution in the conception of relevance that occurs during training. A third combination is to hold the training set fixed, and increase only the test set, intermixed with repeated testing—not a common setup for a control sample, but a possible approach for certifying an e-discovery production, to achieve acceptable measurement precision with minimal effort.

Unfortunately, as Mossaab Bagdouri, Dave Lewis, Doug Oard, and I show in our recent paper, "Sequential Testing in Classifier Evaluation Yields Biased Estimates of Effectiveness", all of these methods produce biased estimates, tending to exaggerate the effectiveness of the classifier being evaluated. The bias arises because the methods all involve sequential testing, in which one tests repeatedly until some goal is achieved. Though the individual tests may be unbiased (that is, the average estimate they give is accurate), they are subject to random error; and the combination of random error with repeated testing against a threshold results in a biased estimate.

Sequential testing bias is easiest to observe in a fourth testing regime, where a new test sample is drawn at each iteration (easiest, because sampling error is independent between tests). Such a regime occurs where each iteration's test examples become the next iteration's training. Let's take our stopping criterion as being that estimated recall (at some cutoff) reaches a threshold of 75%. To keep the example simple, we'll make the assumption that true classifier recall improves linearly with training examples (in reality, the learning curve is more shoulder-shaped, and has considerable local variability). Then our notional learning curve might look like so:

Notional classifier learning curve.


Here, true effectiveness reaches the threshold after 1500 training examples. But in practice we don't see the actual learning curve, only sample-based estimates of it, which are sometimes higher, sometimes lower than true recall (as we randomly over- or under-sample examples the classifier labels correctly). For simplicity, assume that the likely degree of error is the same at every sampling point (to be precise, assume normally-distributed error with a 0.1 standard deviation). Training and testing in iterations of 50 documents, a production run might look like so:

Notional classifier learning curve, with sample-based estimates of recall.


The production manager sees only the recall estimates (the x's above), and stops the run at the first estimate above the 75% threshold. Each estimate is as likely to overstate as it is to understate recall; but the estimate which first falls above the threshold is more likely to be an overestimate. For the run depicted, the first above-threshold estimate occurs with 1250 training documents; we stop with estimated recall of 77%, but actual recall of 62%.
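A single run of this kind can be mimicked with a few lines of Python. This is a minimal sketch under stated assumptions, not the code behind the figure: it takes the notional learning curve to be a straight line through the origin (the text only fixes the point of 75% recall at 1500 documents), and the names true_recall and run_until_threshold are made up for illustration.

```python
import random

THRESHOLD = 0.75          # stopping threshold on estimated recall
STEP = 50                 # documents added per iteration
SD = 0.1                  # standard deviation of the estimation error
DOCS_AT_THRESHOLD = 1500  # true recall reaches the threshold here


def true_recall(n_train):
    """Assumed linear learning curve through the origin, capped at 1.0."""
    return min(1.0, THRESHOLD * n_train / DOCS_AT_THRESHOLD)


def run_until_threshold(rng):
    """Train in steps until the estimate first reaches the threshold.

    Returns (training set size, estimated recall, true recall) at the stop.
    """
    n_train = 0
    while True:
        n_train += STEP
        actual = true_recall(n_train)
        # Each estimate is unbiased but noisy: normal error around the truth.
        estimate = rng.gauss(actual, SD)
        if estimate >= THRESHOLD:
            return n_train, estimate, actual


if __name__ == "__main__":
    n, est, act = run_until_threshold(random.Random(1))
    print(f"stopped at {n} docs: estimated {est:.2f}, actual {act:.2f}")
```

Different seeds stop at different points, but by construction the estimate at the stopping point is always at or above the threshold, whatever the actual recall happens to be.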

Repeating the above experiment many times, the probability of stopping at a given actual recall is distributed as follows:

Distribution of final recall estimate for nominal 75% threshold under sequential testing.


The mean true stopping recall is 68%, 7 percentage points below the threshold of 75%; the probability that we stop with an actual recall at or above the threshold is only 20%.
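For this simplified regime the stopping distribution does not need to be simulated; it can be computed directly, because the estimation errors are independent between iterations. The sketch below is one way to do that, under the same simplifying assumptions as the example (normal error with standard deviation 0.1, iterations of 50 documents) plus one assumption of its own: that the linear learning curve passes through the origin. The function names are hypothetical.

```python
import math

THRESHOLD = 0.75          # stopping threshold on estimated recall
STEP = 50                 # documents added per iteration
SD = 0.1                  # standard deviation of the estimation error
DOCS_AT_THRESHOLD = 1500  # true recall reaches the threshold here


def true_recall(n_train):
    """Assumed linear learning curve through the origin, capped at 1.0."""
    return min(1.0, THRESHOLD * n_train / DOCS_AT_THRESHOLD)


def prob_estimate_reaches_threshold(actual):
    """P(estimate >= THRESHOLD) when estimate ~ Normal(actual, SD)."""
    z = (actual - THRESHOLD) / SD
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def stopping_distribution(max_iterations=200):
    """Return a list of (true recall at stop, probability of stopping there)."""
    dist = []
    still_running = 1.0  # probability that no earlier estimate hit the threshold
    for k in range(1, max_iterations + 1):
        actual = true_recall(k * STEP)
        p_stop = prob_estimate_reaches_threshold(actual)
        dist.append((actual, still_running * p_stop))
        still_running *= 1.0 - p_stop
    return dist


dist = stopping_distribution()
mean_recall = sum(r * p for r, p in dist)
# Small tolerance guards against floating-point rounding at exactly 0.75.
p_at_or_above = sum(p for r, p in dist if r >= THRESHOLD - 1e-9)

print(f"mean true recall at stop: {mean_recall:.3f}")
print(f"P(true recall >= threshold at stop): {p_at_or_above:.3f}")
```

Under these assumptions the two printed values should land close to the figures quoted above; a different intercept or error model would shift them, but not the direction of the bias.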

All three of the sequential testing methods we examine in our paper suffer from a similar, optimistic bias. The bias is no surprise for fixed training, growing test, as this is the classic scenario for sequential methods (though, regrettably, the standard treatments do not trivially extend to composite measures like recall). The cause of bias with fixed test and growing training is less obvious; it lies not in varying random error in the test set, but (what is more difficult to quantify) random variability in the interaction between the fixed test and expanding training sets, as we by chance select training examples that make the classifier perform better or worse on the test set than on the collection as a whole. For the setup explored in our paper, a confidence interval that nominally had a 95% chance of covering the true score value (for us, F1), in fact only had 92% coverage under sequential testing; but the precise degree of optimism will vary depending upon collection, task, frequency of testing, and other factors.

What are the implications of this finding for e-discovery practice? First, and most importantly, a control sample used to guide the producing party's process cannot also be used to provide a statistically valid estimate of that process's result. So if the producing party says "we kept training until our control sample estimated we had 75% recall", you should expect the production actually to have less than 75% recall. The producing party can make various heuristic adjustments for their own internal purposes, and we provide some empirical adjustments in our paper. But if a statistically valid estimate of true production effectiveness is required—and we should think long and hard before we concede that it is not—then a separate certification sample must be made.

6 Responses to “The bias of sequential testing in predictive coding”

  1. gvc says:

    William,

    I am happy to see that you and your colleagues have shed some light on an all-too-common fallacy in statistics -- one that is implicit in a large number of edisco protocols I've seen.

    It should be noted that what you call certification sampling can only be done once, or else the same fallacy applies. If you think that you maybe have achieved 70% recall, and your sample shows otherwise, you can't just do a couple of tweaks and then sample again.

    You can think of this as a sports playoff series, and a sample is a game. You can't just play games until you win one, then give yourself the cup. You can't even double down until you win; i.e. when you lose the first, go for best of three; if you lose two, go for best of five; etc. You have to fix the length of the series (1 game, 7 games, 9 games, etc.) in advance or it isn't fair.

    The bottom line is that you need to have very good reason to think that you are done before you do your final sample.

    cheers,
    Gordon

  2. william says:

    Excellent comment, Gordon, thanks. You're absolutely correct: the certification sample and estimate must be done only once. The producing party therefore needs to estimate not simply, "how good is our production?", but "how confident are we that we will pass the certification sample?"

  3. Bill Dimm says:

    Nicely done article and paper. I was actually just thinking about this issue last week. Doesn't your second graph also suggest a solution, at least in the case of your fourth testing regime, which is to fit a line/curve to the data and use the point where it crosses the threshold as the point estimate for the desired training set size? The uncertainty in the fit parameters would also give you the uncertainty in where the threshold is crossed, so you can find your 95% confidence point. Of course, it is less clear how well that would work in the three regimes contemplated in your paper since errors in adjacent points are not independent.

  4. william says:

    Bill,

    Hi! That is a very interesting idea. Applying it in practice would be complicated by the fact that the learning curve is highly variable, and the sample errors have different variances at different points. One could nevertheless achieve pragmatically useful smoothing from a line fitting exercise such as you suggest. Coming up with a strictly valid statistical estimate, though, would be much trickier.

  5. Ralph Losey says:

    Why do you say this? It seems like the second half of your sentence directly contradicts the first.

    "Each estimate is as likely to overstate as it is to understate recall; but the estimate which first falls above the threshold is more likely to be an overestimate. "

    If it is equal likelihood, why would overestimation be more likely? See my logic problem? It would seem to be equally as likely it would underestimate.

  6. william says:

    Well, to see it in another way, think of it like this. Each estimate is as likely to be an over- as an under-estimate. But over-estimates are more likely to fall above the threshold than under-estimates (because over-estimates are, on average, higher than under-estimates).
