Comments on: The bias of sequential testing in predictive coding

By: Why training and review (partly) break control sets « Evaluating E-Discovery

Mon, 20 Oct 2014 04:22:38 +0000

[…] be wary about the use of control sets to certify the completeness of a production---besides the sequential testing bias inherent in repeated testing against the one control set, and the fact that control set relevance […]

By: william

william — Wed, 31 Jul 2013 23:54:49 +0000

In reply to Ralph Losey. To look at it another way: each estimate is as likely to be an overestimate as an underestimate. But overestimates are more likely to fall above the threshold than underestimates, because overestimates are, on average, higher than underestimates. So the first estimate to cross the threshold is more likely than not to be an overestimate.
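A small simulation may make the point concrete. This is only a sketch under stated assumptions, not the method from the paper: the learning curve is assumed to be linear, the estimation noise is assumed to be Gaussian and independent from one test to the next (in practice, repeated tests against one control set are correlated), and the names `first_crossing` and `overestimate_rates` are hypothetical.

```python
import random


def first_crossing(rng, threshold=0.75, noise_sd=0.05, steps=41):
    """Walk along an assumed learning curve, testing at each step, and stop
    at the first estimate that reaches the threshold.

    Returns (estimate, true_recall), or None if no estimate reaches it.
    """
    for i in range(steps):
        true_recall = 0.50 + 0.01 * i  # assumed linear curve, 0.50 to 0.90
        estimate = true_recall + rng.gauss(0.0, noise_sd)  # unbiased noise
        if estimate >= threshold:
            return estimate, true_recall
    return None


def overestimate_rates(trials=20000, seed=0):
    """Compare how often a single estimate overstates recall with how often
    the first above-threshold estimate does."""
    rng = random.Random(seed)
    # Any one estimate taken in isolation: the noise is symmetric.
    single_over = sum(rng.gauss(0.0, 0.05) > 0 for _ in range(trials)) / trials
    # The estimate at which sequential testing stops.
    stops = [s for s in (first_crossing(rng) for _ in range(trials)) if s]
    stopped_over = sum(est > true for est, true in stops) / len(stops)
    return single_over, stopped_over


single, stopped = overestimate_rates()
print(f"single estimate overstates recall:        {single:.2f}")
print(f"first estimate above threshold overstates: {stopped:.2f}")
```

Under these assumptions the first figure sits near one half, while the second is well above it: stopping at the first crossing selects for the estimates that happened to err upward.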

By: Ralph Losey

Ralph Losey — Wed, 31 Jul 2013 12:35:12 +0000

Why do you say this? It seems like the second half of your sentence directly contradicts the first.

"Each estimate is as likely to overstate as it is to understate recall; but the estimate which first falls above the threshold is more likely to be an overestimate."

If the likelihood is equal, why would overestimation be more likely? See my logic problem? It would seem equally likely to be an underestimate.

By: william

william — Fri, 28 Jun 2013 21:16:33 +0000

Bill,

Hi! That is a very interesting idea. Applying it in practice would be complicated by the fact that the learning curve is highly variable, and the sample errors have different variances at different points. One could nevertheless achieve pragmatically useful smoothing from a line fitting exercise such as you suggest. Coming up with a strictly valid statistical estimate, though, would be much trickier.

By: Bill Dimm

Bill Dimm — Fri, 28 Jun 2013 21:08:08 +0000

Nicely done article and paper. I was actually just thinking about this issue last week. Doesn't your second graph also suggest a solution, at least in the case of your fourth testing regime, which is to fit a line/curve to the data and use the point where it crosses the threshold as the point estimate for the desired training set size? The uncertainty in the fit parameters would also give you the uncertainty in where the threshold is crossed, so you can find your 95% confidence point. Of course, it is less clear how well that would work in the three regimes contemplated in your paper since errors in adjacent points are not independent.
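The point-estimate half of this suggestion can be sketched as follows. This is an illustration only: it assumes the learning curve is roughly linear over the fitted range and that the errors are independent, which is not generally the case; the data points are made up, and `fit_line` and `crossing_point` are hypothetical names. It does not attempt the confidence bound.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossing_point(xs, ys, threshold):
    """Training set size at which the fitted line reaches the threshold."""
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        raise ValueError("fitted line never rises to the threshold")
    return (threshold - intercept) / slope


# Made-up recall estimates at increasing training set sizes.
training_sizes = [1000, 1500, 2000, 2500, 3000, 3500]
recall_estimates = [0.58, 0.66, 0.69, 0.77, 0.78, 0.86]
print(crossing_point(training_sizes, recall_estimates, threshold=0.75))
```

Because the fit uses all of the points rather than only the first one to exceed the threshold, a single lucky estimate has less influence on where training stops.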

By: william

william — Mon, 24 Jun 2013 16:50:18 +0000

Excellent comment, Gordon, thanks. You're absolutely correct: the certification sample and estimate must be done only once. The producing party therefore needs to estimate not simply, "how good is our production?", but "how confident are we that we will pass the certification sample?"

By: gvc

gvc — Mon, 24 Jun 2013 16:46:34 +0000

William,

I am happy to see that you and your colleagues have shed some light on an all-too-common fallacy in statistics -- one that is implicit in a large number of edisco protocols I've seen.

It should be noted that what you call certification sampling can only be done once, or else the same fallacy applies. If you think that you may have achieved 70% recall, and your sample shows otherwise, you can't just make a couple of tweaks and then sample again.

You can think of this as a sports playoff series, and a sample is a game. You can't just play games until you win one, then give yourself the cup. You can't even double down until you win; i.e. when you lose the first, go for best of three; if you lose two, go for best of five; etc. You have to fix the length of the series (1 game, 7 games, 9 games, etc.) in advance or it isn't fair.

The bottom line is that you need to have very good reason to think that you are done before you do your final sample.

cheers,
Gordon
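The playoff analogy above can be put in numbers. The sketch below is illustrative only: it assumes each certification sample is an independent draw of relevant documents, that the tweaks between samples leave true recall unchanged, and the particular figures (65% true recall, 200 sampled relevant documents, 140 needed to show 70% recall) are made up. `pass_probability` and `pass_within` are hypothetical names.

```python
from math import comb


def pass_probability(true_recall, needed_hits, sample_size):
    """Exact chance that one sample contains at least `needed_hits`
    retrieved documents out of `sample_size` relevant ones (binomial tail)."""
    return sum(
        comb(sample_size, k)
        * true_recall ** k
        * (1 - true_recall) ** (sample_size - k)
        for k in range(needed_hits, sample_size + 1)
    )


def pass_within(attempts, single_pass):
    """Chance of passing at least once if sampling is repeated until success."""
    return 1 - (1 - single_pass) ** attempts


# True recall is below the 70% target, so a pass is a false certification.
single = pass_probability(true_recall=0.65, needed_hits=140, sample_size=200)
for attempts in (1, 3, 5, 10):
    print(attempts, round(pass_within(attempts, single), 3))
```

The chance of a false pass climbs with every extra attempt, which is why the number of samples has to be fixed in advance.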