Well, to see it another way, think of it like this: each estimate is as likely to be an over-estimate as an under-estimate, but over-estimates are more likely to fall above the threshold than under-estimates (because over-estimates are, on average, higher than under-estimates).

"Each estimate is as likely to overstate as it is to understate recall; but the estimate which first falls above the threshold is more likely to be an overestimate."

If each estimate is equally likely to overstate or understate, why would the one above the threshold be more likely an overestimate? See my logic problem? It would seem equally likely to be an underestimate.
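The apparent paradox can be checked by simulation. The sketch below (my own illustration, not from the exchange above; the true recall of 70%, the 68% threshold, and the 100-document sample size are all hypothetical) draws unbiased sample estimates of recall. Unconditionally, an estimate is about as likely to land above the truth as below it; but if you keep sampling until an estimate clears the threshold, the accepted estimate overstates the truth well more than half the time.

```python
import random

random.seed(1)

TRUE_RECALL = 0.70   # hypothetical true recall of the production
THRESHOLD = 0.68     # hypothetical acceptance threshold
N = 100              # documents reviewed per sample
TRIALS = 20_000

def estimate():
    """One unbiased sample estimate of recall: review N docs, count hits."""
    hits = sum(random.random() < TRUE_RECALL for _ in range(N))
    return hits / N

# Unconditionally, over- and under-estimates are (roughly) equally likely.
all_estimates = [estimate() for _ in range(TRIALS)]
over_all = sum(e > TRUE_RECALL for e in all_estimates) / TRIALS

# But sample *until* an estimate clears the threshold, and the accepted
# estimate is biased upward: overestimates dominate.
accepted = []
for _ in range(TRIALS):
    e = estimate()
    while e < THRESHOLD:      # keep resampling until we "pass"
        e = estimate()
    accepted.append(e)
over_accepted = sum(e > TRUE_RECALL for e in accepted) / TRIALS

print(f"P(overestimate), single sample:      {over_all:.3f}")
print(f"P(overestimate), first past cutoff:  {over_accepted:.3f}")
```

Both kinds of estimate are individually unbiased; it is the stopping rule (accept the first one above the threshold) that selects for overestimates.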

Hi! That is a very interesting idea. Applying it in practice would be complicated by the fact that the learning curve is highly variable, and the sample errors have different variances at different points. One could nevertheless achieve pragmatically useful smoothing from a line-fitting exercise such as you suggest. Coming up with a strictly valid statistical estimate, though, would be much trickier.

I am happy to see that you and your colleagues have shed some light on an all-too-common fallacy in statistics -- one that is implicit in a large number of edisco protocols I've seen.

It should be noted that what you call certification sampling can only be done once, or else the same fallacy applies. If you think that you may have achieved 70% recall, and your sample shows otherwise, you can't just do a couple of tweaks and then sample again.

You can think of this as a sports playoff series, in which each sample is a game. You can't just play games until you win one, then award yourself the cup. You can't even double down until you win; i.e. if you lose the first game, go for best of three; if you lose two, go for best of five; and so on. You have to fix the length of the series (1 game, 7 games, 9 games, etc.) in advance, or it isn't fair.
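The playoff analogy can be quantified with a short simulation (again my own illustration; the 65% true recall, 70% certification cutoff, and five allowed retries are hypothetical). A single certification sample falsely passes a deficient production only occasionally, but allowing repeated samples until one passes inflates that false-certification rate dramatically.

```python
import random

random.seed(2)

TRUE_RECALL = 0.65   # hypothetical: the production actually falls short
CUTOFF = 0.70        # certify only if the sample estimate reaches 70%
N = 100              # documents reviewed per certification sample
TRIALS = 20_000

def passes():
    """One certification sample: does the estimate reach the cutoff?"""
    hits = sum(random.random() < TRUE_RECALL for _ in range(N))
    return hits / N >= CUTOFF

# One fixed sample: the false-pass rate is modest.
one_shot = sum(passes() for _ in range(TRIALS)) / TRIALS

# "Play until you win" with up to five samples: the deficient
# production gets certified far more often.
up_to_five = sum(any(passes() for _ in range(5))
                 for _ in range(TRIALS)) / TRIALS

print(f"False-pass rate, single sample:     {one_shot:.3f}")
print(f"False-pass rate, up to 5 samples:   {up_to_five:.3f}")
```

Each retry is another chance for sampling error to land in your favor, which is why the number of "games" must be fixed before the series starts.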

The bottom line is that you need to have very good reason to think that you are done before you do your final sample.

cheers,

Gordon