Please, let's do!

Hi! Yes, gosh, that is quite wrong. You can't just propagate the "plus or minus 5%" like that.

Using my offline calculator for recall confidence intervals, the correct interval on recall is [78.9%, 96.2%]; that is, roughly minus 10%, plus 7%.
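For readers who want to see how an interval of this kind can be obtained, here is a minimal sketch. It is not the calculator mentioned above, whose method is not described here. It assumes a simple Monte Carlo approach with uniform Beta priors on the two sampled proportions, and it assumes hypothetical sample counts: 316 responsive out of 395 in the produced sample (80%), and 4 responsive out of 395 in the withheld sample (the quoted 9,900 figure corresponds to about 4.3 sample documents, so no whole number matches it exactly). The function name `recall_interval` and all of its parameters are made up for illustration.

```python
import random


def recall_interval(prod_size, prod_hits, prod_n,
                    held_size, held_hits, held_n,
                    draws=20_000, level=0.95, seed=0):
    """Approximate interval on recall by simulation (illustrative only).

    Assumption: each sampled proportion gets a Beta posterior under a
    uniform prior. This is one of several reasonable methods and is not
    claimed to be the method behind the interval quoted in the text.
    """
    rng = random.Random(seed)
    recalls = []
    for _ in range(draws):
        # Plausible responsive rate among produced documents.
        p = rng.betavariate(prod_hits + 1, prod_n - prod_hits + 1)
        # Plausible responsive rate among withheld documents.
        q = rng.betavariate(held_hits + 1, held_n - held_hits + 1)
        found = prod_size * p    # responsive documents produced
        missed = held_size * q   # responsive documents withheld
        recalls.append(found / (found + missed))
    recalls.sort()
    tail = (1 - level) / 2
    lo = recalls[int(tail * draws)]
    hi = recalls[int((1 - tail) * draws) - 1]
    return lo, hi


# Hypothetical sample counts, see the assumptions stated above.
lo, hi = recall_interval(100_000, 316, 395, 900_000, 4, 395)
print(f"approximate 95% interval on recall: [{lo:.1%}, {hi:.1%}]")
```

The point of the sketch is that the interval is asymmetric and far wider than "plus or minus 5%", because almost all of the uncertainty comes from the handful of responsive documents in the withheld sample. It is not expected to reproduce the figures above exactly.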

Effectiveness estimation and reporting are rife with errors in e-discovery cases at the moment, including in recent and ongoing precedent-setting cases (not to name names). Both vendors and lawyers need to seek professional advice on quantitative matters in order to represent their clients defensibly. Given the ubiquity of statistical howlers, the side with such professional advice will enjoy an enormous advantage over their opponents.

William

"From a collection of 1 million documents, 100,000 documents were produced. A sample of 395 of the produced documents was taken, which showed that 80,000 of the produced documents were responsive, plus or minus 5%. A sample of 395 of the withheld documents was taken, which showed that 9,900 of the withheld documents were responsive, plus or minus 5%. Ergo, the recall was 80,000/89,900 = 89%, plus or minus 5%."

Correctly calculated (which, as you know, takes some effort), the margin of error on the recall is more like "plus or minus 50%," ten times worse than claimed, rendering the estimate practically useless.

The key problem with the reasoning above is that "9,900 documents, plus or minus 5%" is a horribly imprecise estimate. This becomes clear when you convert the percentage to a number of documents: 5% of the 900,000 withheld documents is 45,000 documents. What this estimate really says is "9,900 documents, plus or minus 45,000 documents." The margin of error is about four and a half times as large as the point estimate!
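The conversion from percentages to documents can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the scenario, and it assumes that "plus or minus 5%" means five percentage points of the population each sample was drawn from. The variable names are made up for illustration.

```python
# Figures quoted in the scenario.
collection_size = 1_000_000
produced = 100_000
withheld = collection_size - produced      # 900,000 documents

responsive_produced = 80_000               # point estimate, produced set
responsive_withheld = 9_900                # point estimate, withheld set

# Assumption: the margin is 5 percentage points of each sampled population.
margin_percent = 5
margin_produced_docs = produced * margin_percent // 100    # 5,000 documents
margin_withheld_docs = withheld * margin_percent // 100    # 45,000 documents

# How large is the margin relative to the withheld point estimate?
ratio = margin_withheld_docs / responsive_withheld

# Point estimate of recall from the quoted reasoning.
recall_point = responsive_produced / (responsive_produced + responsive_withheld)

print(f"withheld: {responsive_withheld:,} +/- {margin_withheld_docs:,} documents")
print(f"margin is {ratio:.1f} times the point estimate")
print(f"recall point estimate: {recall_point:.1%}")
```

Stated in documents, the same "5%" is a margin of 5,000 on the produced side and 45,000 on the withheld side, which is why it cannot simply be carried over to the recall figure.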

The moral of the story: if you are counting documents, state your point estimate and margin of error in terms of documents.
