Sampling with zero intent

A zero-intent sample is a sample which will satisfy our validation goal only if no positive examples are found in it. If we have a population (in e-discovery, typically a document set) where one in R instances is positive (one in R documents relevant), and we only want a one in Q probability of sampling a positive instance, then our sample size can be no more than R / Q.

In e-discovery, we're often dealing with collections or collection segments that have very few relevant documents in them. And not infrequently we need to prove that there are indeed very few relevant documents in these segments. For instance, to validate that our production has achieved adequate recall, we need to demonstrate that the number of found relevant documents exceeds the number of unfound relevant documents in the discard pile by some margin. Now, absolute proof can only be achieved by reviewing all documents in the discard pile (and even then we have to consider reviewer error); but probabilistic proof is achievable using a random sample. And with very large segments and very low richnesses, the proof we require can sometimes be provided only if the sample finds no relevant documents at all. I refer to this as "sampling with zero intent".

In designing zero-intent samples, the most important relationship is that between the sample size s and the richness of the segment (proportion of relevant documents within it) r, under the constraint that we want to have p confidence that the sample will not turn up any relevant documents. Assume p is pre-determined, say at 95%. For a given richness r, what is the largest sample size s that we can (as it were) risk? For a given sample size, what is the highest richness that is safe?

The exact answer to these questions involves solving the equation

    \[  p = (1 - r) ^ s \]
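
Both rearrangements of this equation are easy to compute directly. Here is a minimal Python sketch (the function names and the 1-in-1,000 illustration are mine, not from any particular tool) that solves the exact equation for the largest safe sample size given a richness, and for the highest safe richness given a sample size:

    import math

    def max_sample_size(richness: float, confidence: float = 0.95) -> int:
        """Largest s such that (1 - richness) ** s >= confidence."""
        # Solve confidence = (1 - r) ** s for s, then round down.
        return math.floor(math.log(confidence) / math.log(1.0 - richness))

    def max_richness(sample_size: int, confidence: float = 0.95) -> float:
        """Largest r such that (1 - r) ** sample_size >= confidence."""
        # Solve confidence = (1 - r) ** s for r.
        return 1.0 - confidence ** (1.0 / sample_size)

    # At richness 1 in 1,000 and 95% confidence, the exact bound is
    # floor(ln(0.95) / ln(0.999)) = 51 documents.
    print(max_sample_size(0.001))   # 51
    print(max_richness(500))        # ~0.0001026, i.e. about 1 in 9,750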

for the desired variable. A simple and reasonably accurate approximation, however, can be found from the observation that, for small r and s \ll 1/r,

    \[ (1 - r) ^ s \approx 1 - r s \]

(by taking the first two terms of the binomial expansion). Therefore:

    \[ 1 - p \approx r s \quad ; \]

a simple formula to solve indeed!
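
As a quick sanity check on the approximation (the (r, s) pairs below are my own illustrative values), the exact and approximate failure probabilities stay close so long as r s is well below 1, and drift apart as r s approaches 1:

    # Compare the exact failure probability 1 - (1 - r)**s with the
    # approximation r * s for a few illustrative (r, s) pairs.
    for r, s in [(0.0001, 100), (0.001, 50), (0.001, 500)]:
        exact = 1.0 - (1.0 - r) ** s
        approx = r * s
        print(f"r={r}, s={s}: exact={exact:.4f}, approx={approx:.4f}")
    # The last pair (r * s = 0.5) shows where the approximation breaks down.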

To express this in handy rule-of-thumb form, let's talk about richness as a rate r' = r ^ {-1} rather than a proportion r. That is, instead of saying "0.01% richness", let's say "1 relevant document for each 10,000 documents". Similarly, let's talk not about the probability p of successfully finding no relevant documents in the sample, but instead about the rate of failure, q' = 1 / (1 - p); that is, instead of "a 5% chance of failure", let's say "1 chance in 20 of finding a relevant document". This gives us:

    \[ s = r' / q' \quad  .\]

In words, the maximum sample size is the ratio between the rate of richness and the accepted rate of failure. So, for instance, if we expect that 1 in 10,000 documents are relevant, and we only want a 1 in 20 probability of finding a relevant document in the sample, then our sample size must be no greater than 500 = 10,000 / 20. (Putting this back into the binomial formula, we find the exact probability of failure, ignoring the finite population adjustment, is 4.9%.)
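
To tie the rule of thumb back to the exact binomial formula, here is a minimal check of the example above (1 relevant document per 10,000, and an accepted failure rate of 1 in 20); the variable names are mine, and as above the finite population adjustment is ignored:

    # Rule of thumb: maximum sample size = richness rate / failure rate.
    r_rate = 10_000   # 1 relevant document per 10,000 documents
    q_rate = 20       # accept 1 chance in 20 of finding a relevant document

    s = r_rate // q_rate                        # 500
    exact_failure = 1.0 - (1.0 - 1.0 / r_rate) ** s
    print(s, round(exact_failure, 3))           # 500 0.049, i.e. 4.9%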

By itself, creating a sample design to avoid finding relevant documents is not that interesting: simply make the sample as small as possible. We should, however, also be quoting some measure of confidence, such as a confidence interval, and that introduces the countervailing incentive to increase the sample size. I hope to discuss this constraint, particularly as it affects recall estimates, in my next post.
