## Post-stratification of a binomial population

Retrieval System A returns 100 documents. We sample 20, and find 10 relevant. We therefore estimate that System A's precision is 0.5, and that there are 50 relevant documents in the set System A returned. Let us refer to the number of relevant documents a system returns as the system's yield.

Subsequently, System B returns 50 documents, a subset of those returned by System A. Of these, 5 fall in the set we previously sampled for assessment, and all 5 are relevant. What do we estimate System B's yield to be?

Since every assessed document returned by System B is relevant, we might estimate System B's precision at 1.0, saying that all 50 documents returned by System B are relevant. But then what of the set returned by System A alone? Of these 50 documents, 15 have been assessed, and 5 found relevant. One in three of the documents in this subset seem relevant, or about 16.7 of the 50, so that we now estimate System A to have returned about 66.7 relevant documents. Without any further assessment, due solely to the post-hoc evaluation of a new system, the estimate of System A's yield has changed. This is undesirable: we don't want to have to restate the scores of all existing systems when we evaluate a new one against a static test collection.
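The arithmetic of this proportion-based approach can be checked with a short sketch. The function name `stratum_yield` and the variable names are illustrative assumptions, not part of any library; exact fractions are used so that rounding does not obscure the result.

```python
from fractions import Fraction


def stratum_yield(returned, assessed, relevant):
    """Scale a stratum's size by the precision observed in its assessed sample.

    Hypothetical helper for illustration only.
    """
    return returned * Fraction(relevant, assessed)


# Before System B is seen: one stratum, everything System A returned.
yield_a_before = stratum_yield(returned=100, assessed=20, relevant=10)

# After System B is seen: split System A's documents into two strata.
yield_both = stratum_yield(returned=50, assessed=5, relevant=5)     # A and B
yield_a_only = stratum_yield(returned=50, assessed=15, relevant=5)  # A alone

# System A's yield is now the sum over both strata.
yield_a_after = yield_both + yield_a_only

print("System B yield:", float(yield_both))
print("System A yield before:", float(yield_a_before))
print("System A yield after:", float(yield_a_after))
```

System A's estimate moves from 50 to 200/3, even though no new document has been assessed.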

Alternatively, we might work not from sampled proportions, but from sampling probabilities. Each of the documents in the set returned by System A had a 0.2 probability of being sampled. The same probability applies to each document in the subset returned by both System A and System B, since sampling was done without reference to System B. Using the rule that a sampled unit's contribution towards an estimate of a population total is the unit's value divided by its sampling probability, we would give each sampled relevant document a weight of 1 / 0.2 = 5. Each of the two subsets contains 5 sampled relevant documents, so we would say that the set returned by System A alone and the set returned by both System A and System B each have 25 relevant documents in them. System A's yield remains unchanged, as we would prefer; but aren't we ignoring the evidence we have about the precision and hence yield of System B?
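The weighting rule described here is commonly known as the Horvitz-Thompson estimator of a population total. A minimal sketch follows, assuming each assessed document is coded 1 if relevant and 0 otherwise; the function name `weighted_total` and the list names are hypothetical, chosen only for this example.

```python
from fractions import Fraction


def weighted_total(values, sampling_prob):
    """Sum each sampled unit's value divided by its sampling probability.

    Hypothetical helper for illustration only.
    """
    return sum(Fraction(v) / sampling_prob for v in values)


# 20 of System A's 100 documents were sampled, without reference to System B.
sampling_prob = Fraction(20, 100)

# Assessed documents, 1 = relevant, 0 = not relevant.
sample_both = [1] * 5                # returned by A and B: 5 assessed, 5 relevant
sample_a_only = [1] * 5 + [0] * 10   # returned by A alone: 15 assessed, 5 relevant

est_both = weighted_total(sample_both, sampling_prob)
est_a_only = weighted_total(sample_a_only, sampling_prob)
est_a = est_both + est_a_only

print("System B yield:", est_both)
print("System A alone:", est_a_only)
print("System A yield:", est_a)
```

Under this sketch, System A's estimate stays at 50 and System B's estimate is 25, which is half of what the proportion-based reasoning suggests.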