Correcting retrieval scores by ratio estimation

Search systems are evaluated by submitting queries to them and assessing the relevance of the documents they return. These relevance assessments (qrels, in the jargon) are expensive to create, though the expense can be amortised by reusing them in future experiments. However, subsequent runs on the same queries, either by a different system or by a modified version of the original one, may return new, and therefore unassessed, documents. We would prefer not to have to assess these new documents for every query after every run -- and ideally we would prefer not to have to assess them at all. Instead, standard practice is to assume that unassessed documents are irrelevant, or to excise them from the ranking; but the former method is biased against the new run, while the latter is biased in favour of it.

In our SIGIR 2009 paper, Score Adjustment for Correction of Pooling Bias, Laurence Park and I treat the problem of correcting the evaluation bias against (or for) new systems in the presence of unassessed documents as an instance of ratio estimation. The idea of ratio estimation is simple and attractive: correct the error of a large but approximate survey by performing a detailed analysis of a subset of the results. The detailed analysis gives us an idea of the ratio between the true and approximate values in our survey, which can be used to adjust our broader results. Ideally, we get the accuracy of the exact method with the efficiency of the approximate one.
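In its classical survey form, the idea can be sketched as follows (illustrative code only; the names are mine, not the paper's). Approximate values are available for every unit, exact values only for a random subsample, and the exact-to-approximate ratio observed on the subsample rescales the approximate mean:

    def ratio_estimate(approx, exact_subsample):
        """Classical ratio estimator: rescale an approximate survey mean by
        the exact-to-approximate ratio observed on a fully measured subsample.

        approx          -- dict: unit id -> cheap, approximate value (all units)
        exact_subsample -- dict: unit id -> exact value (sampled units only)
        """
        ratio = (sum(exact_subsample.values())
                 / sum(approx[u] for u in exact_subsample))
        approx_mean = sum(approx.values()) / len(approx)
        return ratio * approx_mean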

For system evaluation, the approximate values are the scores achieved on each query without full relevance assessment, while the detailed analysis of a subset takes the form of fully assessing certain queries. Assume, for instance, we are evaluating a new retrieval method against 1,000 queries for which we have some existing relevance assessments. We randomly select, say, 20 of these queries, and assess all the previously unseen documents returned by our new system. For these 20 queries, we are able to calculate the system's true score, as well as the score it would have achieved with the fresh documents unassessed. The mean difference between true and incomplete scores across these 20 queries can then be added to the (approximate) scores the system achieves on the other 980, partially-assessed queries. In this way, the bias of having unassessed documents is corrected.
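As a rough sketch of that procedure (hypothetical code, not taken from the paper), suppose score_incomplete and score_full are stand-in functions that score a query with unassessed documents treated as irrelevant and after full assessment, respectively:

    import random

    def corrected_scores(queries, score_incomplete, score_full, k=20, seed=0):
        """Correct per-query scores for unassessed documents via a sampled gap.

        queries          -- list of all query ids (e.g. 1,000 of them)
        score_incomplete -- query id -> score with unassessed documents
                            treated as irrelevant (cheap; known for all queries)
        score_full       -- query id -> true score after assessing all new
                            documents (expensive; called only for the sample)
        k                -- number of queries to assess fully (e.g. 20)
        """
        rng = random.Random(seed)
        sampled = set(rng.sample(queries, k))

        # Mean gap between true and incomplete scores on the assessed sample.
        mean_gap = sum(score_full(q) - score_incomplete(q) for q in sampled) / k

        # Sampled queries keep their true scores; all others get their
        # incomplete score shifted by the estimated gap.
        return {q: score_full(q) if q in sampled
                   else score_incomplete(q) + mean_gap
                for q in queries}

The mean of the returned scores is then the bias-corrected estimate of the system's mean score.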

Ratio estimation is an appealing idea, but it only helps us if our estimates are consistently slanted either high or low from the true values, and additionally if the approximation errors are less variable than the true scores. Fortunately, in the case of unassessed documents, the error is uniformly in one direction or the other -- too low if we assume unassessed documents are irrelevant, too high if we exclude them -- and true scores are highly variable between different queries. Because of this, the method we propose significantly decreases the error in the mean scores of new systems, reducing it to as little as a quarter of the original amount with only 20 fully-assessed queries. This is a satisfying result for such a simple and robust approach.
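A toy simulation, with entirely made-up numbers rather than the paper's data, shows why those two conditions matter: the per-query errors all point the same way and are tightly clustered, so 20 queries pin down the mean gap far more precisely than they pin down the mean score itself:

    import random
    import statistics

    random.seed(1)

    # Synthetic true scores: highly variable from query to query.
    true_scores = [random.uniform(0.1, 0.9) for _ in range(1000)]
    # Synthetic error from treating unassessed documents as irrelevant:
    # always downward, and much less variable than the scores themselves.
    approx_scores = [t - random.uniform(0.03, 0.07) for t in true_scores]

    # Estimate the gap from 20 fully-assessed queries and correct the mean.
    sample = random.sample(range(1000), 20)
    mean_gap = statistics.mean(true_scores[i] - approx_scores[i] for i in sample)

    true_mean = statistics.mean(true_scores)
    print(abs(statistics.mean(approx_scores) - true_mean))             # roughly 0.05
    print(abs(statistics.mean(approx_scores) + mean_gap - true_mean))  # much smaller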

Of course, a more complicated inference could be attempted. Our method is deliberately blunt in that it applies the same adjustment to the score for every query, no matter how many documents are unassessed. If we only care about mean scores, this does not matter. If, however, we want to estimate the true scores of individual queries, then a more precise adjustment would be desirable. The relationship between the number of unassessed documents and the degree of adjustment to make is not linear, though. We might make twice the adjustment for two unassessed documents as for one; but if the system returns an anomalously large number of unassessed documents, that is itself evidence that the system is doing badly on that query. One could go a step further still, for instance by attempting to infer the relevance of each unjudged document. But our approach is attractive for its simplicity, its robustness, and the fact that it requires inferential assumptions no stronger than those of basic sampling theory.
