Ad-hoc retrieval: measurably going nowhere

For our CIKM paper this year, we surveyed the results of information retrieval experiments on TREC collections reported over the last decade. We found that there has been no demonstrated improvement in ad-hoc retrieval performance over this time. We also found that researchers are publishing results that fall well below the best TREC runs, and using baselines that are below even the median TREC systems -- a combination which allows them to continually report statistically significant improvements without actually getting any better. Given this, how proud should the information retrieval community really be of its much-vaunted empirical rigour?

The TREC effort has provided researchers working in information retrieval with standard test collections and methodologies. In principle, standardization allows results to be compared between groups and across time. In practice, however, systematic comparisons are confined to TREC itself. The annual TREC experiments reveal the performance of different systems against a single test collection at a particular point in time. But there is no longitudinal analysis, at TREC or elsewhere, of how retrieval techniques perform against that collection subsequently. It is true that many results are published each year against historical TREC collections. But authors rarely compare their scores against previously published results, or even against the original TREC systems; and no-one is collecting these published scores. So it is difficult to answer questions like "is retrieval technology improving over time?" or "which is the most effective retrieval technique?", even though these would seem the most obvious and important questions that standard test environments should allow us to answer.

To address the question of the recent progress of retrieval technology, and to observe what use was being made of the TREC collections outside of TREC itself, we collated all results reported against TREC ad-hoc-style collections in SIGIR since 1998 and in CIKM since 2004, arranging them by the collection they were achieved on and the year they were reported. Our findings are contained in "Improvements That Don't Add Up: Ad-Hoc Retrieval Results Since 1998", which is to be presented at CIKM in November of this year.

The following figure for the TREC 8 AdHoc collection gives the gist of our findings; another eleven collections are analysed in the paper, with similar results.

[Figure: Scores reported on the TREC 8 AdHoc test collection]

We found that there has been no consistent improvement in the retrieval scores reported on ad-hoc collections over the past decade. Indeed, for some collections, the mean score in the first five years is higher than the mean score in the second. Very few published results beat the score of the best original TREC system, even where that system was run almost a decade before. Experimental baselines, too, do not improve over time, and are generally uncompetitive, often falling below the median of the original TREC systems.
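To give a concrete, if artificial, sense of the kind of longitudinal summary involved, here is a minimal sketch in Python. The scores, years, and TREC reference values below are invented for illustration; they are not the data from the paper.

```python
# Toy summary of published scores for one hypothetical collection.
# All numbers are invented for illustration only.
from statistics import mean

# (year published, reported MAP) for papers using the collection
reported = [(1999, 0.26), (2001, 0.28), (2003, 0.25), (2004, 0.27),
            (2006, 0.24), (2007, 0.26), (2008, 0.25)]

# (year published, MAP of the baseline each paper compared against)
baselines = [(1999, 0.21), (2003, 0.20), (2006, 0.19), (2008, 0.20)]

TREC_BEST, TREC_MEDIAN = 0.33, 0.24   # scores of the original TREC runs

first_half = [s for y, s in reported if y <= 2003]
second_half = [s for y, s in reported if y > 2003]

print("mean reported MAP, first five years:  %.3f" % mean(first_half))
print("mean reported MAP, second five years: %.3f" % mean(second_half))
print("results beating the best TREC run: %d of %d"
      % (sum(s > TREC_BEST for _, s in reported), len(reported)))
print("baselines below the TREC median:   %d of %d"
      % (sum(s < TREC_MEDIAN for _, s in baselines), len(baselines)))
```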

There are a couple of points that arise from our study. First, it seems that ad-hoc retrieval technology has not improved over the past decade, at least not as measured on the TREC collections. This finding is not entirely unexpected; the ad-hoc track of TREC was discontinued after TREC 8 in 1999 because it was felt that retrieval performance had reached a plateau. Apparently, we are still travelling along that plateau. Why we are doing so -- why no progress is being made (or at least demonstrated) in ad-hoc retrieval -- is another question. Retrieval results are hardly perfect. Does there need to be a fundamental breakthrough in technology? Or are the results observed in fact the best that can be achieved, given the inherent ambiguity and underspecification of the retrieval task as presented in the TREC collections?

A second question raised by the study concerns the SIGIR review and publication process itself. The IR community prides itself on its insistence upon the experimental validation of theoretical results. Essentially all of the papers surveyed (and there are over a hundred of them) performed experimental validation, and over a third claimed statistically significant improvements over their chosen baselines. Yet almost none of the methods published actually led to a demonstrated improvement over the existing state of the art; that is, over the best TREC system. Where statistical significance was achieved, it was generally over a weak, vanilla baseline. What function, then, is experimental validation performing here? Are researchers going through the motions, and reviewers giving them the requisite tick for doing so? Or worse, is there a perverse selection bias going on here, where researchers who use competitive baselines fail to achieve significant improvements and so have their papers rejected (or indeed don't even submit their results in the first place), whereas those using weaker baselines achieve significance more readily and have their papers accepted?
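For readers outside IR, the significance claims in question are typically the result of a paired test over per-topic effectiveness scores against the author's chosen baseline, and say nothing about how the method compares to the best known systems. A minimal sketch of such a test, with invented per-topic average precision values and scipy assumed to be available:

```python
# Sketch of the usual significance claim: a paired test over per-topic
# average precision, new method vs. the author's chosen baseline.
# A "significant" gain over a weak baseline can still sit well below
# the best original TREC run.  Scores here are invented.
from scipy.stats import ttest_rel, wilcoxon

weak_baseline = [0.18, 0.22, 0.10, 0.31, 0.25, 0.14, 0.20, 0.27]  # per-topic AP
new_method    = [0.21, 0.24, 0.13, 0.33, 0.28, 0.15, 0.23, 0.29]

t_stat, p_t = ttest_rel(new_method, weak_baseline)
w_stat, p_w = wilcoxon(new_method, weak_baseline)

print("paired t-test:        p = %.4f" % p_t)
print("Wilcoxon signed-rank: p = %.4f" % p_w)
```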

14 Responses to “Ad-hoc retrieval: measurably going nowhere”

  1. Jon says:

    Great post -- I'm looking forward to reading the full paper.

    Without actually having read it yet, however, there are several reasons not to compare to the best-performing group's numbers from TREC.

    The best-performing groups at TREC often use specialized techniques to do well (see our Wikipedia query expansion @ the 2007 Blog track). Participants frequently take a kitchen-sink approach to their submissions, focusing on doing well rather than constructing a well-controlled experiment. This often works really well if you want to do well at TREC, but does muddy the waters when trying to evaluate what works and what doesn't in a particular set of submissions. For this reason, using a simplified baseline for something like a SIGIR submission is appropriate -- and results in a more easily understandable paper, with clear indications of features or techniques that are actually helpful.

  2. I think it's a natural consequence of the No IR Research Left Behind act! :-)

    Seriously, I'm reminded of a presentation NIST's Donna Harman delivered at RIAO 2007, entitled "It's Time To Move On", in which she showed an old slide illustrating a plateau in the performance of TREC systems. I interpreted it as a need to move beyond TREC and embrace HCIR, but I realize others might have derived different conclusions.

  3. FD says:

    Very nice work

    I agree with Jon's points.

    Additionally,

    1. For various reasons, researchers often take subsets of the official TREC corpora (e.g. Robust-FR). Just making sure you accounted for this in the analysis.

    2. Along the lines of Jon's comment, although previous runs may have higher performance, they may also have had more manually tuned parameters than more modern systems. I don't think that this means abandoning the sort of comparison you conducted. However, there may be cases where performance is equivalent (or lower) and you adopt the simpler model.

    3. I encourage any (general) algorithm for reranking of baseline results to use evaluatIR to demonstrate generalizability to realistic baselines. I assume you can download the actual run files from here. I remember there was a hoop to jump through at some point to get them from NIST.

  4. william says:

    Thanks to all for the comments, which are much appreciated; with the sort of meta-analysis that we are attempting in this paper, understanding how others both interpret the analysis from above, and perceive the individual inputs from below, is very valuable. I'll respond in parts below.

  5. william says:

    On Ferdinand's point 1, about variants on the standard TREC test collections:

    Yes, we were careful to note when researchers were using variant collections, and we recorded them separately. One of the amusing second-order findings of the survey was the enormous number of different variant collections that were in fact used. In the 106 papers that met our survey criteria, 83 different variant TREC collections were used -- almost as many collections as papers! We discuss this in the second-last paragraph of Section 3; see also Figure 2. Most of the variants involved slicing and dicing the TIPSTER corpus and associated topic sets: in some cases, to increase the number of topics (which required excluding parts of the document corpus that were not shared between topic sets); in other cases, to reduce the document corpus size to make computationally intensive methods practical.

  6. william says:

    On Ferdinand's point 3, the use of EvaluatIR:

    We certainly take the results observed here as an argument for using EvaluatIR or a similar service that centrally records and reports retrieval results. We face the practical issue now of whether and how to incorporate results reported in previous publications, for which we don't have per-topic scores, let alone the runs themselves.

    For the TREC runs, no, it is not possible to download them from EvaluatIR; they can only be obtained from NIST upon signing an agreement regarding their use. We have also, at NIST's request, disabled serving out per-topic scores for TREC systems on EvaluatIR, although calculations based on these scores (such as significance tests) are available, and per-topic scores are still shown on graphs. The issue is with permissions of use; we'd have to individually ask the groups that submitted to TREC whether we can make their runs available, and we haven't gotten around to doing that.

    That said, we do use EvaluatIR ourselves extensively simply as a convenient way of looking up the (per-collection) scores achieved on different collections by TREC systems.

  7. william says:

    On Jon's and Ferdinand's point on why the best TREC systems are not reasonable baselines, because of the high degree of manual tuning etc. that goes into them:

    This is the really important question that arises out of the paper. What is a reasonable baseline? Similarly, how much effort should researchers have to put into improving their baseline systems, before they implement and test their innovation?

    The argument we make in the paper is to note that a method which improves some vanilla system might not improve, or indeed might even harm, a state-of-the-art one. In the paper, we do a proof-of-concept experiment to demonstrate this, based on toggling options for Indri. Therefore, if the experimenter's claim is that they are presenting a method that improves retrieval performance over what is currently known, then they have an obligation to test it on state-of-the-art systems. Or, alternatively (and this I think is the really superior method), they need to test their method on a number of different baselines, to demonstrate how generally beneficial it is; but here too, their mix of systems must include ones that implement state-of-the-art techniques. These baselines should, for instance, include methods such as query expansion and proximity operators, which have been shown to lead to improved effectiveness, at least on some collections. It seems that instead what people tend to do is to take something like plain BM25 as their baseline.

    It could also be noted that while the best TREC runs may be subject to extensive manual tuning, they face disadvantages of their own compared to subsequent experimental runs. First, TREC systems don't have access to the topics in advance, whereas subsequent runs do. (Yes, I know that direct use should not be made of this advantage, but I believe that it does confer an indirect advantage.) Second, TREC systems don't have the knowledge of which methods have actually worked on a particular collection, which subsequent systems do have.

    All of that said, I'm prepared to grant that the best TREC system may not always be an appropriate baseline. But if so, shouldn't the best (or at least an improved) previously published method be the correct baseline choice? The fact that neither baselines nor "improved" scores trend up over time indicates that this is not the experimental approach that has been adopted -- and that whatever approach it is that has been adopted, that approach is not leading to an improvement in effectiveness.
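As a toy illustration of the multi-baseline check William describes above (not the Indri experiment from the paper): assuming per-topic average precision scores are available for each baseline run and for the proposed method layered on top of it, the comparison reduces to a per-baseline paired test. All run names and scores below are invented.

```python
# Toy check: does the proposed method still help when layered over
# progressively stronger baselines?  Per-topic AP scores are invented;
# in practice they would come from trec_eval-style output per run.
from statistics import mean
from scipy.stats import ttest_rel

runs = {
    # baseline name: (baseline per-topic AP, method-over-baseline per-topic AP)
    "plain BM25":              ([0.18, 0.22, 0.10, 0.31, 0.25],
                                [0.22, 0.25, 0.13, 0.33, 0.27]),
    "BM25 + expansion":        ([0.24, 0.27, 0.15, 0.35, 0.29],
                                [0.25, 0.28, 0.16, 0.34, 0.30]),
    "BM25 + exp. + proximity": ([0.27, 0.29, 0.17, 0.37, 0.31],
                                [0.26, 0.30, 0.16, 0.36, 0.32]),
}

for name, (base, method) in runs.items():
    deltas = [m - b for b, m in zip(base, method)]
    stat, p = ttest_rel(method, base)
    print("%-25s mean delta %+.3f  (paired t-test p = %.3f)"
          % (name, mean(deltas), p))
```

The toy numbers are arranged so that the apparent gain shrinks, and then vanishes, as the baseline strengthens -- which is exactly the failure mode the multi-baseline comparison is meant to expose.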

  8. Jon says:

    s/Ferdinand/Fernando/g

  9. Sérgio Nunes says:

    Very interesting post and discussion.

    I tend to view TREC as an "evaluation platform" for researchers, not as a competition. However, I agree that a global perspective on the progress over the years is needed -- maybe NIST could play a role in this...

  10. [...] demonstrating long-term gains in system performance. This interesting analysis is summarized in a blog post by William [...]

  11. Le Zhao says:

    Interesting paper & discussion.

    There are actually two questions relevant to the discussion,

    1. How should a baseline be chosen?

    A baseline is there to help the experimenter understand what worked, and why. So I believe all baselines should be customized to the particular task under study: e.g. all ad hoc retrieval models should be compared with BM25 and the Dirichlet LM; all phrase models should be compared with a dependency model; all pseudo-relevance feedback models should be compared with Rocchio and the relevance model.

    Now comes the second question:
    2. What if the guideline is not followed -- how do we ensure that standard?

    Well, it's really a moral question, though there are practical means of ensuring the quality of papers.

    Generally it's the authors' responsibility to make an effort to create a reasonable baseline, and it's the reviewers' responsibility to assess whether that result is believable, and whether it will work for the state of the art.

    But there will always be cases where such guidelines are not followed; the paper *can* still be accepted if the reviewers believe its contribution to be significant. However, I believe the proportion of such work in conferences should be kept low.
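For readers who want the first two baselines Le Zhao names pinned down, here is a rough sketch of one common parameterisation of each -- Okapi BM25 and Dirichlet-smoothed query-likelihood -- with placeholder collection statistics and parameter values (not recommendations):

```python
# Rough sketch of two standard ad-hoc baselines: Okapi BM25 and
# Dirichlet-smoothed query-likelihood language modelling.
# Statistics and parameter values below are placeholders.
import math

def bm25_term(tf, df, doc_len, avg_doc_len, N, k1=1.2, b=0.75):
    """BM25 contribution of one query term appearing tf times in a document."""
    idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
    length_norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / length_norm

def dirichlet_term(tf, doc_len, cf, coll_len, mu=2500.0):
    """Dirichlet-smoothed log-probability of one query term given a document."""
    p_coll = cf / coll_len                   # term's collection frequency ratio
    return math.log((tf + mu * p_coll) / (doc_len + mu))

# A document's score for a query is the sum of per-term contributions.
# Example: term seen 3 times in a 300-word document, in a collection of
# 500,000 documents / 150 million words, with df=1,200 and cf=40,000.
print(bm25_term(tf=3, df=1200, doc_len=300, avg_doc_len=400, N=500000))
print(dirichlet_term(tf=3, doc_len=300, cf=40000, coll_len=150000000))
```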

  12. [...] do complex work, the quality of that work can be measured, and progress made. (Of course sometimes progress isn’t cumulative, but that’s a different [...]

  13. [...] by x%”) rather than look at novel problems (see #1 above). However, while doing so, there is no evidence of long-term, cumulative progress in decades of publications. On the other hand, I continue to miss [...]

  14. [...] that significance has occurred because of a weakness in the experimental setup (for instance, choosing a weak baseline in the evaluation of an IR retrieval technique), or even conscious manipulation of results by the [...]
