EvaluatIR: An Online Tool for Evaluating and Comparing IR Systems

The information retrieval research community has a strong tradition of empirical evaluation, stretching back to Cyril Cleverdon's experiments in the library of the College of Aeronautics at Cranfield in the late 1950s and early 1960s. The Cranfield studies established both the community's empirical tradition and its standard experimental methodology -- all the more remarkable because the studies predate computerization, having been performed with index cards and manual retrieval! The Cranfield methodology has as its main apparatus a fixed test collection, made up of a document corpus, a set of queries, and judgments as to which documents are relevant to which queries. Early test collections were quite small; the Cranfield corpus was made up of only 1,400 abstracts. Also, different groups tended to use different collections, making it difficult to compare results and measure progress in IR science over time.
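For readers who have not worked with a test collection before, the apparatus is easy to picture in code. The sketch below uses made-up queries, documents, and judgments (none of it is Cranfield data) to show how relevance judgments and a system's ranked list combine to give a score such as precision at rank k.

```python
# Minimal sketch of Cranfield-style evaluation, with hypothetical data.
# qrels: for each query, the set of documents judged relevant.
qrels = {
    "Q1": {"doc3", "doc7", "doc12"},
    "Q2": {"doc5"},
}

# run: for each query, the system's ranked list of retrieved documents.
run = {
    "Q1": ["doc7", "doc1", "doc3", "doc9", "doc12"],
    "Q2": ["doc2", "doc5", "doc8"],
}

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

for qid in qrels:
    print(qid, precision_at_k(run[qid], qrels[qid], k=5))
```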

The problems of scale and standardization were addressed by the Text REtrieval Conference, or TREC, founded in 1992. Sponsored by the US government through the National Institute of Standards and Technology (NIST), and involving research groups from around the world, TREC has the resources to develop large-scale, publicly available test collections, and the prominence to promote these collections as the standard resources for public IR experimental evaluation. TREC also introduced a novel component -- the participation of dozens of research teams and their systems in competitive retrieval experiments. The annual TREC experiments became the proving ground for many new retrieval techniques. They also played an important role in the formation of the test collections themselves. With corpora holding millions of documents, it is obviously impractical to assess every document for relevance to every query. Instead, the top-ranking results returned by the experimental systems are used to form a pool of documents likely to be relevant. Only this pool is judged, cutting down assessment effort to practicable levels while still maintaining acceptable reusability of the collections.
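Pooling itself is a simple idea, and a rough sketch makes it concrete. The ranked lists below are invented, and the real TREC process involves per-track pool depths and NIST assessors that are not modelled here; this is an illustration of the technique, not TREC's tooling.

```python
# Sketch of depth-k pooling over several systems' runs for one query.
def pool(runs_for_query, depth=100):
    """Union of the top-`depth` documents from each system's ranked list."""
    pooled = set()
    for ranked_list in runs_for_query:
        pooled.update(ranked_list[:depth])
    return pooled

# Example: three systems' ranked lists for a single query (hypothetical).
runs = [
    ["d1", "d2", "d3", "d4"],
    ["d3", "d5", "d1", "d6"],
    ["d7", "d2", "d8", "d9"],
]
print(sorted(pool(runs, depth=3)))  # only these documents go to the assessors
```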

TREC has provided standard, large-scale collections, and these collections are widely used in research; a good dozen papers at each SIGIR conference report results on TREC collections. However, although these standard collections allow for longitudinal comparison of retrieval results over time, such comparison is in fact rarely done. The reason is that the necessary information is hard to get at. A researcher, reviewer, or reader wishing to compare against previously published results must manually search through past papers -- a time-consuming task. Even the scores of the original TREC runs take some effort to retrieve. There is no standard mechanism at all for storing and retrieving results that are not part of a published paper. And where mean scores are available, per-query scores almost invariably are not, preventing a researcher from checking their results for statistical significance against established benchmarks. This means that researchers are thrown back upon reimplementing old baselines, while readers and reviewers face the time-consuming job of comparing reported baselines and claimed improvements against existing achievements.
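To make the point about per-query scores concrete: with the baseline's score for each topic in hand, a significance check is a few lines of code; without them, it cannot be done at all. The numbers below are invented, and the paired t-test (via SciPy) merely stands in for whichever test one prefers, such as the Wilcoxon signed-rank test.

```python
# Sketch: paired significance test over per-query scores (hypothetical data).
from scipy import stats

# Per-query average precision for a published baseline and a new system,
# on the same topics, in the same order.
baseline = [0.21, 0.35, 0.10, 0.44, 0.27, 0.52, 0.18, 0.39]
new_sys  = [0.25, 0.33, 0.15, 0.47, 0.30, 0.55, 0.17, 0.45]

t, p = stats.ttest_rel(new_sys, baseline)
print(f"paired t-test: t = {t:.3f}, p = {p:.3f}")
```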

The lack of longitudinal analysis of IR results is a serious deficiency in research method, and a strange one, given that the hard work of forming the standard collections and providing a reference set of runs against them has already been done by TREC. To remedy this deficiency, we at the University of Melbourne have developed EvaluatIR.org (and by "we", I mean Tim Armstrong). EvaluatIR.org is an online database of runs made against standard test collections -- initially just the TREC ones, with more to be added later. We have populated the site with the original TREC runs, and with a number of runs that we have made against the TREC collections using publicly available retrieval systems. Already the site is a useful tool for comparing published and experimental results against the TREC reference runs. Our goal is for it to become a standard place for researchers to upload published runs, and possibly unpublished ones too. More importantly, by putting the previous results achieved against a test collection a couple of links away on EvaluatIR.org, rather than leaving them buried in dozens of phone-book-sized conference proceedings, we hope to encourage a research culture of comparing new results against established reference points, and thus truly ensure cumulative and measurable progress in IR technology.

We will be running a demo of EvaluatIR at this year's SIGIR conference. You can have a look at our extended abstract for the demo. But best of all, try out the site itself. We welcome comment on the site and on our ambitions for it. And once it is finished, we strongly encourage researchers to upload their runs to it. We hope that EvaluatIR will become a standard resource for the community, and continue information retrieval's strong tradition of empirical research.
