RIA Dataset

Alistair passed me a copy of the latest issue of Information Retrieval, which is devoted to reports from the Reliable Information Access workshop. The workshop was run in 2003, and the reports are being published in 2009, so we are not discussing breaking news here. Still, the concept of the workshop was very interesting: invite a dozen leading information retrieval research groups to a six-week, on-site experiment employing seven different retrieval systems, to tackle (broadly speaking) the question of why information retrieval technology is not improving. There were two specific subtasks: an intensive failure analysis of why retrieval systems, individually and collectively, performed poorly on certain topics and not on others; and a multi-dimensional exploration of the effectiveness, limits, commonalities, and differences of pseudo-relevance feedback techniques, one of the few promising general-purpose retrieval techniques that go beyond keyword matching.

A particular feature of the workshop is the online data archive that resulted from it. Again, this may have been up for six years now for all I know, but it is only being publicised now, and I haven’t seen the data widely used (except by RIA participants). The online dataset contains all of the failure analyses, and all of the experimental runs made, almost 2,000 of them (save for a few broken links, perhaps deliberately broken). The runs alone are potentially a valuable meta-evaluative resource, albeit of a peculiarly specialised variety, since they are elaborated from a handful of base systems by various permutations and combinations of relevance feedback.

The most striking feature of the dataset, though, is that it is there at all, more or less complete; not just runs, but reports, configurations, notes, and more. Making experimental data readily available is one of the areas where information retrieval, outside of pioneering community efforts like TREC, lags behind other areas of research. The availability and completeness of the dataset could be seen as all the more impressive given the elapsed time between the workshop and the publication of its reports. The trick is that the workshop organisers made a very wise decision: build a website for use in data sharing and report collection during the experiments themselves, and then simply make the collected data (perhaps after some cleaning up) publicly available. Perhaps such research wikis should be more widely employed.
