Evaluating keyword search in databases

I was recently invited to contribute a short article to a special issue of the IEEE Data Engineering Bulletin on keyword search in databases. Since database keyword search is not an area I have worked on previously, I decided the most worthwhile contribution I could make was to survey evaluation practice in the area, and compare it to what is done in mainstream information retrieval.

Keyword search in databases is not what I originally thought it was. That is, it is not full-text search on individual text fields. Rather, the search is done over the content of whole objects in the database, in the form of (joined) tuples. And since there are many ways that tuples can be joined in a database, the immediate questions are how to realize these joins, what level of object granularity to treat as the unit of retrieval, and what effect tuple structure has on the similarity score that matches queries to results.
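To make the joined-tuple view concrete, here is a minimal sketch in Python, using a toy schema and data that I have invented purely for illustration (it is not any particular system's algorithm). The retrieval unit is an author-paper tuple produced by a join, and the query matches only because its keywords are spread across the two base tuples.

import sqlite3

# Toy DBLP-like schema, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (aid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE paper  (pid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE writes (aid INTEGER, pid INTEGER);
    INSERT INTO author VALUES (1, 'E. F. Codd');
    INSERT INTO paper  VALUES (1, 'A Relational Model of Data for Large Shared Data Banks');
    INSERT INTO writes VALUES (1, 1);
""")

def keyword_search(keywords):
    """Return joined author-paper tuples whose combined text contains every keyword."""
    rows = conn.execute("""
        SELECT a.name, p.title
        FROM author a JOIN writes w ON a.aid = w.aid
                      JOIN paper  p ON w.pid = p.pid
    """)
    results = []
    for name, title in rows:
        text = (name + " " + title).lower()
        # Naive conjunctive matching over the joined tuple's text; a real system
        # would also score the result, for instance penalising longer join paths.
        if all(k.lower() in text for k in keywords):
            results.append((name, title))
    return results

# Neither keyword occurs in a single base tuple; the match exists only at the join level.
print(keyword_search(["codd", "relational"]))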

The field of keyword search on databases (or, more generally, on structured data) is still relatively new; the initial work in the area was done in 2002. Among the aspects of the field still being developed are its evaluation resources and methodology. As I describe in my article, different research groups develop their own test sets. A few structured collections are commonly used, such as DBLP. But almost invariably, each research group specifies its own retrieval task, creates its own test queries (or selects them from a log), and performs its own evaluation. Perhaps as a result, a common pattern is that each paper reports almost perfect effectiveness, doubling or tripling that of the previous method -- which itself had achieved almost perfect effectiveness, doubling or tripling that of its predecessor.

As I suggest in my article, such results point to a lack of objectivity in the evaluation process. In particular, if a research group has set out to solve one particular problem in keyword search (for instance, the matching of keywords to schema terms), then they are likely to formulate test tasks that illustrate the problem they are trying to solve. Such targeted evaluations have their place, but what is needed in addition is a common set of retrieval tasks, intended to be representative of general retrieval needs, which serves as a neutral benchmark against which each group's methods can be tested. That is to say, what is needed is a public test collection, the core evaluation tool in IR.
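To illustrate what a shared test collection enables, here is a toy Python sketch, with queries, relevance judgments, and system rankings all invented for the example: two hypothetical systems are scored on the same queries and judgments using a standard IR measure, average precision. Comparisons of this kind only mean something when the tasks and judgments are common to every group being compared.

def average_precision(ranking, relevant):
    """Average precision of one ranked result list against a set of relevant ids."""
    hits, precision_sum = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant) if relevant else 0.0

# Shared relevance judgments (qrels) for two invented queries.
qrels = {"q1": {"d1", "d4"}, "q2": {"d2"}}

# Rankings produced by two hypothetical systems on the same queries.
runs = {
    "system_A": {"q1": ["d1", "d2", "d4"], "q2": ["d3", "d2"]},
    "system_B": {"q1": ["d3", "d1", "d4"], "q2": ["d2", "d1"]},
}

for system, run in runs.items():
    scores = [average_precision(run[q], qrels[q]) for q in qrels]
    print(system, "MAP =", round(sum(scores) / len(scores), 3))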
