The session track of TREC

This year's TREC conference had several interesting sessions, and not the least interesting of them were the planning sessions for next year's tracks. The design of a collaborative retrieval task, and of the methods and measures for evaluating such a task, can provoke a more wide-ranging, philosophical discussion than the presentation of retrieval results, or the description of a narrowly technical research outcome.

One such interesting design discussion occurred in the planning meeting for the Session Track, which is entering its second year at TREC, under the direction of Ben Carterette and Evangelos Kanoulas (EDIT 20/12/2010: and Paul Clough and Mark Sanderson). As Ben and Evangelos expressed it in their overview, evaluation in IR is split between two extremes: the once-off, contextless, test collection evaluation of the Cranfield tradition, narrow but repeatable; and the open, flexible, but expensive and non-repeatable evaluation of user studies. The goal of the Session Track is to bridge the system--user gap, starting in small steps from the Cranfield end of the spectrum. Specifically, the track aims to evaluate retrieval in response not simply to an ad-hoc, contextless user query, but to a query that takes place in the context of a session; and, what is more difficult, to design the evaluation and test set in a way that is repeatable and reusable.

The approach in the first year was to provide participants with not just a single query for each information need, but also with a follow-up reformulation of that query: a generalization, a specialization, or a parallel or drifting reformulation. The track participant is then to produce three result sets: a response to the original query; a response to the reformulated query in isolation; and a response to the reformulated query in light of the original. Evaluation similarly involves assessing the two runs independently, and then in combination. Combined evaluation uses a metric such as session nDCG, which assesses the second run in terms of what it adds to the results of the first.
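Session nDCG builds on the session DCG of Järvelin et al. (2008), which discounts a document's gain both by its rank within a result list and by how late in the session its query occurs, and then normalizes by an ideal session's score. Here is a minimal sketch of the unnormalized quantity, assuming logarithmic rank and query discounts with bases `b` and `bq` (the exact TREC 2010 formulation may differ, and `session_dcg` is my own naming):

```python
import math

def session_dcg(session_gains, b=2, bq=4):
    """Unnormalized session DCG: session_gains is a list of result lists,
    one per query in session order, each a list of relevance gains.
    Documents are discounted by rank within their list (base b), and each
    list is further discounted by its query's position in the session
    (base bq), so results for later reformulations contribute less."""
    total = 0.0
    for j, gains in enumerate(session_gains, start=1):
        query_discount = 1 + math.log(j, bq)        # 1 for the first query
        for i, gain in enumerate(gains, start=1):
            rank_discount = 1 + math.log(i, b)      # 1 at rank 1
            total += (2 ** gain - 1) / (rank_discount * query_discount)
    return total
```

Normalizing this by the score of an ideal session yields the nDCG variant; the key property for what follows is that a relevant document retrieved for the first query is worth more than the same document retrieved after reformulation.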

As was pointed out at the results summary session for the track, this evaluation methodology has some problems. One problem was observed in the run report of my colleague at the University of Melbourne, Vo Ngoc Anh. In developing his run (which I was not involved with), he hypothesized that the user reformulated their query because they were dissatisfied with the results of the original formulation. Therefore, in Anh's submission, documents returned in the first result list (including below the top ten) were actively deprecated in the second. As it turned out, Anh's first run was at or near the strongest amongst the participants, but his second and combined runs demonstrated a sharp drop in effectiveness. In the event, not only the assessment of relevance, but also the construction of the session nDCG metric, ran precisely contrary to his assumptions: assessor and metric alike evaluated the second run not as an alternative to, but as a continuation of, the first. But if that were the case, why would the user reformulate?
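As a concrete (and entirely hypothetical) illustration of the deprecation idea, a second run could simply demote any document already returned for the first query; this sketch is my own reconstruction, not Anh's actual method:

```python
def deprecate_seen(second_run, first_run):
    """Re-rank the second run so that documents already returned in the
    first run (at any depth) drop below all unseen documents.  Runs are
    ranked lists of document ids; the sort is stable, so relative order
    within the seen and unseen groups is preserved."""
    seen = set(first_run)
    return sorted(second_run, key=lambda doc: doc in seen)  # False sorts first
```

Under an assessment that treats the second run as a continuation of the first, a heuristic along these lines is precisely the wrong move: it pushes down the very documents the metric still rewards.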

Contradictions in the treatment of the second response are emblematic of a broader problem with the design of the track as it stands: the reformulation of the second query is independent of the results retrieved for the first. The response to the first query could be gibberish, or it could be an excellent answer to that first query; the retrieval system could say "I'm a teapot, I'm a teapot, I'm a teapot", over and over; it could even provide in advance an oracular response to the query's reformulation; no matter: the reformulated query will be asked regardless.

Two solutions to the problem of an invariant reformulated query were proposed. The first was a rather ambitious plan to calculate a probability distribution over possible reformulations given the response to the original. The second solution was much more straightforward: capture the original query, the response, and the reformulation, then have the test system respond to the reformulation in the context of the original query and its canned response. This method can readily be extended to requiring a response to the nth query given queries 1 through n - 1 and their responses.

More interesting than the particular proposals for the next iteration of the track, though, was the debate over the value of the track itself. Two main objections were raised by the audience. The first was that even with the addition of the context of original query and response, the setup failed to capture enough of the richness of a true user's context; and, by extension, that richer retrieval environments simply couldn't be captured by a test collection. The second objection was that the session track was trying to solve, with very limited resources and data, a problem that search engines were tackling with far more of both, and presumably with some success.

Despite the force of these objections, the Session Track still seems to me an interesting and worthwhile project. Evaluation by test collection is deeply ingrained in the community, for both good and not-so-good reasons, and as a result it is frequently the case that the test set that frames a problem comes first, while serious consideration of the problem follows later. An example of this is the recent, and overdue, interest in result diversification. The idea that the best response to an ambiguous query is a diversity of results was pointed out long ago, and is obvious enough once raised; but it took the introduction of the Diversity Task of the Web Track and its data set to really concentrate the community's attention on the problem. Even so seemingly small a change in evaluation setup has led to a fruitful re-evaluation of a range of existing questions in IR, from evaluation metrics to the use of pseudo-relevance feedback, from query difficulty prediction to topic and sense identification. (Indeed, a good way of generating new research ideas is to ask, "what existing results need to be reconsidered in the context of query diversity?") In addition, history has shown that well-formed test collections are employed for an enormous variety of tasks beyond those for which they were originally designed. The Session Track, however limited its aims may seem in anticipation, and however difficult its task of capturing user context, has the potential for a similarly profound impact on the field.

6 Responses to “The session track of TREC”

  1. [...] 14, 2010 by lifidea William Webber’s post on TREC Session Track got me thinking about what it means to evaluate a user’s session, or an [...]

  2. Evangelos Kanoulas says:

    First, thank you for the comprehensive overview and summary of the discussions at TREC on the session track. You are absolutely right to say that a fixed reformulation, independent of the returned results, is not the right thing in the construction of a test collection. Nevertheless I'd like to make two comments on the issue. First, I think that a fixed, independent reformulation is still useful, up to a point, in testing whether one can enhance retrieval performance by considering the user's past queries (without any assumption as to what led to the reformulation). Maybe this is not possible, but nevertheless it is worth investigating. Second, the question of why a user reformulates is complicated. I'm not expecting reformulations to be due only to a list of non-relevant documents. Thus, any assumptions should in general be well thought out.

    Having said that, I certainly agree that the session track needs to be carefully designed, and we should learn from this year's mistakes so we won't repeat them. We had better make new ones.

  3. william says:


    Hi! Thanks for your remarks. They raise an important question: what experimental tasks can the 2010 collection be re-used for? Jinyoung Kim has a helpful figure in which he suggests that the only missing link is between the first response and the reformulation. The absence of this link prevents the system from reasoning about what the reformulation says about the original results. Nevertheless, all other links are present, so that the system is able to reason about how to respond in its second response to the overall information need provided by the original and reformulated query. The system can also do so in a way that is not redundant to the first response--although I'm sceptical that this can be done more effectively than simply by not returning documents that have already been seen.

    If the task to which the collection is suited is responding to the overall information need as realized after the reformulation, then perhaps we need a metric other than session nDCG, which places most emphasis upon the original response. Perhaps in fact in re-using the collection we should drop the first response altogether, and just give the system the original and reformulated query. But how much better can the system do in such a circumstance than simply responding directly to the reformulated query?

  4. Danny Calegari says:

    Don't you mean "the Cranfield paradigm"?

  5. Evangelos Kanoulas says:

    Clearly one could disambiguate queries. Here is a quick example by Nick Craswell: "US holidays" -> "thanksgiving" -> "turkey". But indeed it is not at all clear how and when one should use past queries. In any case, any research over the 2010 collection should be done with no assumptions regarding the first response. Otherwise any conclusions may be an artifact of the way we created the sessions.

  6. Jinyoung Kim says:

    I believe the absence of the first response would not be a big problem, given that systems can still do interesting things using the relationship between the original and reformulated queries. As Evangelos pointed out, disambiguation is an obvious example, and I think this can be related to several papers on search personalization which try to exploit previous queries to improve the current one (e.g. Predicting Short-term Interests Using Activity-Based Search Contexts -- Paul Bennett et al. in CIKM'11).

    Even if the response to the initial query by 'some retrieval system' is given to participants, it's not clear how they might use the information. As one possible option, I can imagine that they could compare their output on the second query with the given response to the initial query, to infer the intention behind the reformulation. However, this may not be easy, considering that the first response comes from a different system than the participant's.
