Jeff Dalton beat me to the punch with his overview of the SIGIR Workshop on the Future of Information Retrieval Evaluation. Like Jeff, I'm interested in Mark Smucker's suggestion of a collaborative evaluation task that places human performance prediction at the centre of IR evaluation. This is an approach that I've been impressed with in the work of Andrew Turpin and Falk Scholer on the relationship between system metrics and user experience (warning: ACM DL subscribers only; come on guys, ACM allows authors to put up their own versions, don't be lazy!): rather than simply asking users nebulous questions like "how satisfied were you with your search results?", actually give them a task to perform and see how well they perform it.
However, is TREC really the forum for such a task, as Mark is proposing? Much as I enjoy visiting suburban Gaithersburg's fine malls and bus stations, I'm not sure that collaborative work like this still needs the framework of official sponsorship, a yearly cycle, and an annual, physical conference -- which is a polite way of saying that I'm sure it doesn't. Perhaps we should marry Mark's proposal with the suggestion at the same workshop by Wei Che Huang, Andrew Trotman, and Shlomo Geva that evaluation tasks be run continuously and online (and our own modest proposal to similar effect). We could even incorporate the parallel investigations of Omar Alonso and Stefano Mizzaro on the one hand, and Gabriella Kazai and Natasa Milic-Frayling on the other, into using crowdsourcing as a substitute for traditional assessors. Task-based assessment would seem well-suited to crowdsourcing: at least in some circumstances, you already know what the correct answer is, so cleaning worker data is less of an issue; and frequently such tasks ask a "how quickly?" question, which fits nicely with the Turker's natural desire to complete tasks as quickly as possible.