My PhD thesis was passed (actually a few months ago), and I've placed it online. The core of the research material has been published elsewhere, but there are a few updates:
Score standardization (Chapter 4): The chapter on standardization combines Score Standardization for Robust Comparison of Retrieval Systems (ADCS 2007) and Score Standardization for Inter-Collection Comparison of Retrieval Systems (SIGIR 2008). I've added a components of variance analysis (Section 4.1). The variance in a matrix of system by topic scores can be divided into a system component (due to difference in system quality), a topic component (due to difference in topic difficulty), and a system-topic interaction (due to some systems doing better on some topics than others). In comparative evaluation, we want to measure the system component; the system-topic interaction can be of analytical interest; but the topic component is just noise. The effect of standardization is to remove the topic component, and slightly reduce the system-topic interaction. The former makes unpaired comparisons much stronger; the latter marginally improves (I think) paired comparisons, too.
Components of variance analysis is a useful tool, one that summarizes information about score stability and comparability more handily and transparently than discriminative power, swap rate, and various other ad-hoc solutions found in the field. Evangelos Kanoulas and Jay Aslam use a similar components of variance method as an objective function in their CIKM 2009 paper, Empirical Justification of the Gain and Discount Function for nDCG, although they (I think unnecessarily) also pull in the number of topics as a moderating factor, in line with Generalizability Theory.
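To make the decomposition concrete, here is a minimal sketch in Python. It is not the estimator used in the thesis: it simply splits the total sum of squares of a system-by-topic matrix into system, topic, and interaction shares, and standardizes each topic using the mean and standard deviation over the same systems (that is, it assumes the reference systems and the evaluated systems coincide). The function names and the toy scores are invented for illustration.

```python
from statistics import mean, pstdev


def variance_shares(scores):
    """Split the total sum of squares of a system-by-topic score matrix
    (rows = systems, columns = topics) into system, topic, and
    system-topic interaction shares. A simplified sum-of-squares view,
    not a full variance-components estimator."""
    n_sys, n_top = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    sys_means = [mean(row) for row in scores]
    top_means = [mean(scores[s][t] for s in range(n_sys)) for t in range(n_top)]

    ss_sys = n_top * sum((m - grand) ** 2 for m in sys_means)
    ss_top = n_sys * sum((m - grand) ** 2 for m in top_means)
    ss_int = sum(
        (scores[s][t] - sys_means[s] - top_means[t] + grand) ** 2
        for s in range(n_sys)
        for t in range(n_top)
    )
    total = ss_sys + ss_top + ss_int
    return {
        "system": ss_sys / total,
        "topic": ss_top / total,
        "interaction": ss_int / total,
    }


def standardize(scores):
    """z-score every topic column using the mean and standard deviation
    of that topic's scores over all systems (assumed reference set)."""
    n_sys, n_top = len(scores), len(scores[0])
    out = [[0.0] * n_top for _ in range(n_sys)]
    for t in range(n_top):
        column = [scores[s][t] for s in range(n_sys)]
        mu, sd = mean(column), pstdev(column)
        for s in range(n_sys):
            out[s][t] = (scores[s][t] - mu) / sd if sd > 0 else 0.0
    return out


# Invented toy data: three systems (rows) on three topics (columns),
# with one hard, one medium, and one easy topic.
toy_scores = [
    [0.10, 0.50, 0.80],
    [0.20, 0.55, 0.90],
    [0.05, 0.40, 0.85],
]

before = variance_shares(toy_scores)
after = variance_shares(standardize(toy_scores))
print("raw:         ", before)
print("standardized:", after)
```

On the toy data the topic share dominates the raw scores and drops to zero (up to rounding) after standardization, since every topic then has the same mean. What remains is split between the system component and the interaction.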
There are still open questions for standardization. One issue is the effect of outliers, both in the reference and the standardized systems. Another is the unrepresentative standardization factors that too narrow a set of reference systems can give you. Other researchers have suggested the use of more robust statistics, such as the median and inter-quartile range rather than the mean and standard deviation, along with smoothing priors. I'd like to say this is "for future work", but I'm not sure that I have the energy to come back to this future work myself.
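For what it's worth, the median and inter-quartile range variant is easy to sketch. The code below is only an illustration of that suggestion, not something evaluated in the thesis: the function names and reference scores are made up, smoothing priors are left out entirely, and the quartiles use the default method of Python's `statistics.quantiles`.

```python
from statistics import mean, median, pstdev, quantiles


def z_standardize(score, reference_scores):
    """Conventional standardization: mean and standard deviation
    of the reference systems' scores on one topic."""
    return (score - mean(reference_scores)) / pstdev(reference_scores)


def robust_standardize(score, reference_scores):
    """Robust variant: centre on the median and scale by the
    inter-quartile range of the reference systems' scores."""
    q1, _, q3 = quantiles(reference_scores, n=4)
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("reference scores have zero inter-quartile range")
    return (score - median(reference_scores)) / iqr


# Invented reference scores for a single topic; 0.95 is an outlier.
reference = [0.30, 0.32, 0.35, 0.33, 0.31, 0.95]

print("mean/sd:   ", z_standardize(0.40, reference))
print("median/IQR:", robust_standardize(0.40, reference))
```

With these numbers, a score of 0.40 beats five of the six reference systems, yet comes out below average under the mean and standard deviation because the outlier drags the mean up. The median-based version places it above the centre. With so few reference systems the outlier still leaks into the upper quartile, which is part of why a narrow reference set remains a problem either way.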
Statistical Power in Retrieval Evaluation (Chapter 5): The chapter on statistical power is based on our CIKM 2008 paper, Statistical Power in Retrieval Experimentation. That paper was a bit of a rolic (romp+frolic), enabled by my then innocence of sequential analysis. Having in the meantime lost my innocence of this sixty-year-old field of statistics, I've toned down my methodological recommendations in the thesis chapter, and spent more time examining the power of the standard 50-topic TREC collection. The conclusion: as Sparck Jones and van Rijsbergen said back in 1975, "less than 75 [topics] are of no real value; 250 are minimally acceptable".
Score Adjustment for Pooling Bias (Chapter 6): Chapter six is based on our SIGIR 2009 paper, Score Adjustment for Correction of Pooling Bias. I have made a concerted (and, I hope, successful) effort to render this, one of my most impenetrable papers, a little more palatable.
A Similarity Measure for Indefinite Rankings (Chapter 7): This chapter is taken from the Rank-Biased Overlap (RBO) paper, published in ACM TOIS, Volume 28 (November 2010). The chapter is very close to the paper.
Remainder: I was quite happy with how the historical background chapter worked out, although it illustrates the desirability of writing history only after all the people involved have died: I make some generalizations that contemporaries might find sweeping. Still, the history of the field is interesting, and (I suspect) under-appreciated. How many IR practitioners realize, for instance, that the SMART system of the 1960s had a grammar parser, and supported queries on sentence structure (only to find that it didn't improve effectiveness)?
I'm also pleased, on a lesser note, with my treatment of kernel density estimates in Section 3.3.7, particularly the use of density reflection to handle sharp boundaries like the [0, 1] limits of most IR metrics (something which R, for instance, doesn't handle by default).
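The reflection trick is simple enough to sketch in a few lines. The version below is a minimal illustration, not the code from the thesis: it assumes a Gaussian kernel, a fixed hand-picked bandwidth, and a single reflection at each boundary, and the function names and sample scores are invented. Each data point contributes its own kernel plus mirror images about 0 and about 1, so the mass that would leak outside [0, 1] is folded back in.

```python
import math


def gaussian_kernel(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def plain_kde(x, data, bandwidth):
    """Ordinary kernel density estimate, ignoring the boundaries."""
    return sum(gaussian_kernel((x - d) / bandwidth) for d in data) / (
        len(data) * bandwidth
    )


def reflected_kde(x, data, bandwidth, lo=0.0, hi=1.0):
    """Kernel density estimate with reflection at lo and hi: each point d
    also contributes mirror images at 2*lo - d and 2*hi - d."""
    if not lo <= x <= hi:
        return 0.0
    total = 0.0
    for d in data:
        for point in (d, 2.0 * lo - d, 2.0 * hi - d):
            total += gaussian_kernel((x - point) / bandwidth)
    return total / (len(data) * bandwidth)


def integrate(f, lo, hi, steps=2000):
    """Trapezoidal rule, used only to check how much mass lies in [lo, hi]."""
    h = (hi - lo) / steps
    inner = sum(f(lo + i * h) for i in range(1, steps))
    return h * (0.5 * f(lo) + inner + 0.5 * f(hi))


# Invented per-topic scores, bunched near the boundaries as metric
# scores often are.
sample = [0.0, 0.02, 0.05, 0.1, 0.4, 0.9, 1.0]
h = 0.1  # bandwidth chosen by hand for the illustration

print("mass in [0, 1], plain:    ",
      integrate(lambda x: plain_kde(x, sample, h), 0.0, 1.0))
print("mass in [0, 1], reflected:",
      integrate(lambda x: reflected_kde(x, sample, h), 0.0, 1.0))
```

The plain estimate places a sizeable share of its mass outside [0, 1], which drags the density down near the boundaries. The reflected estimate integrates to very nearly one over the interval, provided the bandwidth is small relative to the interval width.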