My PhD thesis was passed (actually a few months ago), and I've placed it online. The core of the research material has been published elsewhere, but there are a few updates:

**Score standardization (Chapter 4)**: The chapter on standardization combines Score Standardization for Robust Comparison of Retrieval Systems (ADCS 2007) and Score Standardization for Inter-Collection Comparison of Retrieval Systems (SIGIR 2008). I've added a components of variance analysis (Section 4.1). The variance in a matrix of system by topic scores can be divided into a system component (due to differences in system quality), a topic component (due to differences in topic difficulty), and a system--topic interaction (due to some systems doing better on some topics than others). In comparative evaluation, we want to measure the system component; the system--topic interaction can be of analytical interest; but the topic component is just noise. The effect of standardization is to remove the topic component and slightly reduce the system--topic interaction. The former makes unpaired comparisons much stronger; the latter marginally improves (I think) paired comparisons, too.
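
As a sketch of the idea (the function names and the synthetic matrix in the usage note are mine, not from the thesis), the decomposition can be estimated from a system-by-topic score matrix with the standard expected-mean-squares method for a two-way layout without replication:

```python
import numpy as np

def variance_components(scores):
    """Estimate system, topic, and system--topic variance components
    from a (systems x topics) matrix of effectiveness scores, using
    expected mean squares for a two-way layout without replication."""
    n_sys, n_top = scores.shape
    grand = scores.mean()
    ss_sys = n_top * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_top = n_sys * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_res = ((scores - grand) ** 2).sum() - ss_sys - ss_top
    ms_sys = ss_sys / (n_sys - 1)
    ms_top = ss_top / (n_top - 1)
    ms_res = ss_res / ((n_sys - 1) * (n_top - 1))
    return {
        "system": max((ms_sys - ms_res) / n_top, 0.0),
        "topic": max((ms_top - ms_res) / n_sys, 0.0),
        "system:topic": ms_res,  # interaction, confounded with error
    }

def standardize(scores):
    """Standardize each topic (column) by the mean and standard
    deviation of the reference systems' scores on that topic."""
    return (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
```

Standardizing each topic zeroes the column means, so the topic component of the standardized matrix vanishes (up to rounding), which is exactly the effect described above.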

Components of variance analysis is a useful tool, one that summarizes information about score stability and comparability more handily and transparently than discriminative power, swap rate, and various other ad-hoc solutions found in the field. Evangelos Kanoulas and Jay Aslam use a similar components of variance method as an objective function in their CIKM 2009 paper, Empirical Justification of the Gain and Discount Function for nDCG, although they (I think unnecessarily) also pull in the number of topics as a moderating factor, in line with Generalizability Theory.

There are still open questions for standardization. One issue is the effect of outliers, both in the reference and the standardized systems. Another is that too narrow a set of reference systems can yield unrepresentative standardization factors. Other researchers have suggested using more robust statistics, such as the median and inter-quartile range rather than the mean and standard deviation, along with smoothing priors. I'd like to say this is "for future work", but I'm not sure that I have the energy to come back to it myself.
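
To illustrate the outlier concern (all the numbers here are invented for the illustration), compare mean/standard-deviation standardization with a median/inter-quartile-range version when one reference system is anomalous:

```python
import numpy as np

# Reference systems' scores on one topic; 0.95 is an outlier.
ref = np.array([0.20, 0.25, 0.30, 0.95])
new_score = 0.30  # a new system's score on the same topic

# Classical standardization: the outlier drags the mean above 0.30,
# so the new system looks below average on this topic.
z_classical = (new_score - ref.mean()) / ref.std(ddof=1)

# Robust standardization: the median and IQR barely move, so the new
# system looks (slightly) above the typical reference score.
q75, q25 = np.percentile(ref, [75, 25])
z_robust = (new_score - np.median(ref)) / (q75 - q25)
```

The two standardizations disagree even about the sign of the standardized score, which is the sort of instability that makes a narrow or outlier-ridden reference set worrying.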

**Statistical Power in Retrieval Evaluation (Chapter 5)**: The chapter on statistical power is based on our CIKM 2008 paper, Statistical Power in Retrieval Experimentation. That paper was a bit of a rolic (romp+frolic), enabled by my then innocence of sequential analysis. Having in the meantime lost my innocence of this sixty-year-old field of statistics, I've toned down my methodological recommendations in the thesis chapter, and spent more time examining the power of the standard 50-topic TREC collection. The conclusion: as Sparck Jones and van Rijsbergen said back in 1975, "less than 75 [topics] are of no real value; 250 are minimally acceptable".
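
The flavour of the calculation can be sketched with a textbook power computation for a two-sided paired t-test (the effect size of 0.2 standard deviations below is my own illustrative assumption, not a figure from the chapter):

```python
from scipy import stats

def paired_t_power(effect_size, n_topics, alpha=0.05):
    """Power of a two-sided paired t-test to detect a mean per-topic
    difference of `effect_size` (in units of the standard deviation
    of the per-topic differences) across `n_topics` topics."""
    df = n_topics - 1
    delta = effect_size * n_topics ** 0.5   # noncentrality parameter
    crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
    # Probability the noncentral t statistic lands in the rejection region.
    return (1 - stats.nct.cdf(crit, df, delta)
            + stats.nct.cdf(-crit, df, delta))
```

For a small effect of 0.2, 50 topics leaves power well under one half, while 250 topics lifts it above the conventional 0.8, roughly in line with the quoted recommendation.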

**Score Adjustment for Pooling Bias (Chapter 6)**: Chapter six is based on our SIGIR 2009 paper, Score Adjustment for Correction of Pooling Bias. I have made a concerted (and, I hope, successful) effort to render this, one of my most impenetrable papers, a little more palatable.

**A Similarity Measure for Indefinite Rankings (Chapter 7)**: This chapter is taken from the Rank-Biased Overlap (RBO) paper, published in ACM TOIS, Volume 28 (November 2010). The chapter is very close to the paper.

**Remainder** I was quite happy with how the historical background chapter worked out, although it illustrates the desirability of writing history only after all the people involved have died: I make some generalizations that contemporaries might find sweeping. Still, the history of the field is interesting, and (I suspect) under-appreciated. How many IR practitioners realize, for instance, that the SMART system of the 1960s had a grammar parser, and supported queries on sentence structure (only to find that it didn't improve effectiveness)?

I'm also pleased, on a lesser note, with my treatment of kernel density estimates in Section 3.3.7, particularly the use of density reflection to handle sharp boundaries like the [0, 1] limits of most IR metrics (something which R, for instance, doesn't handle by default).
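
For the curious, the reflection trick itself is only a few lines. This sketch (function names and the sample data are mine) reflects the sample about both boundaries of [0, 1] before evaluating a Gaussian KDE:

```python
import numpy as np

def kde_gaussian(grid, sample, bw):
    """Plain Gaussian kernel density estimate evaluated on `grid`."""
    u = (grid[:, None] - sample[None, :]) / bw
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(sample) * bw * np.sqrt(2 * np.pi))

def kde_reflected(grid, sample, bw, lo=0.0, hi=1.0):
    """Gaussian KDE with density reflection at the boundaries: mass
    that would leak past `lo` or `hi` is folded back in by reflecting
    the sample about each boundary. Note that the denominator still
    uses the original, unaugmented sample size."""
    augmented = np.concatenate([sample, 2 * lo - sample, 2 * hi - sample])
    u = (grid[:, None] - augmented[None, :]) / bw
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(sample) * bw * np.sqrt(2 * np.pi))
```

Unlike the plain estimate, the reflected density integrates to (essentially) one over [0, 1], and doesn't halve itself at the endpoints.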

Congratulations! Looking forward to having the material all in one place. Chapter 2 looks great.

Incidentally, I've always found the reflection method to be sort of a hack for all but a few applications. Is there a better justification for IR metrics than "it's simpler compared to alternatives"? It may not matter in practice. Just curious.

Congratulations!

There's certainly a wealth of literature on different forms of boundary adjustment for kernel density estimates. Another approach is to use special, asymmetric kernels as you approach the boundary. I spent a fair bit of time reading about them, then decided it was too much work, especially since I was using KDEs purely as a presentational format, without trying to do inference on them; and the reflection method is really simple!

In terms of what would be most appropriate for score distributions in IR -- well, I guess you need to establish an objective standard, and then measure different approximations against that standard. If you had access to a very large number of effectiveness scores -- if, say, you worked at an operational search engine that performed large-scale, assessment-based evaluation -- then I guess you'd take your full set of scores (perhaps smoothed with some random jittering), then sub-sample, apply your KDE and suite of boundary adjustment methods, and see which one gave the least departure from the "true" distribution. I wonder which would make a bigger difference in this case, your base kernel or your adjustment method?
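
A toy version of that experiment might look like the following (everything here is invented for illustration: the Beta(1, 3) "truth", the sample size, the bandwidth, and the seed):

```python
import numpy as np

rng = np.random.default_rng(1)
bw = 0.05
grid = np.linspace(0.0, 1.0, 201)
true_pdf = 3 * (1 - grid) ** 2          # Beta(1, 3) density: nonzero at the 0 boundary

sample = rng.beta(1.0, 3.0, size=500)   # the sub-sample to be smoothed

def kde(points, data, n=None):
    """Gaussian KDE; `n` lets the reflected call divide by the
    original, unaugmented sample size."""
    n = len(data) if n is None else n
    u = (points[:, None] - data[None, :]) / bw
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (n * bw * np.sqrt(2 * np.pi))

plain = kde(grid, sample)
# Reflection about 0 and 1, folding leaked mass back into [0, 1].
reflected = kde(grid, np.concatenate([sample, -sample, 2.0 - sample]),
                n=len(sample))

# Mean squared departure from the "true" density over the grid.
mse_plain = ((plain - true_pdf) ** 2).mean()
mse_reflected = ((reflected - true_pdf) ** 2).mean()
```

On draws like this one, the reflected estimate tracks the boundary value f(0) = 3 far better than the plain estimate, which roughly halves itself there; swapping in other base kernels and adjustment methods would answer the question above.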

Hi William,

My heartfelt congratulations on acceptance of your thesis! It looks like a very interesting read.

Hi William,

Can you provide the derivation of Equation 32 from Equation 23? I tried several things, but was unable to derive Equation 32.

Ritesh

I should have mentioned that I am talking about the RBO paper, "A Similarity Measure for Indefinite Rankings".

Ritesh,

Hi! I wouldn't try to derive it from Equation 23. Instead, you should go back to the derivation of Equation 23 from Equation 7 and Equation 9 with X_k / k instead of X_k / d. The expression $((1 - p) / p) \sum_{d=k+1}^\infty p^d (X_k / k)$ goes simply to $X_k/k \cdot p^k$ using the geometric series. The rightmost term of Equation 32 is the generalization of $X_k / k$ for uneven lists. Then the left term is simply summing up the per-ranking weighted overlaps from rank 1 to the depth of the longer list, adjusting for the gap between shorter and longer lists.
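
For what it's worth, the geometric-series step written out in full is:

$$\frac{1-p}{p} \sum_{d=k+1}^{\infty} p^d \, \frac{X_k}{k} \;=\; \frac{X_k}{k} \cdot \frac{1-p}{p} \cdot \frac{p^{k+1}}{1-p} \;=\; \frac{X_k}{k} \, p^k,$$

using $\sum_{d=k+1}^{\infty} p^d = p^{k+1} / (1-p)$ for $0 < p < 1$.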

Sorry, I don't have my original working in front of me. If this is still unclear, I'll have a go at reworking it over the weekend and put the working up somewhere for you.

William

Hi William,

Thanks. Actually, I was able to derive much of the equation. The only part that's unclear to me is how you compute A_l. The last part of Equation 32 is ((X_l - X_s)/l + X_s/s). As I understand it, this extrapolates the average intersection beyond the last element of the longer list. However, I am not able to understand how you got that. By the way, I really enjoyed the paper. It was long but really interesting.

Thanks

Ritesh,

Hi! Sorry for taking such a long time to reply: holidays and deadlines intervened.

Consider the following two uneven, incomplete lists, with S evaluated only to depth 2 and L to depth 3:

S: a, b

L: a, c, b

X_1 is 1, since both lists begin with "a". Therefore A_1 is 1 / 1 = 1. X_2 is also 1 ("a" being the only element common to {a, b} and {a, c}), making A_2 be 1 / 2 = 0.5. What though is X_3? We know that "b" is a member of {a, b}, but is "?" -- the unseen third element of S -- a member of {a, c, b}? We don't know. Instead, we give it a putative membership of A_2; that is, of 0.5. Therefore, X_3 is X_2 + 1 + 0.5 = 2.5, and A_3 is 2.5 / 3 = 0.833. And indeed, with the seen overlaps X_s = 1 and X_l = 2, the last term of Equation 32 gives (X_l - X_s) / l + X_s / s = (2 - 1) / 3 + 1 / 2 = 0.833. This then works out to the formula shown in Equation 32 (I hope! -- at least it does every time I recreate the working for this formula).
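
If it helps, here is a small sketch of Equation 32 as I read it (the function name and the example calls are mine, not from the paper):

```python
def rbo_ext(S, L, p):
    """Extrapolated rank-biased overlap (Equation 32) for two
    incomplete rankings evaluated to possibly different depths."""
    if len(S) > len(L):
        S, L = L, S
    s, l = len(S), len(L)
    # X[d-1] = size of the overlap of the two known prefixes at depth d;
    # for d > s, this is the overlap of all of S with L's first d items.
    X = [len(set(S[:d]) & set(L[:d])) for d in range(1, l + 1)]
    X_s, X_l = X[s - 1], X[l - 1]
    term1 = sum(X[d - 1] / d * p ** d for d in range(1, l + 1))
    # Putative overlap for S's unseen ranks between depths s and l.
    term2 = sum(X_s * (d - s) / (s * d) * p ** d for d in range(s + 1, l + 1))
    # Extrapolated agreement beyond depth l, weighted by the residual p^l.
    tail = ((X_l - X_s) / l + X_s / s) * p ** l
    return (1 - p) / p * (term1 + term2) + tail
```

As a sanity check, identical rankings score 1 for any 0 < p < 1, and rbo_ext(['a', 'b'], ['a', 'c', 'b'], 0.9) comes to 0.82.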