Having an article in submission sharpens one's appreciation for situations to which the article applies; and the ACM's leisurely eighteen-month-plus turnaround on journal articles gives plenty of time for that appreciation to develop. Since early 2009, it has seemed to me that every third retrieval paper either has, or should have, compared non-conjoint rankings for similarity. The most common instance of such rankings are the document lists produced by different retrieval methods, but numerous other examples appear: the frequency-ordered vocabulary of different corpora; the most-common queries from different logs or periods; the weighted connections of different individuals in a social network; and so forth. Indeed, large domain rankings having the features of top-weightedness, non-conjointness, and arbitrary truncation -- what I term indefinite rankings -- seem more common than the unweighted, conjoint, complete rankings assumed by traditional rank similarity measures such as Kendall's . And yet there are few if any suitable measures described in the literature for comparing such rankings.
In our article, A Similarity Measure for Indefinite Rankings published (finally!) in the November 2010 edition of ACM TOIS (author's version here), we propose a new similarity metric for just such rankings: rank-biased overlap (RBO). RBO is based upon cumulative set overlap, a framework we argue to be more suited to non-conjoint rankings than the notion of correlation employed in most other rank similarity measures. The set overlap at each rank is weighted by a geometric sequence, providing both top-weightedness and convergence. Convergence means that a prefix comparison sets bounds on an indefinite one -- an important characteristic in an environment in which the similarity across ranking pairs of highly varying lengths (such as search results) is to be compared and summarized. The resulting metric implements a simple probabilistic model (though it takes some rather involved algebra to do so). The model is of an observer who scans down the two lists with constant marginal but decreasing absolute patience. Then, when the observer's patience runs out, they randomly select an element from one of the lists, and see if it is in the other. RBO gives the probability that the described behaviour produces an element that occurs in both rankings.
As part of our experimental validation, we ran more than 100 queries daily for several months against eight different web search engines (plus three localizations thereof), then used RBO to analyze the similarity of the results. It turned out that we ran this experiment at the last period during which it was possible on this scale; most of the engines we used have since disappeared or become derivatives of more successful rivals, with a couple of the engines falling off their perch during our study itself.
The ACM's gentlemanly disdain for unseemly haste in casting new ideas before an unready public also allows plenty of opportunity, in the interval between final revision and publication, for other work to be published on the same topic. Just such a publication, appearing at WWW 2010, too late for us to cite, is Visualizing Differences in Web Search Algorithms using the Expected Weighted Hoeffding Distance, by Mingxuan Sun, Guy Lebanon, and Kevyn Collins-Thompson. The EWHD metric is top-weighted and handles non-conjointness; it is unclear to me, however, whether the metric is monotonic in depth, and hence whether it is appropriate for comparisons between ranking pairs of different depths. Sun, Lebanon, and Collins-Thompson do, though, propose an interesting criterion for choosing between rank metrics (always a thorny topic), namely the metric's suitability for use in clustering visualization. (Ravi Kumar and Sergei Vassilvitskii also proposed a new rank similarity measure at the same conference, in Generalized Distance between Ranking, but their measure does not handle non-conjointness.)
I have put together an implementation of RBO (in C). The implementation is somewhat unpolished, although efficient. I hope to add other rank similarity metrics in future; included so far are only Kendall's tau (in an implementation, compared to R's ) and a metric we called Average Overlap, developed in parallel by two different research groups (under different names). Comments and suggestions are, as ever, welcome.