The popular history of web search holds that Google buried its competition in part because its PageRank algorithm solved, for a time, the problem of web spam; and some pundits now assert that not merely Google, but algorithmic search in general, is losing its battle against the spammers. Spam has attracted much less attention amongst IR researchers, though, than its seeming criticality to operational systems would suggest: none of Google Scholar's top ten results for the search [web spam search results] is from a mainstream IR venue. Part of the reason for this neglect is that most research collections have been drawn from curated or semi-curated sources, and hence are spam-free. Academic research tends to pursue the questions that its datasets readily answer.
The ClueWeb09 collection, though, introduced at the TREC 2009 Web track, is a crawl of the open web, and hence contains plenty of spam. Indeed, since our own access to the web as internet users is mediated by search engines that put great effort into preventing us from seeing spam, ClueWeb09 and similar collections offer a rare chance for us to see what the wild web that search engines must deal with really looks like, and how the pollution of the web with spam affects search results.
A recent paper by Gord Cormack, Mark Smucker, and Charlie Clarke, of the University of Waterloo, uses the ClueWeb09 collection to investigate the impact that spam has on retrieval effectiveness, and to present a brutally simple yet surprisingly effective algorithm for single-pass training and filtering of web spam. Cormack and colleagues find that of the top ten results produced by content-only, spam-agnostic retrieval algorithms for a representative sample of web queries, 60% to 80% are spam (depending on the curmudgeonliness of the annotator). Not surprisingly, then, adding a spam filter to such a naive system leads to an enormous boost in effectiveness, with estimated precision-at-10 scores going from 0.09 with no filtering to over 0.40 with optimal filtering settings. Their naive method plus spam filtering outperforms the best of the TREC 2009 participant systems, and post-hoc spam filtering significantly improves the submitted runs of almost all participants.
Surprisingly, perhaps (or perhaps not), the spam filter used to achieve this dramatic improvement in effectiveness is, in the authors' words, "brutally simple" in method, model, and implementation. In method, it is a logistic regression classifier trained in a single, linear pass through the training set; the heart of the classifier is a one-line update formula backed by a one-dimensional array of log odds. The implementation is worth the price of the download in itself: a complete classifier in half a printed page of C code (and yes, it is printed in the paper). Most astonishing of all, though, is the simplicity of the feature model. There is no link analysis, no page-quality metric, no layout analysis, no complex linguistic processing; not even the basic step of parsing the text into terms. The only features are the raw, overlapping four-byte sequences of the original text, HTML markup, headers, and all.
The Waterloo classifier has gained somewhat mythic status in certain circles. Its two dozen lines of C, plus Gord Cormack making a rapid sequence of training classifications, achieved performance equal to or exceeding that of commercial e-discovery systems backed by extensive manual review in the interactive task of the TREC 2009 Legal Track. Having struggled to build a much more complex, and much less effective, classifier for the subsequent year's track, I find it both refreshing and humbling to see how simple the Waterloo classifier is. I'd recommend that computer scientists with machine-learning envy have a look at it, and play around with it. The paper itself, moreover, is an easy, rollicking read.