Spam filtering for web results

The popular history of web search holds that Google buried its competition in part because its PageRank algorithm solved, for a time, the problem of web spam; and some pundits now assert that not merely Google, but algorithmic search in general, is losing its battle against the spammers. Spam has attracted much less attention amongst IR researchers, though, than its seeming criticality to operational systems would suggest; none of Google Scholar's top ten results for the search [web spam search results] is from a mainstream IR venue. Part of the reason for this neglect is that most research collections have been drawn from curated or semi-curated sources, and hence are spam-free. Academic research tends to pursue the questions that its datasets readily answer.

The ClueWeb09 collection, though, introduced at the TREC 2009 Web Track, is a crawl of the open web, and hence contains plenty of spam. Indeed, since our own access to the web is mediated by search engines that put great effort into keeping spam out of our sight, ClueWeb09 and similar collections offer a rare chance to see what the wild web that search engines must contend with really looks like, and how the pollution of the web with spam affects search results.

A recent paper by Gord Cormack, Mark Smucker, and Charlie Clarke, of the University of Waterloo, uses the ClueWeb09 collection to investigate the impact that spam has on retrieval effectiveness, and to present a brutally simple and surprisingly effective algorithm for single-pass training and filtering of web spam. Cormack and colleagues find that of the top ten results produced by content-only, spam-agnostic retrieval algorithms for a representative sample of web queries, 60% to 80% are spam (depending on the curmudgeonliness of the annotator). Not surprisingly, then, adding a spam filter to such a naive system leads to an enormous boost in effectiveness, with estimated precision-at-ten scores going from 0.09 with no filtering to over 0.40 with optimal filtering settings. Their naive method plus spam filtering outperforms the best of the TREC 2009 participant systems, and post-hoc spam filtering significantly improves the submitted runs of almost all participants.

Surprisingly, perhaps (or perhaps not), the spam filter used to achieve this dramatic improvement in effectiveness is, in the authors' words, "brutally simple" in method, model, and implementation. In method, it is a logistic regression classifier trained in a single, linear pass through the training set; the heart of the classifier is a one-line update formula backed by a one-dimensional array of log odds. The implementation is worth the price of the download in itself: a complete classifier in half a printed page of C code (and yes, it is printed in the paper). Most astonishing of all, though, is the simplicity of the feature model. There is no link analysis, no page-quality metrics, no layout analysis, no complex linguistic analysis; not even the basic step of parsing the text into terms. The only features are the raw, overlapping four-byte sequences of the original text, HTML, headers, and all.
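
To make that concrete, here is a rough sketch of the general recipe, not a reproduction of the authors' code (which you should read in the paper itself): a rolling four-byte window hashed into a weight array, scored and updated with a one-line logistic-regression step. The bucket count, learning rate, and toy training strings are illustrative choices of mine, not the paper's settings.

    /* Sketch only: online logistic regression over hashed, overlapping
     * 4-byte windows of raw text.  NBUCKETS and RATE are illustrative
     * values, not the settings used in the paper. */
    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    #define NBUCKETS 1000000        /* hash buckets for 4-byte features */
    #define RATE     0.002          /* learning rate for the online update */

    static float w[NBUCKETS];       /* per-bucket log-odds contributions */

    /* Score a document: sum the weights of its hashed 4-byte windows and
     * squash the total through the logistic function to get P(spam). */
    static double spamminess(const unsigned char *doc, size_t n)
    {
        unsigned b = 0;
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            b = (b << 8) | doc[i];          /* slide the 4-byte window */
            s += w[b % NBUCKETS];           /* the "hash" step */
        }
        return 1.0 / (1.0 + exp(-s));
    }

    /* Single-pass training: nudge every bucket the document touches in
     * the direction of the prediction error (the logistic-loss gradient). */
    static void train(const unsigned char *doc, size_t n, int is_spam)
    {
        double err = (double)is_spam - spamminess(doc, n);
        unsigned b = 0;
        for (size_t i = 0; i < n; i++) {
            b = (b << 8) | doc[i];
            w[b % NBUCKETS] += (float)(RATE * err);
        }
    }

    int main(void)
    {
        /* Toy "documents"; real training would stream labelled pages. */
        const char *spam = "BUY CHEAP PILLS ONLINE!!! best prices, click here now";
        const char *ham  = "The quick brown fox jumps over the lazy dog.";

        for (int pass = 0; pass < 100; pass++) {
            train((const unsigned char *)spam, strlen(spam), 1);
            train((const unsigned char *)ham,  strlen(ham),  0);
        }
        printf("P(spam | spammy text) = %.3f\n",
               spamminess((const unsigned char *)spam, strlen(spam)));
        printf("P(spam | clean text)  = %.3f\n",
               spamminess((const unsigned char *)ham,  strlen(ham)));
        return 0;
    }

The point of the sketch is just the shape of the computation: a rolling window, a hash, an array lookup, and a one-line gradient update.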

The Waterloo classifier has gained somewhat mythic status in certain circles. Its two dozen lines of C, plus Gord Cormack making a rapid sequence of training classifications, achieved performance equal to or exceeding that of commercial e-discovery systems backed by extensive manual review in the interactive task of the TREC 2009 Legal Track. Having struggled to build a much more complex, and much less effective, classifier for the subsequent year's track, I find it both refreshing and humbling to see how simple the Waterloo classifier is. I'd recommend that computer scientists with machine-learning envy have a look at it and play around with it. The paper itself is an easy, rollicking read.

2 Responses to “Spam filtering for web results”

  1. Itman says:

    I agree that it was a stupefying result. Most of all, I would not have expected letter quad-grams to be such powerful predictors. However, further verification is necessary to understand how well it works on other collections.

  2. That's a fascinating paper, and a cute piece of code.

    First, the code as printed at "http://arxiv.org/pdf/1004.5168v1.pdf"
    has a printing error. Both lines which read

    h = b %

    should read

    h = b % P;

    That's the "hash" step.

    So why does this work? It's simple enough. It looks at successive
    sequences of four bytes (not even characters, in the case of
    Unicode), hashes them into about a million buckets, and computes
    some simple stats. What common factor do spam pages have that
    trips such a filter?

    That's easy. Ads. Every page with a Google ad has something
    like this:

    google_ad_client = "pub-5268721215000735";
    google_ad_slot = "4222970967";
    google_ad_width = 250;
    google_ad_height = 250;

    Each ad vendor has some stylized form of advertising code, which is
    cut and pasted with minor modifications into each page with ads.
    That's exactly the sort of thing a 4-byte recognizer will pick up.
    Comparable results could probably be obtained by counting links.

    This is similar to signature-based virus recognition, which was
    the state of the art in anti-virus tools about a decade ago. Virus
    writers have since advanced to "polymorphic" viruses, where each
    copy is different. Modern viruses use encryption, compression, and
    randomization to spread their bit patterns over a wide space to
    prevent simple signature recognition from detecting them. The
    same techniques could be used on spam web pages.

    John Nagle
    SiteTruth
