Who has sighted (not just cited) Belkin 1980, "ASK"?

I posted a little while ago about citing papers that you yourself have not read, for which Rob Hyndman came up with the catchy advice "Sight what you Cite". One cause of such sightless citations is that the original work is inaccessible. I've run across an example of this today: Nick Belkin, "Anomalous states of knowledge as a basis for information retrieval", Canadian Journal of Information Science, 1980, pp 133--143.


I'll first attempt to explain why you would want to cite this paper; if you're not interested in this rather long explanation, then please jump to the subsequent section.

ASK and best-match models

The Anomalous States of Knowledge (ASK) model holds that information-seeking behaviour begins from the user's awareness that their knowledge contains anomalies. ASK is a critical response to an earlier (and still predominant) model, which holds that a user's query is a direct statement of their information need. Under this earlier model, the information retrieval task is to find the document or documents that are the best match with the query, in the sense of being the most similar to it. Indeed, so pervasive is this understanding of IR that the algorithms used to determine which documents should be returned for each query are generally known as "similarity measures", because they are regarded as calculating document--query similarity.
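As a purely illustrative sketch (mine, not Belkin's), a best-match system of this kind might score documents by something like cosine similarity between query and document term vectors:

```python
import math
from collections import Counter

def cosine_similarity(query: str, document: str) -> float:
    """Score a document against a query by the cosine of their term-count vectors."""
    q_vec = Counter(query.lower().split())
    d_vec = Counter(document.lower().split())
    # Dot product over terms the query and document share.
    dot = sum(q_vec[t] * d_vec[t] for t in q_vec if t in d_vec)
    q_norm = math.sqrt(sum(c * c for c in q_vec.values()))
    d_norm = math.sqrt(sum(c * c for c in d_vec.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)

# The "best match" is simply the document most similar to the query.
docs = ["anomalous states of knowledge as a basis for information retrieval",
        "a survey of web search behaviour"]
query = "anomalous states of knowledge"
ranked = sorted(docs, key=lambda d: cosine_similarity(query, d), reverse=True)
```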

The Anomalous States of Knowledge (ASK) theory replies that, on the contrary, while the user is aware that there is something they don't know but want to, they are not in general aware of precisely what it is they don't know. There is, therefore, a fundamental mismatch between documents and queries. A document is a statement of what its author knows; a query is a statement of what the user does not know. Attempting straightforwardly to match the two up is misguided.

The distinction between the best-match and ASK models may seem only a theoretical one, but it does have important practical consequences. The best-match model tends to portray the user's information need as static, and communicable in a single, well-formed query, an understanding that fits well with batch-mode information retrieval. In contrast, the ASK model acknowledges that the user's information need is generally hazy at the beginning, and is clarified and extended in response to exploring the information space; the model, therefore, presents an understanding of the information retrieval process that is intrinsically iterative and interactive.

The experience of the web supports the ASK model; thanks not least to its hypertext nature, the web strongly supports information exploration. Actual web search is a kind of iterative-batch hybrid: users iterate, but the search tool is essentially still a batch retrieval one, matching queries to documents. This may seem sub-optimal, but as Thomas Carlyle observed, humans are tool-using animals; we are adept at using tools to achieve goals that the tool does not intrinsically encapsulate, and web search users have learnt to use search engines effectively for what they are: highly sophisticated keyword-matchers. Indeed, simple, crude tools whose operation is readily comprehensible may well be preferable to complex, highly-tuned tools whose operations are opaque. And if the search experience is a hybrid, the predominant evaluation methods are still resolutely batch-mode; despite a number of attempts to introduce interactivity, mainstream IR system evaluation still consists of assessing the outputs of once-off query runs.

Sighting Belkin

The preceding was by way of underlining why a citation of Belkin's original paper on the ASK model is something that a researcher might want to make. Looking at the citation, though, one will immediately observe that the Canadian Journal of Information Science seems an obscure journal, as indeed it is. The journal is only available online from 2000 onwards; even the table of contents is not consistently available before that. Nevertheless, according to Google Scholar, Belkin's article is cited 434 times, nearly as often as the 545 citations for the later but more retrievable Belkin, Oddy, and Brooks, "ASK for information retrieval: Part I. Background and theory", Journal of Documentation, 1982, which is the version I have been working from.

So, the first, more utilitarian, question is, does anyone have a digital version of the 1980 paper?

Fundamental papers in obscure places

The second, more interesting question is why it seems to happen so often that highly-cited, fundamental papers get published in such obscure locations. Belkin's paper is only one example of this; there is also Sparck Jones and van Rijsbergen's 1975 report on the "ideal" test collection, a number of papers by Stephen Robertson that ended up in inaccessible Aslib Proceedings, Ellen Voorhees's talk on the Cranfield paradigm at CLEF 2001 (retrievable for being produced in the Web age, but hardly an archival venue), and so forth. Even within mainstream journals, earlier articles of enduring interest tend to be surprisingly concentrated in nominally lower-ranked journals, such as Information Processing and Management or the Journal of Documentation, rather than in their supposedly higher-impact peers, such as ACM Transactions on Information Systems or the Journal of the American Society for Information Science and Technology.

My surmise is that the more obscure or lower-status venues are more receptive to ideas and arguments that are outside the mainstream or do not fit into the discipline's predominant publication mould. Now of course the majority of these off-beat publications are probably just bad; but the greater freedom of expression and lower barriers to acceptance mean that many of the new, original ideas end up there as well.

A classic example of this process of academic rigidity banishing intellectual innovation in the Information Retrieval sphere is Brin and Page's paper on PageRank. It was rejected by SIGIR for insufficient experimental validation -- an absolute requirement for acceptance in the field -- and was instead published in the (academically) lower-status but less hidebound proceedings of the WWW conference.
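
(For readers who have not seen the algorithm itself sketched, PageRank is easy to state: a page's score is the stationary probability of a "random surfer" on the link graph. The following power-iteration version is my own minimal illustration, not the code or notation of Brin and Page's paper.)

```python
def pagerank(links, damping=0.85, iterations=50):
    """Minimal power-iteration PageRank. `links` maps each page to its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A dangling page spreads its rank evenly across all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# A toy three-page web graph: a links to b and c, b links to c, c links back to a.
print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```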

6 Responses to “Who has sighted (not just cited) Belkin 1980, "ASK"?”

  1. jeremy says:

    I haven't "sighted" Belkin 1980, but I have "sighted" Belkin et al 1982:

    http://irgupf.com/2009/03/09/exploration-and-explanation/comment-page-1/#comment-4480

    And I suspect a lot of other people have sighted Belkin 1982 as well, given that it was republished in the 1997 "Readings in Information Retrieval" book (Sparck Jones and Willett, eds.)

    I'm sure that 1982 contains much the same theoretical grounding as 1980, so it's not as though I/we/the community don't understand the ASK hypothesis and are only going on hearsay. But you raise a good point: why is it that 1980 gets cited so often, rather than the more widely-read 1982? I'll bet it has something to do with researchers wanting to cite the earliest work, lest they get called out for not doing so.

  2. Daniel Lemire says:

    Are you absolutely certain that PageRank was good science?

    Please see one of my older blog posts on this topic:

    Is PageRank just good marketing?
    http://www.daniel-lemire.com/blog/archives/2007/11/28/is-pagerank-just-good-marketing/

    Disclaimer: I teach PageRank to all students in my Information Retrieval course. I'm no PageRank basher. But was it really rejected because of how innovative it was?

    Disclaimer 2: I submitted a paper once to SIGIR and, of course, it was rejected. I submitted it to Information Retrieval (the journal) and it was accepted with rave reviews. To this day, I think it was decent work, though I would now rewrite the paper differently. It was certainly an original take on the topic.

  3. william says:

    Daniel,

    It is true that there is some scepticism about the real value of PageRank in the IR academic community. Thanks for the reference to your blog post; I had not been aware of some of the papers cited in it and in the comments. I'm not convinced that they demonstrate that PageRank was inferior to other solutions available at the time of its publication, but that's a topic for another time.

    However, PageRank was certainly an interesting and, at the time, novel idea, one that, along with contemporaneous link-analysis work such as HITS, has inspired a lot of other work in the area. It has a prima facie attraction to it. The original paper did not fit into the SIGIR mold in that it lacked experimental validation. Now of course part of the problem is that the existing test collections of the time did not provide the data needed to adequately validate (or falsify) the method. I've discussed this with SIGIR luminaries, and their response has been "well, you can always perform some sort of experimental analysis". But I think this is quite wrong-headed: if a proper experimental validation is not possible, an improper one should not be cobbled together to meet some publication hurdle. And the nature of innovative research is that test collections to validate a new idea do not become publicly available until the new idea has already been established.

    As it happens, our CIKM paper is on the topic of going through the motions of empirical validation in SIGIR and CIKM papers without the experiments actually meaning anything.

  4. jeremy says:

    William:

    The original paper did not fit into the SIGIR mold in that it lacked experimental validation. Now of course part of the problem is that the existing test collections of the time did not provide the data needed to adequately validate (or falsify) the method.

    This is true. However, look at what the authors write in the original Google paper:

    However, most of the research on information retrieval systems is on small well controlled homogeneous collections such as collections of scientific papers or news stories on a related topic. Indeed, the primary benchmark for information retrieval, the Text Retrieval Conference [TREC 96], uses a fairly small, well controlled collection for their benchmarks. The "Very Large Corpus" benchmark is only 20GB compared to the 147GB from our crawl of 24 million web pages. Things that work well on TREC often do not produce good results on the web.

    In other words, they had already amassed their own large (147 GB), real-life (web-crawled) collection. All they needed to do to perform experiments was two things: (1) come up with 50 queries, and (2) evaluate 2 x 10 x 50 = 1000 pages for relevance.

    At the time, 50 queries was enough to establish statistical significance at a believable level in the SIGIR community. It wouldn't have been that hard to come up with 50 queries. Grab 5 friends and everyone could do 10 each. You could do that in half an hour, if not less.

    Second, why do I say 2 x 10 page relevance evaluations? Because you have to have a baseline. Run your search algorithm without pagerank for each query, and then run it with pagerank. Look at the top 10 results from each (2 x 10). Do that for all 50 queries. (You might even be able to get away with less than 1000 total judgments if there is significant overlap between the sets of documents returned by both systems.)

    In the worst case, and with 5 friends, that's only 200 pages each that you have to look at. That's really not a lot of work, overall. It could have been done. Heck, I wrote my first search engine around that time (in early 1998, for the IR course at UMass, where I was a grad student), and Bruce Croft and Jamie Callan (who was at UMass at the time) had us do exactly what I'm proposing... evaluate relevance ourselves for the top 10 docs returned from each query. I think I remember doing about 25 queries x 10 = 250 judgments. Not hard. Took me an afternoon.

    Then, with judgments for PageRank and a non-PageRank baseline, they can compute things like rank of the first relevant hit, Precision @3, @5, @10, etc. It's true that they couldn't have computed recall. But so what? Web engines today still don't compute recall :-) Precision at the top is what matters to most web searchers.

    So they easily could have done all that. Would the experiment have been on a "standardized" test collection? No. Does that matter to the SIGIR community? My experience is no. I see novel ideas all the time for which there are no test collections, and still folks manage to come up with reasonable evaluations. And SIGIR accepts those papers.

    Just my $0.02.
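
    To make the arithmetic concrete, a rough sketch of the kind of precision-at-k comparison described above might look like this; the queries, result lists, and judgments are invented for illustration, not taken from the Google paper:

    ```python
    # Hypothetical data: per-query relevance judgments plus top-ranked results from a
    # baseline ranker and a PageRank-boosted ranker (lists truncated for brevity).

    def precision_at_k(ranked_docs, relevant, k):
        """Fraction of the top-k ranked documents that were judged relevant."""
        return sum(1 for d in ranked_docs[:k] if d in relevant) / k

    judgments = {"q1": {"d2", "d5"}, "q2": {"d1"}}
    baseline_runs = {"q1": ["d7", "d2", "d9", "d5"], "q2": ["d4", "d3", "d1", "d8"]}
    pagerank_runs = {"q1": ["d2", "d5", "d7", "d9"], "q2": ["d1", "d4", "d3", "d8"]}

    for k in (3, 5, 10):
        base = sum(precision_at_k(baseline_runs[q], judgments[q], k) for q in judgments) / len(judgments)
        boosted = sum(precision_at_k(pagerank_runs[q], judgments[q], k) for q in judgments) / len(judgments)
        print(f"P@{k}: baseline={base:.2f}, with PageRank={boosted:.2f}")
    ```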

  5. william says:

    Jeremy,

    Hmm, good point. I'll take a step back on this one. The authors could, and probably should, have conducted a more rigorous evaluation than the rather anecdotal one they did (a search for "university", on page titles only). SIGIR is not entirely at fault here. Still, in terms of publications, it is an interesting miss.

    William

  6. [...] of information retrieval holds that users do not always know what they are looking for. (See, Who has sighted (not just cited) Belkin 1980, ‘ASK’?) Their information needs are [...]
