The Cranfield tests

I thought that, since I've written recently about the history of the phrase "the Cranfield paradigm", I should write something about the Cranfield tests themselves. I'm not going to give a full history, as this can be found in Karen Spärck Jones's account, Gerard Salton's reflections (published version here), or indeed the original (readable, if highly detailed) reports themselves. Here, I want to draw out a few salient points.

The Cranfield tests consisted of a series of experiments run over more than a decade, starting with some preliminary investigations in 1953, and continuing on into the late 1960s. Within this period, there were two main stages. The first of these ran from 1957 to 1961, and is commonly called Cranfield 1; and the second, Cranfield 2, ran from 1963 to 1966.

The routine description of modern, TREC-style experiments as following the Cranfield methodology can obscure the fact that the Cranfield experiments were quite different, almost alien, from what we are used to in IR today. Most obviously, the experiments were not computerised. Instead, they were carried out manually -- human-made indexes and human-executed retrieval over them.

Lack of automation aside, the Cranfield experiments were not by design retrieval experiments in the sense that we understand them today. Rather, they were indexing experiments. The report on Cranfield 1 calls it "an investigation into the comparative efficiency of indexing systems", while Cranfield 2 is described as "tests on index language devices". Specifically, Cranfield 1 compared four different indexing languages. Then Cranfield 2 examined in detail what components (devices) of an indexing language made it more effective.

This focus on indexing languages takes some effort for today's IR researcher to understand; indeed, even the term "index language" is unfamiliar. For us, the form of indexing is not an important component in retrieval effectiveness; one simply records which documents each term appears in, and where within those documents it occurs. All the real work is done at the query processing and document retrieval stages.
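To make the modern notion of indexing concrete, here is a minimal sketch of a positional inverted index in Python. It is only an illustration of the idea described above, not any particular system's implementation; the function name `build_positional_index`, the whitespace tokenisation, and the two sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}, given a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenisation (an assumption): lowercase, split on whitespace.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


# Hypothetical sample documents, invented for illustration.
docs = {
    "d1": "fatigue testing of aluminium alloys",
    "d2": "aluminium alloys in aircraft structures",
}
index = build_positional_index(docs)
print(index["aluminium"])  # {'d1': [3], 'd2': [0]}
```

Nothing here involves judgement about what a document is "about": the index is a mechanical record of term occurrences, and any intelligence is deferred to query time.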

However, at the time it was the dogma of information science and its applied field, librarianship, that sophisticated indexing and classification schemes, applied by trained experts, were necessary for effective retrieval. (At this point, it will probably be helpful for the reader to think of an "index" not as an electronic file on a computer disk, but as a physical card catalogue.) For instance, the Universal Decimal Classification, one of the languages tested at Cranfield 1, was an extension of the Dewey Decimal system, with a hierarchical code that classified every field, sub-field, and topic: "aluminium alloys", say, had the UDC code 669.715, while "fatigue testing" was coded as 620.178.3.

The Cranfield tests were therefore based on the assumption that the really crucial step in the whole information management and retrieval process lay in how the index was constructed. However, the results of the tests contradicted this assumption in quite unforeseen, and even unsettling, ways.

Cranfield 1 found that, in fact, there was little difference in effectiveness between different indexing languages; if anything the simpler the indexing method, the better the retrieval. This unexpected result inspired in Cranfield 2 a closer examination of precisely what components of an indexing language boosted retrieval effectiveness. There were three base indexing languages: independent ("single") terms; multi-term concepts; and a thesaurus-based controlled language. Each of these was elaborated with a number of indexing devices, to create a total of 33 different index languages (four of them based on title or abstract terms).

The result of the Cranfield 2 experiments on these composite languages was even more striking than Cranfield 1: single terms outperformed concept-based and controlled languages, while amongst the single term language variants, those with simpler or no devices marginally outperformed those with the more complicated, supposedly more sophisticated techniques. This was a very surprising, even shocking result to the experimenters, as Cleverdon notes in his report (v2 page 252):

Quite the most astonishing and seemingly inexplicable conclusion that arises from the project is that the single term index languages are superior to any other type. ... This conclusion is so controversial and so unexpected that it is bound to throw considerable doubt on the methods which have been used to obtain these results ... A complete recheck has failed to reveal any discrepancies, and ... there is no other course except to attempt to explain the results which seem to offend against every canon on which we were trained as librarians.

The finding that complex, controlled, classificatory indexing schemes were not necessary for effective retrieval is the most important theoretical result of the Cranfield experiments. Indeed, part of the reason that the aims of the Cranfield experiments seem so alien to us today is that Cranfield served to overthrow the ancien regime of controlled indexing, whose assumptions it had set out to probe. Important corroboration, from a similar background of ungrounded assumptions, came from the contemporary SMART experiments: assured by linguists that devices such as thesauri and syntactic analysis were essential to effective automated retrieval, the SMART project set out to explore how such linguistic devices could be implemented, only to find that in practice they gave no benefit in retrieval effectiveness. And Cranfield's findings are of no little practical significance, either: if sophisticated, intelligent, and trustworthy classification and indexing were necessary for effective retrieval, then search engines could hardly exist as we know them today, and the world of information would be a very different place from what it is.

The influence of the Cranfield experiments in establishing a strong experimental methodology in information retrieval is widely recognised, although given the wide divergence between Cranfield's assumptions and methods and those of today, this methodological influence is perhaps more by way of inspiring example than direct institution. Less obviously, but perhaps as importantly, Cranfield founded information retrieval's stubborn empiricism by itself being a triumph of the empirical method over pre-existing dogma: the belief that information management and access required careful classification and indexing, a tradition that can be traced all the way back to Aristotle, was unintentionally demolished by a few years of experimental assessment. This empirical shock can be seen as a crucial factor in the emergence of empirically-focused information retrieval as a discipline distinct from speculatively-minded information science.

The method of the demolition of previous dogma can be seen as starting the thread on a third strand of empiricism, but this time a more tangled one. The Cranfield experiments expended great effort on identifying and experimentally delineating different explanatory and predictive variables, in the form of index language devices amongst other things, then on testing their effect on the retrieval outcome -- the sort of science that some would like to see the field of information retrieval embrace today.1 The results of all this effort were, however, disappointing: devices did not have the effect predicted, index languages were more different in their theory than in their results, and the simplest and most straightforward methods seemed as good as any other.

In these results we can see the birth of the black box model of information retrieval systems: the input (documents and queries) and output (retrieved information) can be observed, but no strong explanatory or predictive model of what goes on inside is made. Different retrieval models such as the vector space model and the language model do, it is true, have theories behind them, but these theories have only weak explanatory and almost no predictive content. Instead, the field works largely through hunches, heuristics, and rules of thumb. Naturally, in this sort of environment, empirical demonstration is paramount: without theory, all we have is experiments. But the experiments themselves, not adapting to changes in theory, readily become stereotyped.

Theory and methodology are connected. In this post, I have focused on Cranfield's impact on the theory of information retrieval. More commonly, the tests are cited for their influence on methodology. It is generally in the latter sense that "the Cranfield paradigm" is spoken of. But is it appropriate to speak of a paradigm when the model is solely that of an experimental method? I intend to examine this in a later post.

1 My interpretation differs from that of Stephen Robertson. Robertson treats the Cranfield experiments as being intentionally non-predictive and non-explanatory, focused only on relevance as a measure of success. However, I see the careful delineation of different index language devices as being the first (as it happened, thwarted) step towards the identification and manipulation of experimental variables. For instance, Cleverdon hypothesises that certain devices will aid precision, others recall.
