Citing papers that you've never read -- or that were never written

A regrettably common practice in academic writing is to cite papers because someone else cites those papers, without having read them yourself. Justin warns about this in his book on writing for computer science, and I know he personally has a few good anecdotes. The practice is particularly amusing when the paper is incorrectly referenced or simply doesn't exist; the original citer has made a mistake, and this has been blindly carried forward by their imitators. Via Edel Garcia comes The Most Influential Paper Gerard Salton Never Wrote, an article by David Dubin tracing the history of the vector space model as applied to the field of information retrieval. In this article, Dubin points out that a highly cited paper, "A Vector Space Model for Information Retrieval", published by Gerard Salton in 1975 in the Journal of the American Society for Information Science, does not in fact exist.

Not wanting to be hoisted on my own methodological petard, I verified Dubin's claim that there is no article of that title in JASIS for that year, though there is one by Salton et al. titled "A theory of term importance in automatic text analysis" in Issue 1 of that year's volume. There is a paper by Salton et al. in the November 1975 issue of Communications of the ACM with the similar title of "A vector space model for automatic indexing", but CACM is not the journal that usually gets cited, and of course even if it were, the title is still not correct. (Both these articles are noted by Dubin.) Nevertheless, the non-existent article is cited 215 times according to Google Scholar. Amusingly, the earliest citations that Google reports are from papers that Salton is co-author of, published around the time of Salton's death: Amit Singhal and Gerard Salton, Automatic Text Browsing Using Vector Space Model (IEEE Dual-Use Technologies and Applications Conference, 1995), and Amit Singhal, Gerard Salton, Mandar Mitra, and Chris Buckley, Document Length Normalization (Information Processing and Management, 1996). The latter is itself a highly cited paper (672 citations, according to Google Scholar), because it introduces the widely-used pivoted cosine variant of the vector space model. It is possible therefore that the mistaken citation was first made at this time, and copied forward from there. Of course, Google Scholar itself may be biased towards later articles, because they are more likely to be in digital format. I haven't been able to locate phantom cites prior to 1995 in a very brief look through earlier literature; I'd be interested to hear if anyone knows of any.

I take it from Dubin that part of the reason this particular misciting is so pervasive is that authors need such an article to exist, and to have been written by Gerard Salton around this time. A certain understanding of the vector space model has come to be accepted as the early method for calculating document-query similarity (essentially, each of the n terms in the vocabulary becomes a dimension in n-dimensional space; documents are represented as points or vectors in this space; and the nearness of these vectors is used to measure the similarity of the documents). Gerard Salton was a pioneer in statistical approaches to document-query retrieval. Anyone discussing the history of statistical IR wants a handy cite on the vector space model, and feels the cite should be to Gerard Salton. So a phantom paper by Salton called "A Vector Space Model for Information Retrieval" fits the bill perfectly.
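For readers who haven't met it, that accepted understanding of the vector space model can be sketched in a few lines of Python. This is only an illustration of the general idea, not a reconstruction of anything Salton published. The raw term counts, the whitespace tokenisation, the use of cosine as the measure of "nearness", and the function names (`term_vector`, `cosine_similarity`, `rank`) are all my own assumptions.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Represent a text as a vector with one dimension per vocabulary term.

    Assumption: components are raw term counts; real systems weight terms.
    """
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    # The vocabulary is every distinct term seen in the documents or the query.
    vocabulary = sorted(
        {term for text in documents + [query] for term in text.lower().split()}
    )
    query_vector = term_vector(query, vocabulary)
    scored = [
        (cosine_similarity(query_vector, term_vector(doc, vocabulary)), doc)
        for doc in documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Toy "documents" made up for illustration.
    documents = [
        "vector space model for indexing",
        "term importance in text analysis",
        "document length normalization",
    ]
    for score, doc in rank("vector space retrieval", documents):
        print(f"{score:.3f}  {doc}")
```

Only the first toy document shares any terms with the query, so it is the only one with a non-zero score. Whether this picture matches what Salton and his collaborators originally meant is, of course, exactly what Dubin questions.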

There is a broader issue here to do with citation practice. An author of an academic paper needs to refer to a variety of ideas or facts (the origin of a retrieval model; the average length of a search query; the frequency distribution of a phenomenon) that are not central to the main point of their paper, but need to be stated for background or motivational reasons. The author feels that these facts are (or should be) generally known, or that the general idea is what matters, not the exact details. Nevertheless, such facts cannot simply be stated; they require a citation. (And, of course, a longer list of references rarely hurts a paper's chance of getting accepted.) So an assortment of such handy cites gets collected and handed on, like a useful tool box, from one researcher to another.

I should stress that there is much more to Dubin's essay than the points I have mentioned here. Dubin's argument is not just that the phantom article doesn't exist, but that the understanding of the history of the vector space model that requires such an article to exist is itself mistaken. According to Dubin, vector spaces were first introduced by Salton and collaborators as a handy way of visualising and describing similarity computations; other papers misunderstood Salton as proposing the vector space model as a formal model of information retrieval, and criticised it on that basis; and it was in response to those criticisms that Salton finally, in the 1980s, explicitly described the vector space model as it is now understood. The lesson is that sometimes, not only are the citations phantoms, but so too is the "common knowledge" that underlies them. So if you are writing about the vector space model, which paper should you cite? You'll have to read Dubin's essay yourself, and make up your own mind! I highly recommend it as an interesting blend of formal analysis, intellectual history, and original criticism.

Returning to the topic of phantom citations, Eamonn Keogh has a couple of slides (slides 112 and 113) about citing papers you haven't read, including a misspelling of a collaborator's name that has been carried on in later citations. Anyone know of other examples?

17 Responses to “Citing papers that you've never read -- or that were never written”

  1. Shane Culpepper says:

    It is an interesting paper. However, there is (at least) one citation missing. I do not consider myself an expert on the VSM, but the following paper is not in Dubin's discussion, and it is one I often use. It is probably not the full VSM as we know it now, but it's a pretty close predecessor. Have a look at page 25 of the following:

    @article{sl68-jacm,
    author = {G. Salton and M. E. Lesk},
    title = {Computer evaluation of indexing and text processing},
    journal = {Journal of the ACM},
    month = {January},
    year = {1968},
    volume = {15},
    number = {1},
    pages = {8--36},
    annote = {Early vector space model.},
    }

    The argument that you should not cite papers you haven't read generally holds, but what about papers and books you just can't get your hands on anymore? For instance, I have cited Zip's "Human behavior and principle of least effort" and it has always bothered me. I tried for some time to get my hands on a copy, but I couldn't find it anywhere. I never actually got to even skim the original.

    Two other papers which I have wasted huge amounts of time trying to get (unsuccessfully) were by A. Markov and V. I. Levenshtein. Both are highly cited, seminal papers. There's a really good chance they do exist, but I have never seen them with my own eyes. I'm willing to bet very few people have actually been able to read the original. They are both in Russian, so I'm not sure if having them would be beneficial or not.

  2. Shane Culpepper says:

    Oops. Zip should be Zipf. Just to make sure I'm not crazy, I tried to find the missing references mentioned. Amazingly, I just managed to find the elusive Levenshtein edit distance paper, and it is in English! The paper can be found at:

    http://sascha.geekheim.de/wp-content/uploads/2006/04/levenshtein.pdf

    You better get it before it's gone! Still no luck with the others.

  3. Enro says:

    There is plenty of work on this topic of erroneous citations, I am aware of at least three:
    - Richard Dawkins discusses in a footnote of "The Selfish Gene" an error in the references to Hamilton's 1966 paper "The Genetical Theory of Social Behaviour" (instead of "The Genetical Evolution of Social Behaviour") and tracked it down as an example of a replicating "meme"
    - Simkin and Roychowdhury used citation errors in the physics literature to estimate that only 20% of citers read the original article they cite
    - Jean Lobry listed in his thesis (Fig. 1.6) all the variations on Jacques Monod's article "Recherches sur la croissance des cultures bactériennes" found in the literature, represented as mutations

  4. egarcia says:

    I love this one: Entries with same reference titles with English derivatives like "behavior" vs. "behaviour", "modeling" vs. "modelling" and the like. Since such tokens can create different records in a digital library (no collision) at least one version must be spurious.

    According to WorldCat, no papers, only books, are associated with the above Zipf reference. See http://www.worldcat.org/identities/lccn-no2004-44803

    A search on worldcat http://www.worldcat.org/search?q=%27Human+behavior+and+the+principle+of+least+effort&qt=owc_search
    returned 9 books and 1 relevant article, but the latter is not from Zipf.

  5. william says:

    Enro,

    Hi! Thanks for the references; very amusing. I must admit, though, that I don't fully understand the Simkin and Roychowdhury paper. First, they assume that all possible page miscitations are equally likely, when it would seem that certain miscitations are more likely than others (off-by-one, swapping digits). This is important because they use this assumption to justify their conclusion that if the same page misciting is done more than once, the second misciting must be a copy. Figure 1 is particularly suspicious: the most frequent miscite has 78 occurrences, whereas their model predicts 13; if you reduce the 78 to 13 then the copy rate goes from about 80% to 65%. There could be several people who make the common mistake of saying, say, "the paper is 12 pages long; I know the first page is 141; so the last page is 153". Second, I don't understand how they extrapolate the copy rate from the miscitations to the correct citations; there are 200 of the former, 4,100 of the latter. Third, they implicitly claim that the result here, which applies to a highly-cited, old paper, can be carried over to all papers in general.

    I did a bit of a literature search, but while several people cite Simkin and Roychowdhury, I couldn't find anyone really discussing or explaining the result.

  6. william says:

    Shane,

    Thanks for the Salton reference; I'll have a closer read of it. (I'm actually in the middle of reading through the early SMART reports.)

    It was in fact Zipf that I had in mind when I referred to "the distribution of a phenomenon". As it happens, the University of Melbourne library has "Human behaviour and the principle of least effort", so you can go, flick through, and no longer feel guilty for citing it!

    The Zipf citation does make an interesting case study, though. The Zipfian distribution is of course very widely referred to, and frequently when people do so, they give a cite to Zipf's book. But what do they mean when they do so? One reason for citing is to say, "if you want to know more about this topic, go and read this reference"; but then, if the reference is largely unavailable, it's not a very helpful citation to point the reader to (besides which, citing a 573-page book on social psychology in order to explain a mathematical distribution is a bit obtuse). A second reason for citing is to acknowledge that an idea is not original to you, but I doubt people will suspect that John Q. Author is trying to claim credit for the Zipf distribution. A third reason for citing is to say "I'm not going to prove, justify, or give evidence for this point here; if you want a proof, read this reference". Citing Zipf's book would be justified in this case if you were making the claim that the phenomena he states as following his distribution in fact do so, but hardly if you just want to use the distribution. A fourth reason is because you are citing the item as part of an historical overview; but you would only do this here if you were writing a history of the discovery of distributions (or of social psychology, or of famous citations), and in that case, you should surely have read the book. The final reason, of course, is because reviewers expect you to cite it; in this case, there is I guess not much you can do about it!

    This all came up for me recently because I'm collecting historical materials for my background chapter on IR evaluation. One of the most highly cited references here is the report by Sparck Jones and van Rijsbergen proposing the construction of an "ideal" test collection, which is often referred to as part of the background of the TREC effort. This report, however, is extremely rare; there is apparently no copy in Australia, nor is there a copy online. Fortunately, Mark Sanderson had a copy. If I hadn't been able to get a copy, though, I wouldn't have cited the report; I would have had to content myself with citing someone else's description of the report (and hope they had actually read it!)

  7. Shane Culpepper says:

    Wow. I'm positive I checked the library holdings for Zipf's book when I was stationed up the street. I now feel obligated to come for a visit and read it so that I can cite it guilt free.

  8. [...] Citing papers that you’ve never read — or that were never written « IREvalEtAl [...]

  9. [...] William Webber points out that there are cited papers that have never even been written! One famous paper has 215 citations on Google scholar despite the fact that it doesn’t exist. And at least one of those citations is by the author of the non-existent paper! [...]

  10. Shane Culpepper says:

    I came up to Baillieu and had a look at Zipf's book (again). Upon pulling it from the stack, I had a moment of deja vu. I have indeed read through portions of it before but had completely blocked it from my memory! I did not find the book particularly interesting this time either which might be why I've forgotten it.

    On the other hand, this whole exercise also led to finding the elusive Levenshtein distance paper. I have never cited the paper, despite having a section on approximate pattern matching and edit distance in my thesis. I thoroughly enjoyed reading the original even if the content is covered in many textbooks and now commonplace enough to not explicitly require a citation. Perhaps this is another plausible reason such citations occur. Sometimes a paper or book might be a little difficult to get, but worth the effort.

  11. william says:

    Yes, often the original expositions are among the best. For instance, Kendall's (1948) description of Kendall's tau and other rank correlation measures is very readable, as are the original Cranfield reports.

  12. [...] This post was Twitted by tek_news [...]

  13. [...] Academic writing – Citing papers that you’ve never read or were never written. [...]

  14. foobar says:

    Guess: The paper was to be published, was referenced as such, but never was. You may find it as a tech report. Happens quite often.

  15. [...] Shared Citing papers that you’ve never read, or that were never written [...]

  16. [...] learned: Publication lists are full of typos. Even though I use software tools to manage lists of papers and citations from Google Scholar or [...]

  17. Emilio Pisanty says:

    One foundational paper in the quantum mechanical study of the phase of a light beam is
    L Susskind and J Glogower. Quantum mechanical phase and time operator. Physics, 1:49, 1964,
    which gets cited in just about every paper on the subject, even if they don't actually discuss the formalism. The thing is, since the journal only lived from 1964 to 68, this paper is just about impossible to get, with only a handful of libraries (which if anyone's interested can be found on WorldCat) carrying the journal. (Also, the journal is close to impossible to google due to an unfortunate pre-internet-age name.) I can't tell for sure, but I'm not certain everyone who cites it has read it.
