A regrettably common practice in academic writing is to cite papers because someone else cites those papers, without having read them yourself. Justin warns about this in his book on writing for computer science, and I know he personally has a few good anecdotes. The practice is particularly amusing when the paper is incorrectly referenced or simply doesn't exist; the original citer has made a mistake, and this has been blindly carried forward by their imitators. Via Edel Garcia comes "The Most Influential Paper Gerard Salton Never Wrote", an article by David Dubin tracing the history of the vector space model as applied to the field of information retrieval. In this article, Dubin points out that a highly cited paper, "A Vector Space Model for Information Retrieval", published by Gerard Salton in 1975 in the Journal of the American Society for Information Science, does not in fact exist.
Not wanting to be hoist by my own methodological petard, I verified Dubin's claim that there is no article of that title in JASIS for that year, though there is one by Salton et al. titled "A theory of term importance in automatic text analysis" in Issue 1 of that year's volume. There is a paper by Salton et al. in the November 1975 issue of Communications of the ACM with the similar title "A vector space model for automatic indexing", but CACM is not the journal that usually gets cited, and of course even if it were, the title would still not be correct. (Both these articles are noted by Dubin.) Nevertheless, the non-existent article is cited 215 times according to Google Scholar. Amusingly, the earliest citations that Google reports are from papers that Salton co-authored, published around the time of his death: Amit Singhal and Gerard Salton, "Automatic Text Browsing Using Vector Space Model" (IEEE Dual-Use Technologies and Applications Conference, 1995), and Amit Singhal, Gerard Salton, Mandar Mitra, and Chris Buckley, "Document Length Normalization" (Information Processing and Management, 1996). The latter is itself a highly cited paper (672 citations, according to Google Scholar), because it introduces the widely used pivoted cosine variant of the vector space model. It is possible therefore that the mistaken citation was first made at this time, and copied forward from there. Of course, Google Scholar itself may be biased towards later articles, because they are more likely to be in digital format. I haven't been able to locate phantom cites prior to 1995 in a very brief look through earlier literature; I'd be interested to hear if anyone knows of any.
I take it from Dubin that part of the reason this particular misciting is so pervasive is that authors need such an article to exist, and to have been written by Gerard Salton around this time. A certain understanding of the vector space model has come to be accepted as the early method for calculating document-query similarity (essentially, each of the n terms in the vocabulary becomes a dimension in n-dimensional space; documents are represented as points or vectors in this space; and the nearness of these vectors is used to measure the similarity of the documents). Gerard Salton was a pioneer in statistical approaches to document-query retrieval. Anyone discussing the history of statistical IR wants a handy cite on the vector space model, and feels the cite should be to Gerard Salton. So a phantom paper by Salton called "A Vector Space Model for Information Retrieval" fits the bill perfectly.
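For readers who have not met the vector space model in its commonly accepted form, the parenthetical description above can be made concrete with a small sketch. This is only an illustration of the textbook understanding, not a reconstruction of anything in Salton's papers: the function names, the toy documents, the use of raw term counts as vector components, and the choice of cosine similarity as the measure of "nearness" are all my own assumptions.

```python
import math
from collections import Counter


def build_vocabulary(texts):
    """Assign each distinct term its own dimension (an index)."""
    terms = sorted({term for text in texts for term in text.lower().split()})
    return {term: index for index, term in enumerate(terms)}


def to_vector(text, vocabulary):
    """Represent a text as a vector of raw term counts (an assumed weighting)."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for term, count in counts.items():
        if term in vocabulary:  # terms outside the vocabulary are ignored
            vector[vocabulary[term]] = count
    return vector


def cosine_similarity(a, b):
    """Nearness of two vectors, measured here (by assumption) as the cosine of their angle."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy collection and query, for illustration only.
documents = [
    "vector space model for retrieval",
    "cooking pasta at home",
]
query = "vector space retrieval"

vocabulary = build_vocabulary(documents)
query_vector = to_vector(query, vocabulary)
similarities = [
    cosine_similarity(query_vector, to_vector(doc, vocabulary))
    for doc in documents
]
```

Here the first document shares three terms with the query and so scores higher than the second, which shares none and scores zero. Real systems weight the components rather than using raw counts, which is where refinements such as the pivoted cosine variant mentioned earlier come in.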
There is a broader issue here to do with citation practice. An author of an academic paper needs to refer to a variety of ideas or facts (the origin of a retrieval model; the average length of a search query; the frequency distribution of a phenomenon) that are not central to the main point of their paper, but need to be stated for background or motivational reasons. The author feels that these facts are (or should be) generally known, or that the general idea is what matters, not the exact details. Nevertheless, such facts cannot simply be stated; they require a citation. (And, of course, a longer list of references rarely hurts a paper's chance of getting accepted.) So an assortment of such handy cites gets collected and handed on, like a useful tool box, from one researcher to another.
I should stress that there is much more to Dubin's essay than the points I have mentioned here. Dubin's argument is not just that the phantom article doesn't exist, but that the understanding of the history of the vector space model that requires such an article to exist is itself mistaken. According to Dubin, vector spaces were first introduced by Salton and collaborators as a handy way of visualising and describing similarity computations; other papers misunderstood Salton as proposing the vector space model as a formal model of information retrieval, and criticised it on that basis; and it was in response to those criticisms that Salton finally, in the 1980s, explicitly described the vector space model as it is now understood. The lesson is that sometimes, not only are the citations phantoms, but so too is the "common knowledge" that underlies them. So if you are writing about the vector space model, which paper should you cite? You'll have to read Dubin's essay yourself, and make up your own mind! I highly recommend it as an interesting blend of formal analysis, intellectual history, and original criticism.
Returning to the topic of phantom citations, Eamonn Keogh has a couple of slides (slides 112 and 113) about citing papers you haven't read, including a misspelling of a collaborator's name that has been carried on in later citations. Anyone know of other examples?