If I had a ten-thousand-node cluster...

Another aspect of this year's SIGIR reviewing, though hardly a new one, was the number of papers from industry. As a fellow reviewer observed, these often come across as nicely polished internal tech reports; the kind of thing you should probably read once your globally-distributed search infrastructure has reached the million-query-a-day mark; a sort of post-coital pillow talk between the Yahoo and Bing engineering teams, upon which the rest of us are eavesdropping (and Google presumably doesn't care to).

My own concern is not so much with the general applicability of this work, as it is with its reproducibility. In general, industry research is evaluated over datasets to which other researchers have no access. There is, therefore, no hope for the direct validation of the reported results; and given the specificity of the environments and datasets described, little hope for indirect reproduction, either.

Anyone who is working in experimental data analysis who has not already done so needs to read Keith A. Baggerly and Kevin R. Coombes, Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology, The Annals of Applied Statistics, 3(4) (2009). In it, the authors reverse-engineer microarray studies on responsiveness to cancer treatment, studies which omit the data and details needed for direct reproduction. Baggerly and Coombes find a catalogue of errors: sensitive and resistant labels for subjects switched; data columns offset by one; faulty duplication of test data; incorrect and inconsistent formulae for basic probability calculations; and so forth. They comment that "most common errors are simple ... [and] most simple errors are common". And these are not in obscure papers, but in large-team studies, which have led to patent grants and clinical trials -- trials in which (often enough) errors in the original papers meant that patients were being given contra-indicated treatments.

Fortunately, in information retrieval, no lives are at risk; what industrial research groups do in the privacy of their data centers is between them and their multiply-redundant servers. But scientifically, we are in the same situation, or worse. We are accepting for publication research that we have no way of checking or validating, that could be riddled with the most basic of errors. As readers and reviewers, we have only our sense of whether something feels right, and that for a domain in which we have little or no direct experience. Is this really desirable?

14 Responses to “If I had a ten-thousand-node cluster...”

  1. Matthias says:

    True. Then again, nobody validates other people's papers anyway, so why all the fuss about industry papers?

    Don't get me wrong. I'd love to see papers validated or refuted by the IR research community. But my own experience in working on a mere 5-year-old paper was rather bleak. Sorry, the data isn't there anymore. I mean, okay. Fair enough. It was an industry paper. But I cannot remember stumbling over any paper that tried to repro another researcher's publication.

    And: who has the time? I'd rather work on new stuff than risk that repeating a fellow researcher's experiment just results in "yes, that experiment was okay - nothing to see here - move on". Nobody publishes that kind of stuff.

    Plus: we aren't the Cochrane Collaboration. In medicine you can validate a paper while you're doing your daily work. In IR you have to set up an experiment, which is really expensive - in interactive retrieval, both in time and in money to pay the participants.

    I'd love to see a spot in each conference proceedings (or journal - but that is a whole different story - see the current CACM) open for reproduced experiments. Otherwise people are competing for spots and novelty will always win over boring redundancy.

  2. Ian says:

    In lab-style IR experiments there is no excuse for irreproducibility. The data is available, and the relevance judgments are built in a manner which demands interchangeable processes (TREC doesn't work if everyone doesn't do the same task).

    However, historically, venues do not demand this. I feel that we should. It's easy for me to say, I spend my days giving data away, but the arguments are compelling and important.

  3. Daniel says:

    Our traditional peer review model is precisely limited by the fact that all you can do as a reviewer is check whether it "looks good". You just don't have time to reproduce anything, whether you have access to giant computational means or not.

    I would also point out that in fields other than Computer Science, it is quite frequent to publish work which requires very expensive hardware.

    If something is "peer reviewed", in the conventional sense, it does not mean that it is "correct", only that "it looks good according to a self-selected set of experts".

    Come on! Over half the research papers are never cited, and most cited papers are never fully read by anyone outside the authors and maybe the reviewers. Would you really bet your life on the correctness of any paper taken at random among accepted SIGIR papers?

    For specific thoughts, please see my blog (sorry for the plug):

    Can Science be wrong? You bet!
    http://lemire.me/blog/archives/2010/09/17/can-science-be-wrong-you-bet/

    The mythical reproducibility of science
    http://lemire.me/blog/archives/2010/04/20/the-mythical-reproducibility-of-science/

  4. william says:

    Thanks all for your comments.

    Matthias, I agree, interactive studies are expensive to repeat; but as Ian points out, data-driven studies, if properly packaged, should be repeatable by invoking a shell script.

    We have a weak culture of reproducing other research. As a senior IR researcher pointed out to me, this is in part because we are an under-resourced field (compared to, say, medical research, or even language technologies), and (as Matthias says) we therefore focus our efforts on new work.

    But reproducibility doesn't have to mean fully re-running the experiments. As a reviewer, while (as Daniel says) I don't have time to do full re-production, I do occasionally have a look at the dataset that a paper is using to check if some surprising result they are reporting makes sense. I did this for one public-dataset SIGIR paper, and the result did make sense; I wanted to do it for one of the industry papers, but the data was not available.

    Also, there are certainly a number of studies, often quite famous (or, for poor reproducibility, notorious) ones, where subsequent researchers have tried and failed to achieve the same effectiveness levels reported by the original authors. As Ian says, there really should be no excuse for this: these studies are almost all data-driven, and their reproducibility should be automated.
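
As an aside on the "invoking a shell script" point above, here is a minimal sketch of what a packaged, re-runnable experiment might look like, written in Python rather than shell. Everything in it is an assumption for illustration: `run_experiment` is a hypothetical stand-in for a real retrieval run, and the toy byte string stands in for a test collection. The idea is only that the input is pinned by checksum, the randomness by seed, and the reported numbers are checked by re-running.

```python
import hashlib
import random


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest, used to pin the exact input data."""
    return hashlib.sha256(data).hexdigest()


def run_experiment(data: bytes, seed: int) -> dict:
    """Hypothetical stand-in for a real retrieval experiment.

    It draws a seeded random sample of the input bytes and reports their
    mean, so the result depends only on the data and the seed.
    """
    rng = random.Random(seed)  # local generator: no hidden global state
    sample = rng.sample(list(data), k=min(10, len(data)))
    return {"mean_byte": sum(sample) / len(sample)}


def reproduce(data: bytes, expected_sha256: str, seed: int,
              reported: dict, tolerance: float = 1e-9) -> bool:
    """Re-run the experiment and compare against the reported numbers."""
    if sha256_of(data) != expected_sha256:
        raise ValueError("input data differs from the data used in the paper")
    result = run_experiment(data, seed)
    return all(abs(result[name] - value) <= tolerance
               for name, value in reported.items())


if __name__ == "__main__":
    toy_data = b"a toy stand-in for a test collection"
    published = run_experiment(toy_data, seed=42)  # what the "paper" reports
    print(reproduce(toy_data, sha256_of(toy_data), seed=42, reported=published))
```

A reviewer with the data could then call `reproduce` with the checksum, seed, and numbers given in the paper; a checksum mismatch fails loudly instead of silently producing different results.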

  5. Matthias says:

    It sounds like we all agree that it would be great to be able to reproduce and validate other people's experiments (OPE - how can I explain it?). Now the question is: how to achieve this? Publishing one's source code, e.g., won't cut it because there is no incentive for it. In fact, there is an incentive not to publish it: if I publish my code and/or my data I expose all my flaws and those of my research, be it buggy indexing or bad interpretation of the data. It allows strangers to endanger my (currently non-existent) scientific reputation, or at least my reputation as a programmer. Furthermore, it allows other researchers to continue my work and maybe publish new results before I do.

    I could imagine that reproducibility can be achieved by a three-pronged approach: a) publications should demand the data and software needed to repro a data set used in a paper, b) publications should reserve slots for reproduced studies - regardless of their outcome, and c) creating a climate in which the guy who reproduces a fellow researcher's experiment isn't the laughing stock of the community (as Leonard is in The Big Bang Theory).

    And imagine how cool that would be: flaws in studies would be discovered. Researchers could learn from mistakes of fellow researchers - and their own. Knowledge from faulty studies could be corrected, leading to better models and methods. Maybe even the TREC results would, at last, show an upward trend ;-) And the IR community could keep on boasting with confidence of the great experimental culture that it has.

    If this really caught on, even those failed attempts at reproducing studies would lead to a larger data set making meta studies feasible. Especially in interactive retrieval, I'd bet that tons of things cannot be reproduced between different cultures. Take eyetracking of web pages as an example. Observe a Chinese and an American reading web pages and you'd get entirely different results. That doesn't mean the study is flawed - just our model of how people read.

  6. eugene says:

    Non-industrial papers have the same problems. I once tried to reproduce some results from a non-industrial paper (a conference with acceptance rate ~7%), which used an open dataset for experiments. But instead of giving a full description of the optimization procedure, the authors just wrote "we performed intensive parameter selection". I got results different from those in the paper.

    The reasons are obvious: due to space constraints you cannot thoroughly describe your research in a conference paper; and it is usually easier to do original and interesting research with a closed dataset.

    In my opinion IR is sick; IR is not science while we cannot ensure full reproducibility: open datasets and open-source code.

  7. william says:

    Matthias: change can come through a gradual evolution in reviewing standards. If reviewers start up-marking papers that enable full reproducibility (downloadable source and data, scripted and ready to run), then those papers will start pushing out the unreproducible ones.

    Conference publication is an obstacle here: there's no return cycle by which the reviewer can request the author to provide data and code; there's a rush to submit that prevents packaging of experiments, and a rush to review that prevents testing them; and there's no scope for appendix material that expands upon methods and procedures.

  8. Jon says:

    I, too, have had trouble reproducing results from several different studies (industry and academic), even using the code researchers have published and the same datasets! There are nuances of running experiments that never make it into publication, and as all of you have said, almost no incentive to distribute experimental tools as an easily executable package.

    But there are some excellent research groups who continually produce high-quality software & data along with their publications. Thorsten Joachims's group at Cornell comes to mind. Their papers are heavily cited & the software is used as a baseline again & again. The LETOR dataset effort at MSR is also commendable - everyone doing research in this area should (and usually does) include comparisons with that dataset. They include copious details on the parameter tuning for most of the baselines distributed with the data. (Many IR research groups do a great job of this, but it's still not common practice to distribute parameter files, list software versions, etc. when publishing.)

    I have the impression that the success of these efforts, and of TREC, has been through the steady involvement of passionate & determined senior researchers leading them. Not all IR researchers are fortunate enough to have such high-quality guidance, and it shows. As an IR PhD student, there are high incentives to publish papers, but not much else. Good, usable software design & support aren't rewarded.
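
As an aside on the "parameter files, software versions" point above, here is a minimal sketch of a run manifest that could be shipped next to a paper's results. It is only an illustration under stated assumptions: `build_manifest` is a hypothetical helper, the package name `my_ranker` and the parameter values are made up, and a real experiment would record whatever its own toolchain actually uses.

```python
import hashlib
import json
import platform
import sys


def build_manifest(params: dict, data: bytes, packages: dict) -> dict:
    """Collect what a reader would need to re-create the run."""
    return {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        # Sorted so that two manifests can be compared line by line.
        "packages": dict(sorted(packages.items())),
        "parameters": dict(sorted(params.items())),
        # The checksum pins the exact input data without publishing it here.
        "data_sha256": hashlib.sha256(data).hexdigest(),
    }


if __name__ == "__main__":
    manifest = build_manifest(
        params={"k1": 1.2, "b": 0.75},        # illustrative values only
        data=b"toy stand-in for a collection",
        packages={"my_ranker": "0.1.0"},      # hypothetical package name
    )
    print(json.dumps(manifest, indent=2))
```

Writing this as JSON keeps it diffable, so a reader who gets different numbers can at least see which parameter, library version, or data file differs from the original run.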

  9. Itman says:

    I agree with the other commenters who stated that reproducibility in general may be problematic. Yet industry papers are especially bad in this respect. Partly, this is because corporate publishing is a maneuvering between the Scylla of PR and the Charybdis of corporate paranoia.

  10. Matthias says:

    William: how about "Conditional accept: Give us your code and data to share with all the conference attendees and we publish your paper"?

  11. I think the right question is not whether we should require every piece of work to be reproducible, but what's the most efficient way to spend our effort to move science forward. I suppose lots of people care about rewarding whoever invented something, but just about everything interesting has multiple authors and approaches working on it over time anyway.

    There's an incentive to publish your code and open source everything, including ideas. If the source works, everyone will use it and your citation count will go through the roof if only because of the software. If you're halfway decent at what you're doing, you'll have way more ideas than time to execute them and like everyone else, you'll value feedback. The usual problem is getting people to read about your ideas, not trying to keep them safe from being stolen!

    I basically size up the ideas in a paper, decide if I believe them, then try some of them out. I think others do the same, and a body of techniques that work across a broad range of problems emerges. It's what the biologists would call "biological replicates" (same experiment on different sample) rather than "technical replicates" (same experiment on the same sample).

    I have also spent a good deal of time over the last ten years replicating papers. Many of the tutorials for LingPipe are based on replicating papers in the literature. Most things worked. This is also common in stats. You borrow someone else's model and apply it to your own data. So I can't replicate Blei and Lafferty's experiments on the journal Science over 100 years, but I can do it with 5 years of the NY Times.

    There is the nagging problem of parameter settings. Michael Collins's parser is the most notorious example in natural language processing. Good idea, but nobody could replicate it even given his thesis. He was open about explaining how he hacked all the boundary conditions, but it took a 50-page journal paper to present all of these. That was only done because Michael released an executable that ran the model and it clearly worked really well. On new data, too. So that's a kind of positive validation that what he did worked, but negative replicability if you judge narrowly by whether doing exactly what was written gives the same result.

    Part of the problem is that papers are more like recipes than robot microcode. A great chef's likely to do better using the same recipe as an amateur. They know how all the tools work and how to use them even if they've never cooked a particular dish. So all those vague details that can trip up an amateur don't bother them at all. And like with recipes, writing is a very different skill than cooking.

  12. Danny Calegari says:

    "Fortunately, in information retrieval, no lives are at risk"

    Lives are always at risk in basic science, even if it is in the abstract sense that the potential to save lives 10 or 50 years from now is diminished because progress in a field stalls. We have no idea what scientific or technological advances might build on fast, reliable, accurate information retrieval in the future, especially as the sources of such information proliferate. People already google words or phrases to determine correct spelling. Maybe in 10 years your phone will google the street you're walking down to see if anyone was mugged there recently.

  13. Elad says:

    I think you’ll agree that industry is restricted by the data it can make available. I still remember the outcry over the AOL query log. So the question is more, should we allow industry to publish even if the authors can't make their data or their code available.

    In general, it seems to me that we should distinguish between two forms of reproducibility, which I will term strong reproducibility and weak reproducibility. The former means that you can take the concepts identified in a paper (say, that TF-IDF is a useful weighting method) and show that you can obtain similar gains to those found in the paper, but not on the same data. The latter means that you can take the same data and code and reproduce the same numbers that the paper cites.

    I argue that strong reproducibility is the form to which we should aim, and that for this kind of reproducibility, it doesn’t matter that Industry can’t make its data available, as long as someone can find similar data.

    Weak reproducibility is less interesting, unless you assume that the authors made significant errors, in which case, strong reproducibility will catch them. Furthermore, for weak reproducibility, it doesn’t really matter if you’re talking about industry or academia, since there are always numerous details (how was that parameter set?) and the data isn’t always available, even if it seems so at first (i.e., the student who crawled some random subset of a web site at a specific time). TREC is the exception, but there is only so much you can do with TREC data.

    Full disclosure: I work for industry.

  14. Matthias says:

    @Elad: Good point. But you'd certainly agree that one cannot assume strong reproducibility if weak reproducibility isn't given. Information Retrieval is a field where nobody can predict anything about a combination of corpus, retrieval model, weighting function and parameters (and user population, if you're into interactive retrieval) or any subset of these. So strong reproducibility is just a pipe dream.
    Start with a TREC ad hoc corpus, BM25 and US college grad students. Now change the US students for Chinese students. Same results? I doubt that. Replace the TREC corpus with German newspapers. Won't work anymore unless you spend a few hours on a ten-thousand-node cluster optimizing the parameters. (I mean, it's not like there is a function that takes a corpus as a parameter and outputs the best parameters for any weighting function.) Now move from German newspapers to German books.
    That also means every time somebody cannot repro the experiment of a fellow researcher, the fellow can always say "strange, it works for my setup - unfortunately I must not tell you what I used." and the wonderful situation emerges where everybody can publish anything as long as it doesn't have too many typos, lists the usual suspects in the references and has results that are better than some arbitrary baseline.
    Yes, I rant. It's my thing.
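
To make the parameter-tuning point above concrete, here is a minimal sketch of BM25 scoring with a brute-force search over its two parameters, `k1` and `b`. It is an illustration under stated assumptions, not anyone's actual setup: the IDF formula is one common non-negative variant, the three-document "collection" is a toy, and `grid_search` and `reciprocal_rank` are hypothetical helpers.

```python
import math
from collections import Counter
from itertools import product


def bm25_scores(query, docs, k1, b):
    """Score each tokenised document against the query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # b controls how strongly long documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def reciprocal_rank(scores, relevant_index):
    """1 / rank of the relevant document (rank 1 is the best score)."""
    rank = 1 + sum(s > scores[relevant_index] for s in scores)
    return 1.0 / rank


def grid_search(query, docs, relevant_index, k1_values, b_values):
    """Return the (k1, b) pair with the highest reciprocal rank."""
    return max(
        product(k1_values, b_values),
        key=lambda kb: reciprocal_rank(
            bm25_scores(query, docs, *kb), relevant_index),
    )


if __name__ == "__main__":
    docs = [
        ["a", "x"],                                # short, one match
        ["a", "a", "x", "y", "z", "w", "v", "u"],  # long, two matches
        ["q", "r"],                                # no match
    ]
    for relevant in (0, 1):
        best = grid_search(["a"], docs, relevant, [1.2], [0.0, 1.0])
        print("relevant doc", relevant, "-> best (k1, b):", best)
```

On this toy collection the best `b` flips from 1.0 to 0.0 depending on whether the short or the long document is judged relevant, which is the small-scale version of the complaint: the "right" parameters are a property of the corpus and judgments, and the search itself is the expensive part.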
