Via Andrew Gelman and Howard Wainer, an interesting meta-analysis from 2005 by Pan, Trikalinos, Kavvoura, Lau, and Ioannidis (the last of "Why most published research findings are false" fame), comparing reported statistical significance and effect sizes in gene-disease association studies performed in mainland China with those performed elsewhere. There is a widely recognised publication bias throughout the scientific literature towards statistically significant results. In medicine, this frequently plays out as an initial publication that finds statistical significance and a strong effect for a treatment, gene-disease link, or other hypothesis, followed by subsequent studies that report weaker effects or fail to find significance. The pattern is understandable even if all published studies come from the same meta-population of performed studies: non-significant studies aren't publishable until a (perhaps by chance) significant study legitimizes the gene-disease link.
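This filtering mechanism is easy to demonstrate with a small simulation (my own sketch of the general phenomenon, not part of Pan et al.'s analysis): when only significant results are publishable, the published record systematically overstates effect sizes, even when the true effect is exactly zero.

```python
import random
import statistics

random.seed(0)

# Each "study" estimates a true effect of exactly zero; the estimate is a
# standard normal draw (a standardised effect with standard error 1).
N = 100_000
estimates = [random.gauss(0.0, 1.0) for _ in range(N)]

# Two-sided test at alpha = 0.05: "significant" iff |z| > 1.96.
# Under publication bias, only these studies reach print.
published = [z for z in estimates if abs(z) > 1.96]

print(f"share significant:            {len(published) / N:.3f}")   # roughly 0.05
print(f"mean |effect|, all studies:   {statistics.mean(abs(z) for z in estimates):.2f}")
print(f"mean |effect|, published only:{statistics.mean(abs(z) for z in published):.2f}")
```

The roughly 5% of null studies that clear the significance bar by chance report a mean absolute effect around three times that of the full set of studies, which is the weaker-follow-up pattern in miniature.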
Pan et al. find, though, that follow-up Chinese studies are more likely to report statistical significance than non-Chinese ones (48% to 18%), even though the Chinese studies generally have smaller sample sizes. Reported effect sizes are also much stronger. One might posit a still stronger publication bias towards significance in local Chinese journals than in international ones, but the proportion of Chinese studies reporting significance is even higher in PubMed-indexed journals than in non-indexed ones (65% to 46%), albeit based on a much smaller number of studies (20 against 141). It seems that Chinese studies, at least pre-2005, were inherently more likely to find significance, perhaps due to weaker experimental controls. Of course, the same analysis might well hold for other national-level studies, particularly in developing countries.
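The sample-size caveat can be made concrete with a standard two-proportion z-test. The counts below are my back-calculations from the rounded percentages (65% of 20 is 13; 46% of 141 is about 65), so the exact figures may differ slightly from Pan et al.'s.

```python
import math

# Counts reconstructed from rounded percentages (an assumption):
sig1, n1 = 13, 20     # Chinese studies in PubMed-indexed journals reporting significance
sig2, n2 = 65, 141    # Chinese studies in non-indexed journals reporting significance

p1, p2 = sig1 / n1, sig2 / n2
pooled = (sig1 + sig2) / (n1 + n2)

# Two-proportion z-test with pooled variance.
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p from the normal tail

print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

With these assumed counts, z is about 1.6 and p about 0.11, so the journal-type difference would not itself reach conventional significance; the 20-study sample is simply too small to settle the question.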
I find the study by Pan et al. most interesting for its warning about the interpretation of statistical significance. Significance is formally defined as the probability that a result at least as extreme as the one observed would occur by chance even if the null hypothesis (usually of "no effect") were true. But when trying, as a third party, to interpret reported significance for an experiment, other likelihoods become more important. What, for instance, is the likelihood that this published significant result is the chance tip of an iceberg of unpublished, non-significant results for the same experiment? What is the likelihood that significance has occurred because of a weakness in the experimental setup (for instance, choosing a weak baseline in the evaluation of an information retrieval technique), or even conscious manipulation of results by the researcher? We can, for instance, only tell that the reported significance rate in the Chinese studies surveyed by Pan et al. is unusual because there is a comparison set of non-Chinese studies.
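The iceberg question has a rough quantitative shape (an illustrative calculation, not from Pan et al.): if k research groups independently test a truly null gene-disease link at the 0.05 level, the probability that at least one obtains a publishable significant result is 1 − 0.95^k, which grows quickly with k.

```python
# Probability that at least one of k independent tests of a truly null
# hypothesis comes up significant at alpha = 0.05 - the publishable
# "tip" that may conceal k - 1 unpublished null results.
for k in (1, 5, 10, 20):
    prob = 1 - 0.95 ** k
    print(f"{k:2d} hidden studies -> P(at least one significant) = {prob:.2f}")
```

By twenty attempts, the chance of at least one spurious "discovery" is already close to two in three.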
Conversely, other forms of evidence and reasoning, currently deprecated in favour of the experimental test for statistical significance, deserve more weight, if only in relative terms. In the medical field, there is an ongoing debate between Evidence-Based Medicine (EBM) and Science-Based Medicine (SBM). EBM can be summarized as holding that "randomized control trials (RCTs), and only randomized control trials, constitute proof or disproof of a treatment's effectiveness". SBM counters that other forms of evidence or reasoning, such as a plausible mechanism of operation, or observational studies, are often required. For instance, the RCT is overly favourable to scientifically dubious treatments, due to twin effects: some studies of treatments with null effects will by chance achieve significance; and studies failing to find significance are not in themselves conclusive. A case in point is the succession of studies, and even meta-studies, into the effectiveness of intercessory prayer as a medical treatment, that find no significant effect and yet call for "further study", despite the cost of such study and the inherent scientific implausibility of the proposed mechanism. (The blog post linked to above remarks amusingly on the difficulty of performing a double-blind trial "when the putative agent of the effect is an omniscient being".) Also, conclusions must sometimes be reached on treatment effectiveness where RCTs are not possible, for instance because observational studies have both indicated no positive effect and suggested there may be negative side effects (see "If there are no randomised controlled trials, do we always need more research?").
What does all of this mean for the practice of empirical research in computer science, and in particular in information retrieval? Since the medical field has developed the techniques of experimental statistics to a particularly high standard, its methodologies frequently inspire, or even are mimicked by, those in experimental computer science. (I've participated in more than one conversation that ran "assume our topics are patients, and our retrieval algorithms are treatments...".) This inspiration can be misleading, as the fields are very different: human subjects of medical treatment are far more fixed, homogeneous, and generalizable than, say, information needs in retrieval; the notions of a population and a sample are much more clearly defined in the former case than in the latter. But even in the medical field, as the examples cited above show, the levels of the hierarchy of evidence are not as absolutely incommensurable as they have sometimes been treated. Reported statistical significance in a randomized controlled trial is not always what it seems; the significance of significance must be assessed using other evidence; and sometimes an RCT is simply not available. In empirical computer science, too, we should question whether a line of research whose methodology allows the calculation of statistical significance is therefore really more rigorous (let alone more relevant) than one for which such statistical apparatus is not available.
Update 16/05/2011: More on significance in Chinese studies in this New Humanist article by Sam Geall:
Publication bias – the tendency to privilege the results of studies that show a significant finding, rather than inconclusive results – is notoriously pervasive. One systematic review of acupuncture studies from 1998, published in Controlled Clinical Trials, found that every single clinical trial originating in China was positive – in other words, no trial published in China had found a treatment to be ineffective.