At the most recent TIGER reading group, Mark Sanderson presented Bland and Altman's introduction to multiple significance tests and the Bonferroni method. The basic point is simple: if you keep trying different experiments, and testing each for significance, then eventually you will find significance by chance, even where no real effect exists. Therefore, if you are performing multiple significance tests, you need to adjust your p values upwards.
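To put numbers on "eventually": assuming k independent tests, each run at a nominal level of α = 0.05 (the figures here are mine, for illustration), the chance of at least one spurious significant result under the null is 1 − (1 − α)^k, which grows quickly with k:

```python
# Probability that at least one of k independent tests comes up
# "significant" purely by chance, when the null hypothesis is in
# fact true for all of them: 1 - (1 - alpha)^k.
alpha = 0.05
for k in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests: P(at least one false positive) = {fwer:.2f}")
# 20 tests already give a 64% chance of a spurious "significant" result.
```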
In determining the nature and amount of adjustment required, the crucial factor is the dependence between the multiple tests. The simplest case is where the multiple tests are independent, in which case the Bonferroni adjustment can be applied. Put simply, this adjustment multiplies the achieved p values by the number of tests performed. So, for instance, if you perform k significance tests in parallel, and your required significance level is α, then you need to get a p value less than α/k on any one of these tests for it to be significant.
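A minimal sketch of the adjustment itself; the function name and example p values below are mine, not from Bland and Altman:

```python
# Bonferroni sketch: multiply each achieved p value by the number of
# tests k (capping at 1), or equivalently compare each raw p value
# against alpha/k. The two views are interchangeable.
def bonferroni(p_values, alpha=0.05):
    k = len(p_values)
    adjusted = [min(1.0, p * k) for p in p_values]
    significant = [p < alpha / k for p in p_values]
    return adjusted, significant

adjusted, significant = bonferroni([0.004, 0.03, 0.20])
print(adjusted)     # approx. [0.012, 0.09, 0.60]
print(significant)  # [True, False, False]
```

Note that only the first test survives the correction: 0.03 would have passed an uncorrected 0.05 threshold, but not the corrected 0.05/3 one.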
The example that Bland and Altman give of multiple independent significance tests is where test subjects given a medical treatment are disjointly subdivided into classes (for instance, young men, young women, old men, old women, etc.), and the treatment is tested for a significant improvement on each of these sub-classes. What is the analogy in information retrieval, say for test collection evaluation? Well, the model (odd though it might sound) is that topics are patients, and retrieval systems are treatments. Therefore, the independent-test analogy is if we divided our topics into multiple types---informational, navigational, and transactional for instance (thanks to Matthias Petri for this example)---and then tested whether our experimental system significantly outperformed the baseline on any of these query types.
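As a sketch of what that might look like in practice, assuming per-topic average precision scores paired between a baseline and an experimental run (the data, names, and layout below are invented for illustration), one paired t-test per query type, with the Bonferroni threshold applied across the types:

```python
# Hypothetical sketch: per-topic AP scores for a baseline and an
# experimental system, each topic labelled with a query type. One
# paired t-test per type; Bonferroni divides alpha by the number
# of types tested.
from scipy.stats import ttest_rel

scores = {
    # type: (baseline APs, experimental APs), paired by topic
    "informational": ([0.21, 0.35, 0.18, 0.40], [0.25, 0.41, 0.22, 0.44]),
    "navigational":  ([0.55, 0.60, 0.48, 0.52], [0.54, 0.66, 0.50, 0.58]),
    "transactional": ([0.10, 0.15, 0.12, 0.09], [0.11, 0.14, 0.16, 0.10]),
}

alpha = 0.05
k = len(scores)  # number of tests performed
for qtype, (base, exp) in scores.items():
    stat, p = ttest_rel(exp, base)
    print(f"{qtype}: p = {p:.3f}, "
          f"significant at {alpha}/{k} = {alpha/k:.4f}? {p < alpha/k}")
```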
The case of multiple independent tests is reasonably straightforward, but also not the most frequently encountered, at least in IR. The alternative situation is where there is a dependence between the tests. In IR, instances include using multiple effectiveness metrics, trying multiple parameter settings, and so forth. In such cases, the dependence arises in particular because the same topics (subjects, patients) are used in each test. This means that the dampening of p values should be less than in the independent case. As Bland and Altman put it, "the probability that two correlated variables both give non-significant differences when the null hypothesis is true is [...] greater than (1 − α)², because if the first test is not significant the second has a probability greater than 1 − α of also being not significant", so meeting the significance threshold on the second test is (taken in itself) stronger evidence than if the first test had not been made.
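A small simulation can illustrate the consequence: applying the Bonferroni threshold to correlated tests drives the family-wise error rate below the nominal level, that is, the correction is conservative. The setup below (two jointly normal test statistics under the null, with correlation ρ) is my own illustration, not from the paper:

```python
# Simulation sketch: two correlated z-tests under the null. As the
# correlation rho grows, the chance that *either* Bonferroni-corrected
# test fires drops below the nominal alpha, showing that Bonferroni
# over-corrects for dependent tests. Parameters are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, k, trials = 0.05, 2, 200_000
crit = norm.ppf(1 - (alpha / k) / 2)  # two-sided critical value at alpha/k

for rho in (0.0, 0.5, 0.9):
    cov = [[1.0, rho], [rho, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=trials)
    fwer = np.mean((np.abs(z) > crit).any(axis=1))
    print(f"rho = {rho}: family-wise error rate = {fwer:.4f}")
# rho = 0.0 gives roughly 0.049; rho = 0.9 gives noticeably less than 0.05.
```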
A third case is where two systems are compared on multiple collections. In a sense, this is similar to the multiple independent test case, since the topics are disjoint. But here we really want to go in the opposite direction: aggregating the different collections into a meta-analysis, to strengthen the power of the test. This is reasonably straightforward if the different collections use the same corpus (though differences in topic formation and assessment need to be considered). When the collections have systematic differences between them, the problem is more demanding. But given the insufficient power of the standard 50-topic TREC collection, a principled method for performing such meta-analyses would be a valuable addition to the empirical IR toolkit.
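One simple, if limited, option along these lines is to combine the per-collection p values with Fisher's method, which assumes exactly the independence that disjoint topic sets provide; scipy implements it as combine_pvalues. The p values below are invented for illustration:

```python
# Fisher's method combines independent p values into a single test
# statistic (-2 * sum of log p, chi-squared under the joint null).
from scipy.stats import combine_pvalues

# Hypothetical per-collection p values for system A vs system B,
# each from a separate 50-topic collection.
per_collection_p = [0.08, 0.11, 0.06]

stat, p_combined = combine_pvalues(per_collection_p, method="fisher")
print(f"combined p = {p_combined:.4f}")
# The combined test can reach significance even though no single
# collection does, because the evidence accumulates across collections.
```

Fisher's method only pools significance, though, not effect sizes; a fuller meta-analysis would weight per-collection effects by their variance, which is where the systematic differences between collections start to bite.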