At the most recent TIGER reading group, Mark Sanderson presented Bland and Altman's introduction to multiple significance tests and the Bonferroni method. The basic point is simple: if you keep trying different experiments, and testing each for significance, then eventually you will find significance by chance, even where no real effect exists. Therefore, if you are performing multiple significance tests, you need to adjust your $p$ values up.
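To get a feel for how quickly chance findings accumulate, here is a minimal Python sketch. It assumes the tests are independent and that every null hypothesis is true, in which case the probability of at least one spuriously significant result among $n$ tests at level $\alpha$ is $1 - (1 - \alpha)^n$. The function name `familywise_error_rate` and the level 0.05 are illustrative choices, not something taken from the presentation.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive among n_tests independent
    tests at level alpha, when every null hypothesis is true."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    # Each test avoids a false positive with probability (1 - alpha);
    # under independence these probabilities multiply.
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (1, 5, 10, 20, 50):
        print(f"{n:3d} tests: {familywise_error_rate(0.05, n):.3f}")
```

Under these assumptions, 20 tests at the 0.05 level already give roughly a 64% chance of at least one false positive.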
In determining the nature and amount of adjustment required, the crucial factor is the dependence between the multiple tests. The simplest case is where the tests are independent, in which case the Bonferroni adjustment can be applied. Put simply, this adjustment multiplies the achieved $p$ values by the number of tests performed. So, if you perform $n$ significance tests in parallel, and your required significance level is $\alpha$, then you need to get a $p$ value less than $\alpha / n$ on any one of these tests for it to be significant.
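The Bonferroni adjustment described above can be sketched in a few lines of Python. This is only an illustration: the function names `bonferroni_adjust` and `bonferroni_significant`, the default level of 0.05, and the example $p$ values are all made up for the purpose. Adjusted $p$ values are capped at 1, since a probability cannot exceed it.

```python
from typing import List, Sequence


def bonferroni_adjust(p_values: Sequence[float]) -> List[float]:
    """Multiply each p value by the number of tests, capping at 1.0."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


def bonferroni_significant(
    p_values: Sequence[float], alpha: float = 0.05
) -> List[bool]:
    """Flag the tests whose raw p value is below alpha / n."""
    n = len(p_values)
    return [p < alpha / n for p in p_values]


if __name__ == "__main__":
    raw = [0.004, 0.03, 0.2, 0.6, 0.012]  # made-up p values for illustration
    print(bonferroni_adjust(raw))
    print(bonferroni_significant(raw, alpha=0.05))
```

Comparing the raw $p$ value against $\alpha / n$ and comparing the adjusted $p$ value against $\alpha$ are two views of the same rule; the sketch provides both.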
The example that Bland and Altman give of multiple independent significance tests is where test subjects given a medical treatment are disjointly subdivided into classes (for instance, young men, young women, old men, old women), and the treatment is tested for a significant improvement on each of these sub-classes. What is the analogy in information retrieval, say for test collection evaluation? Well, the model (odd though it might sound) is that topics are patients, and retrieval systems are treatments. Therefore, the independent-test analogy would be to divide our topics into multiple types---informational, navigational, and transactional for instance (thanks to Matthias Petri for this example)---and then test whether our experimental system significantly outperformed the baseline on any of these query types.
The case of multiple independent tests is reasonably straightforward, but also not the most frequently encountered, at least in IR. The alternative situation is where there is a dependence between the tests. In IR, instances include using multiple effectiveness metrics, trying multiple parameter settings, and so forth; the tests are dependent because, in particular, the same topics (subjects, patients) are used. This means that the dampening of $p$ values should be less than in the independent case. As Bland and Altman put it, "the probability that two correlated variables both give non-significant differences when the null hypothesis is true is [...] greater than $(1 - \alpha)^2$, because if the first test is not significant the second has a probability greater than $1 - \alpha$ of also being not significant", so meeting the significance threshold in the second case is (taken in itself) stronger evidence than if the first test had not been made.
A third case is where two systems are compared on multiple collections. In a sense, this is similar to the multiple independent test case, since the topics are disjoint. But in this case, we really want to go in the opposite direction---aggregating the different collections into a meta-analysis, to strengthen the power of the test. This is reasonably straightforward if the different collections use the same corpus (though differences in topic formation and assessment need to be considered). When collections have systematic differences between them, the problem is more demanding. But given the insufficient power of the standard 50-topic TREC collection, a principled method for performing such meta-analyses would be a valuable addition to the empirical IR toolkit.
A good point. I have heard that many try to control the false discovery rate instead of relying on the Bonferroni method. Bonferroni is indeed too conservative.
Brad Efron's recent book, Large Scale Inference, has a good overview of frequentist and "empirical Bayes"* approaches to multiple hypothesis testing and controlling for false discovery rate. These are very popular among genomics researchers.
There are also truly Bayesian approaches to multiple comparison which work very well in practice. Here's a nice overview using hierarchical models:
http://www.stat.columbia.edu/~gelman/research/unpublished/multiple2.pdf
I go through an example for baseball hitting ability step by step in this blog entry:
http://lingpipe-blog.com/2009/11/04/hierarchicalbayesian-batting-ability-with-multiple-comparisons/
The "hospitals" example in the first volume of examples for BUGS is another easy to understand example.
-------------------------
* "Empirical Bayes" is not a Bayesian method, because it involves a point estimate of the prior rather than using the uncertainty in the estimate of the prior like a truly Bayesian approach. The term also misleadingly implies that it's somehow more empirical than standard Bayesian methods, though it's not.
Great, thanks for the pointers!