Social scientists are often accused of running studies that confirm the obvious, such as that people are happier on the weekends, or that having many meetings at work makes employees feel fatigued. The best response is that what seems obvious may not actually be true. That, indeed, is what we found in a recent experiment. We set out to confirm that giving relevance assessors more detailed guidelines would make them more reliable. We found it didn't.
The setup of our experiment was as follows. Two assessors were asked independently to re-review documents assessed in the TREC Legal Interactive task. These documents were divided into two batches. For the first batch, the assessors were given only the sentence-long topic statement; for the second, they were provided with the multi-page detailed guidelines written by the topic authority. Their assessments on each batch were then compared with the official TREC assessments, vetted by the same topic authority, which we took as the objective standard. This was done first in a trial experiment with 150 documents, then (on a different topic) in a full experiment, with 450 documents.
Our hypothesis was that the more detailed guidelines would lead to stronger agreement with the authoritative assessments, and between the experimental assessors, too. We were somewhat surprised when this did not happen in the trial experiment, and very surprised when it did not happen in the full one, either. The Cohen's kappa scores we observed were as follows:
| Batch | Full experiment | | | Trial experiment | | |
|-------|-------|-------|-------|-------|-------|-------|
|       | A v B | A v O | B v O | A v B | A v O | B v O |
(A and B being the experimental assessors, and O the official TREC assessments). There's not even an upwards trend here, let alone a significant improvement. And the confidence intervals on the changes are roughly ± 0.15 (see our paper for more details), so we can be confident that even if there was some underlying improvement that we've missed by unlucky sampling, it was not a major one.
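For readers unfamiliar with the agreement measure we report, here is a minimal sketch of how Cohen's kappa is computed between two assessors: observed agreement, corrected for the agreement expected by chance given each assessor's label frequencies. The relevance labels below are purely hypothetical illustration, not data from our experiment.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two assessors' label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from the marginal label frequencies.
    """
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of documents labelled identically.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each assessor's marginal rates per label.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[lab] * count_b[lab] for lab in set(a) | set(b)) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary relevance assessments (1 = relevant, 0 = not).
assessor_a = [1, 0, 1, 1, 0, 0, 1, 0]
assessor_b = [1, 0, 0, 1, 0, 1, 1, 0]
print(cohens_kappa(assessor_a, assessor_b))  # prints 0.5
```

Kappa of 1 means perfect agreement, 0 means no better than chance; values in between are what one typically sees in relevance assessment.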
It is interesting to speculate on the cause of this (lack of) effect: perhaps the detailed guidelines are simply too much information for the assessor to hold in mind, and confuse as much as they inform. We can only draw tentative conclusions from an experiment with only two assessors, on only two topics, especially since the results run counter to common sense.
But if the results are confirmed by further experiments, then the finding is not encouraging for the reliability of delegated manual review. Clearly, the topic statement is more ambiguous than the detailed guidelines. We would not, in an e-discovery production, place much confidence in a review where the reviewers were given a sentence-long description of the production request and left to figure it out for themselves. But (it would seem) giving them more detailed guidelines does not actually improve their reliability over this dubious baseline, at least if not combined with careful supervision. Another point for the machines?
We'll be presenting our poster at SIGIR 2012. If you're at the conference, please drop round and discuss our findings with us. We'd love to hear your opinion.