In my last post, I discussed an experiment in which we had two assessors re-assess TREC Legal documents with less and more detailed guidelines, and found that the more detailed guidelines did not make the assessors more reliable. Another natural question to ask of these results, though not one the experiment was directly designed to answer, is how well our assessors compared with the first-pass assessors employed for TREC, who for this particular topic (Topic 204 from the 2009 Interactive task) happened to be a review team from a vendor of professional legal review services. How well do our non-professional assessors compare to the professionals?
To answer this question, I'll take the official TREC qrels as gold standard, as Maura Grossman and Gord Cormack do in their paper comparing technology-assisted with manual review. These qrels are derived from the first-pass assessments after alleged errors have been appealed by participants and adjudicated by the topic authority (see the TREC 2009 Legal Track overview for more details). This topic authority is also the author of the detailed relevance criteria used by the TREC assessors and (in the second batch) by our experimental assessors in performing their assessments. We'll measure reliability using mutual F1 score (also known as positive agreement) between the TREC or experimental assessors on the one hand, and the official assessments on the other. (Cohen's κ is an alternative measure, but I prefer mutual F1 for the current purposes as it is more easily interpretable in terms of retrieval effectiveness. A discussion of the two measures can be found in Section 3.3.2 of our draft survey paper on Information Retrieval for E-Discovery.)
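For readers who want to see the two measures side by side, here is a minimal sketch, assuming binary relevance labels. The function names and the toy counts are illustrative only; they are not taken from the experiment or from any TREC tooling.

```python
def contingency(assessor, official):
    """Tally the 2x2 agreement table for two parallel lists of binary labels
    (truthy = relevant). Returns (both, only_assessor, only_official, neither)."""
    both = only_assessor = only_official = neither = 0
    for a, o in zip(assessor, official):
        if a and o:
            both += 1
        elif a:
            only_assessor += 1
        elif o:
            only_official += 1
        else:
            neither += 1
    return both, only_assessor, only_official, neither


def mutual_f1(both, only_assessor, only_official, neither=0):
    """Positive agreement: 2 * both / (2 * both + disagreements).
    Documents that both sides call irrelevant play no part."""
    denom = 2 * both + only_assessor + only_official
    return 2 * both / denom if denom else 0.0


def cohens_kappa(both, only_assessor, only_official, neither):
    """Observed agreement corrected for the agreement expected by chance."""
    n = both + only_assessor + only_official + neither
    observed = (both + neither) / n
    p_a = (both + only_assessor) / n   # assessor's rate of "relevant"
    p_o = (both + only_official) / n   # official rate of "relevant"
    expected = p_a * p_o + (1 - p_a) * (1 - p_o)
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0


# Toy data: ten documents, hypothetical labels.
assessor = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
official = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
table = contingency(assessor, official)
print(table, mutual_f1(*table), cohens_kappa(*table))
```

Because jointly irrelevant documents never enter the mutual F1 calculation, the score reads like a retrieval effectiveness figure; Cohen's κ, by contrast, shifts as the number of jointly irrelevant documents changes.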
There's also a slightly tricky question about sampling rates and gold standard reliability to consider. (Readers not interested in statistical niceties can skip this and the next paragraph.) The sample for our experiment was designed to produce an even balance of officially relevant and officially irrelevant documents, since this provided the greatest statistical power for the comparison we were making (in that case using Cohen's κ). As it happens, this sampling also increases the concentration of appealed documents. So directly comparing results on only the experimental sample would be unfair to the TREC assessors. Instead, we perform a sample-based extrapolation from the experimental assessments to the sample drawn for assessment at TREC (though not to the full corpus, since this would increase estimate variance without fundamentally changing estimate accuracy).
A second nicety is in the handling of what I call the "bottom stratum", of documents retrieved by no TREC participant. This stratum was very lightly sampled from the TREC to the experimental sample (as it was from the corpus into the TREC sample), so (in extrapolation) each document here has a big impact upon comparisons of effectiveness. At the same time, no team had an incentive to (and no team did) appeal assessments of irrelevance in this stratum, so any false negatives (actually relevant documents assessed as irrelevant) will have been missed. Thus, including the bottom stratum potentially biases the comparison in favour of the TREC assessors. I therefore report results both with and without the bottom stratum considered.
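To make the two niceties concrete, here is a rough sketch of a sample-based extrapolation, assuming each sampled document is weighted by the inverse of its stratum's sampling rate. The stratum names, sampling rates, and document counts are invented for illustration and are not the figures from our experiment.

```python
def weighted_f1(docs, rates, skip=()):
    """Extrapolated F1 against the official labels.

    docs  : iterable of (stratum, assessor_relevant, official_relevant)
    rates : stratum -> fraction of that stratum drawn into the sample
    skip  : strata to leave out of the comparison
    Each sampled document stands for 1 / rate documents in the target set.
    """
    tp = fp = fn = 0.0
    for stratum, assessed, official in docs:
        if stratum in skip:
            continue
        weight = 1.0 / rates[stratum]
        if assessed and official:
            tp += weight
        elif assessed:
            fp += weight      # assessor says relevant, official says not
        elif official:
            fn += weight      # assessor missed an officially relevant doc
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical sample: a well-sampled stratum and a very lightly sampled one.
rates = {"retrieved": 0.5, "bottom": 0.01}
docs = (
    [("retrieved", True, True)] * 8
    + [("retrieved", True, False)] * 1
    + [("retrieved", False, True)] * 1
    + [("retrieved", False, False)] * 10
    + [("bottom", True, False)] * 1      # a single disagreement...
    + [("bottom", False, False)] * 9
)

print(round(weighted_f1(docs, rates, skip={"bottom"}), 2))  # 0.89
print(round(weighted_f1(docs, rates), 2))                   # 0.24
```

In this toy setup, one disagreement in the lightly sampled stratum carries the weight of a hundred documents, which is why a handful of bottom-stratum judgments can dominate the extrapolated score.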
The statistical prolegomena done with, here are our results. The original experiment involved three treatments: a first batch with the topic statement only; a second batch with the detailed criteria of relevance; and then both batches jointly re-assessed by both assessors in a (mostly successful) attempt to reach agreement on relevance. Below, we report mutual F1 scores with the official qrels for each batch. The TREC assessments were not (and cannot be) divided into the same treatments, so only the one score is reported for all three treatments. (Put another way, each batch provides an estimate on the full evaluation sample under different treatments.) The TREC assessors worked independently to the topic authority's detailed guidelines; this therefore is the fairest comparison with our experimental assessors. I also show the unextrapolated score of the strongest participant team for this topic (the industry team included in Grossman and Cormack's analysis).
Here then, finally, are the results:
| Stratum | Batch | Exp-A | Exp-B | | |
|---|---|---|---|---|---|
| With bottom stratum | Topic | 0.83 | 0.22 | ↓ | ↓ |
| Without bottom stratum | Topic | 0.83 | 0.68 | ↓ | ↓ |
We can see that, if the bottom stratum is excluded, both our assessors (Exp-A and Exp-B) outperform the TREC assessor. In each of the two batches, however, Exp-B judged 2 officially irrelevant documents from the bottom stratum to be relevant, so if this stratum is included in the comparison, Exp-B's score is depressed to below that of the TREC assessor (note the strong sampling effect in this stratum), though assessor Exp-A's score and the joint-review score are unaffected. It was noted above that including the bottom stratum potentially biases the comparison in favour of the TREC assessors, though it should be said that for the four documents in question, our assessors jointly concluded that they weren't in fact relevant (and, having viewed them myself, I concur).
I've not checked the above results for statistical significance. Given the sampling complexities involved, that would be a tricky thing to do. My suspicion is that the differences without the bottom stratum may be significant, but with the bottom stratum likely are not, due to the much greater sampling variability in the latter case. More to the point, though, this is only a single topic, in an experiment not directly set up to measure the comparison made here (and where even the official assessments might contain some remaining errors or ambiguities); even a finding of statistical significance would not show that our assessor A is "provably more reliable" than the TREC assessors. There are also differences of conditions involved; our assessors worked three-hour shifts, assessing around 60 documents (assessor A) to 30 documents (assessor B) an hour, which are less fatiguing conditions than perhaps are typical in large-scale manual review (though the conditions of the professional TREC review are not specified in the TREC overview document). The conclusion that can be reached, though, is that our assessors were able to achieve reliability (with or without detailed assessment guidelines) that is competitive with that of the professional reviewers -- and also competitive with that of a commercial e-discovery vendor.
Time to meet our crack reviewers:
At the time of the experiment, Marjorie and Bryan were high school seniors working in the E-Discovery lab as interns four mornings a week. (Since then they have become ex-high-school students waiting for their summer holidays to end and their studies at the University of Maryland to commence.) They have no legal training, and no prior e-discovery experience, aside from assessing a few dozen documents for a different TREC topic as part of a trial experiment. They performed assessments on TIFF files, displayed on rather underpowered laptops that couldn't fit the full TIFF onto the screen and tended to crash every now and then. They worked independently and without supervision or correction, though one would be correct to describe them as careful and motivated.
All of this raises the question that is posed in the subject of this post: if (some) high school students are as reliable as (some) legally-trained, professional e-discovery reviewers, then is legal training a practical (as opposed to legal) requirement for reliable first-pass review for responsiveness? Or are care and general reading skills the more important factors?
As it happens, the same question has been addressed in a couple of other recent studies, which reached similar conclusions to our own, and thus add some support to our tentative finding, namely: that legal training does not confer expertise in document review. In A User Study of Relevance Judgments for E-Discovery (Proc. ASIST, 2010), Jianqiang Wang and Dagobert Soergel had four law school and four library and information studies (LIS) students re-assess documents from four TREC topics. In an exit interview, the law students stated that their legal training and experience were helpful in performing their assessments, whereas the LIS students felt such experience would not have helped much. The LIS students appear to have been correct: there was little or no difference between the two assessor groups in reliability or speed. In another study, Legal Discovery: Does Domain Expertise Matter? (Proc. ASIST, 2008), Efthimis N. Efthimiadis and Mary A. Hotchkiss had six groups of MLIS students build a run for the TREC 2007 interactive task. Two of these groups consisted of law librarianship students with JD degrees and professional experience as lawyers or legal searchers, but they achieved no greater reliability than the remaining four groups, who had no legal experience.
Some TREC 2009 topics were initially assessed by law students, others by professional reviewers. The difference in expertise level can be considered here, though with two caveats: first, we're comparing assessors on different topics, and variability in the assessment difficulty of topics appears to be high (see Table 3.5 of our draft survey); and second, we don't know how the professional reviewers conducted their reviews, and there could be differences in process as well as raw expertise. Of the five topics in the TREC 2009 Legal Interactive task that were appealed with sufficient thoroughness for us to be confident that most errors were found (see my SIRE 2011 paper, Re-examining the Effectiveness of Manual Review, for a fuller discussion), three were initially assessed by professional review teams, two by volunteer law students. The picture of the benefits of expertise is mixed for this dataset. If we consider only the documents actually assessed (without making a variance-increasing extrapolation to the corpus), then the law students seem roughly as reliable as the professional reviewers (Table 4 of Re-examining). If we do extrapolate to the full population, though, one of the professional review teams does outperform the two student teams and the other two professional teams in both reliability (ibid., Table 2) and consistency (ibid., Figure 2) -- though perhaps this is explained by good process (and care in recruitment), rather than greater legal training.
Now, even if we take at face value the finding that legal training does little to improve reliability in first-pass document review, this does not mean that there are not systematic differences in the skills of reviewers (some will be more careful and attentive readers than others, for instance), or the accuracy of professional review teams (depending upon the quality of their processes and their care in recruiting said skilled reviewers). I'd not recommend rounding up high school students on summer break to save review costs. And of course legal training and expertise are required to frame, monitor, and interpret the production. Our finding might suggest that first-pass document review is not practice of the law (since it apparently does not employ legal skills), and therefore that using non-lawyers to perform it should not be considered unauthorized practice of the law, but I'll offer a bottle of wine to the lawyer brave enough to argue that in court on the basis of this blog post. Perhaps in the construction of e-discovery test collections, we do not have to insist that all reviewers have legal training, though again this depends upon the acceptability of this to the practicing e-discovery community. The research implications are clearer: we don't know what makes a reliable reviewer; legal training doesn't appear to play a major part, which is a negative outcome; but that in compensation means that findings on the conception and perception of relevance from outside the e-discovery domain can with more confidence be applied within it.