How accurate can manual review be?

One of the chief pleasures for me of this year's SIGIR in Beijing was attending the SIGIR 2011 Information Retrieval for E-Discovery Workshop (SIRE 2011). The smaller and more selective the workshop, it often seems, the more focused and interesting the discussion.

My own contribution was "Re-examining the Effectiveness of Manual Review". The paper was inspired by an article from Maura Grossman and Gord Cormack, whose message is neatly summed up in its title: "Technology-assisted review in e-discovery can be more effective and more efficient than exhaustive manual review".

Grossman and Cormack compare the reliability of manual and technology-assisted review through an analysis of the assessment results for the Interactive Task of the TREC 2009 Legal Track. In the Interactive Task, runs produced by participating teams (using various degrees of automation) had documents sampled from them for initial relevance assessment by student or professional assessors; unretrieved documents were also sampled, though much more sparsely. Teams could appeal the initial assessments to be adjudicated by a figure called the Topic Authority, an experienced e-discovery lawyer who had advised the teams in their production and directed the assessors in their review.

Grossman and Cormack's insight is to view this assessment process as an experiment comparing manual and technology-assisted review. The post-adjudication relevance assessments are treated as the gold standard; the assessors for each topic as the manual review team; and the documents they assessed as relevant (extrapolated based upon sampling) as the productions resulting from exhaustive manual review. The effectiveness of this manual review is then compared with the technology-assisted submissions of participating teams against the adjudicated gold standard.

Grossman and Cormack find that the best technology-assisted productions are as accurate as the best manual review team, and more accurate than the majority of manual review efforts. The following table (adapted from Table 7 of Grossman and Cormack) summarizes their results:

Topic   Team                   Rec    Prec   F1
t201    System A               0.78   0.91   0.84
        TREC (Law Students)    0.76   0.05   0.09
t202    System A               0.67   0.88   0.76
        TREC (Law Students)    0.80   0.27   0.40
t203    System A               0.86   0.69   0.77
        TREC (Professionals)   0.25   0.12   0.17
t204    System I               0.76   0.84   0.80
        TREC (Professionals)   0.37   0.26   0.30
t207    System A               0.76   0.91   0.83
        TREC (Professionals)   0.79   0.89   0.84
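
For anyone who wants to check the arithmetic, F1 is simply the harmonic mean of recall and precision. The following trivial sketch (my own illustrative Python, not anything from Grossman and Cormack) reproduces a couple of rows of the table:

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

print(round(f1(0.78, 0.91), 2))  # t201, System A            -> 0.84
print(round(f1(0.76, 0.05), 2))  # t201, TREC (Law Students) -> 0.09
```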

My paper at SIRE was in part an attempt to pick holes in Grossman and Cormack's analysis; their conclusions proved to hold up pretty well to scrutiny. One possible criticism is that the effectiveness figures quoted above are extrapolated from an unequal sample, and that a small number of appeal decisions on sparsely sampled unretrieved documents have a disproportionate effect on measures of effectiveness. This criticism doesn't have a lot of statistical purchase -- the sample was random and the extrapolations are correct, unless one thinks there is likely to be greater bias in the unretrieved segment -- but practical considerations give it some more bite: a Boolean pre-filter (itself, admittedly, a blunt instrument) would reduce the size and impact of this segment; and errors here are mostly false positives, which would (presumably) be picked up in a second pass. However, even if you calculate effectiveness only on the documents actually sampled and assessed (Table 4 of my paper), the best technology-assisted system still comes out on top for three topics out of five, and is roughly even for the other two.
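
To see why a handful of judgments in the sparsely sampled segment can swing the estimates, consider a toy stratified extrapolation. The strata sizes and sampling rates below are invented for illustration; they are not the actual TREC 2009 design:

```python
# (population size, sample size, relevant documents found in the sample)
retrieved   = (20_000, 2_000, 1_500)  # densely sampled segment
unretrieved = (800_000,  400,     2)  # sparsely sampled segment

def estimated_relevant(population, sample, relevant_in_sample):
    # Each sampled document stands in for population / sample documents.
    return relevant_in_sample * population / sample

rel_ret = estimated_relevant(*retrieved)      # 15,000
rel_unret = estimated_relevant(*unretrieved)  # 2 * 2,000 = 4,000

recall = rel_ret / (rel_ret + rel_unret)
print(round(recall, 3))  # 0.789; flip a single judgment in the sparse
                         # stratum and the estimate moves to ~0.88 or ~0.71
```

Each judgment in the sparse stratum stands in for thousands of documents, which is why a few appeal decisions there carry so much weight in the extrapolated figures.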

A second criticism might be that the appeals process could be incomplete: not all assessor errors may be appealed, and therefore some will remain unfound. But this incompleteness would in general boost the apparent effectiveness of the manual review teams more than that of the technology-assisted productions. (That said, I do suspect that absolute recall scores are overstated for both manual and technology-assisted review, as false negatives in the unretrieved segment would not get appealed.)

The remainder of my paper examines what TREC 2009 tells us about assessor reliability and variability, and whether the TREC setup really is a fair depiction of manual review. One of my findings is that there is great variability in the reliability of different reviewers on the one team, as the following figure shows:

[Figure: recall and precision of individual assessors on a single topic, measured against the adjudicated assessments, with the technology-assisted production marked by a red cross.]

Each circle represents the precision and recall of a single assessor, measured against the adjudicated assessments; the red cross marks the effectiveness of the technology-assisted production. (Measures are extrapolated from the sample to the population.) The best reviewers have a reliability at or above that of the technology-assisted system, with recall at 0.7 and precision at 0.9, while other reviewers have recall and precision scores as low as 0.1. This suggests that using more reliable reviewers, or (more to the point) a better review process, would lead to substantially more consistent and better quality review. In particular, the assessment process at TREC provided only for assessors to receive written instructions from the topic authority, not for the TA to actively manage the process by (for instance) performing an early check on assessments and correcting misconceptions of relevance or excluding unreliable assessors. Now, such supervision of review teams by overseeing attorneys may (regrettably) not always occur in real productions, but it should surely represent best practice.

My SIRE paper has since generously been picked up by Ralph Losey in his blog post, Secrets of Search -- Part One. Ralph's post first stresses the inadequacy of keyword search alone as a tool in e-discovery. Predominant current practice in e-discovery is to create an initial Boolean query or queries, often based upon keywords negotiated between the two sides; run that query against the corpus; and then subject the documents matched by the query, and only those documents, to full manual review. The query aims at recall; the review, at precision. However, previous work by Blair and Maron (1985) -- almost three decades ago now -- found that such Boolean queries typically achieve less than 20% recall, even when formulated interactively. (Note, in passing, that the concepts of "Boolean search" and of "keyword-based ranked retrieval" are frequently confounded under the term "keyword search" in the e-discovery literature.) Ralph also questions whether the assessment process followed at TREC really represents best, or even acceptable, manual review practice, due to the lack of active involvement of a supervising attorney.

The most interesting part of Ralph's post, and the most provocative, both for practitioners and for researchers, arises from his reflections on the low levels of assessor agreement, at TREC and elsewhere, surveyed in the background section of my SIRE paper. Overlap (measured as the Jaccard coefficient; that is, size of intersection divided by size of union) between the relevant sets of different assessors is typically found to be around 0.5, and in some (notably, legal) cases can be as low as 0.28. If one assessor were taken as the gold standard, and the effectiveness of the other evaluated against it, then these overlaps would set an upper limit on F1 score (harmonic mean of precision and recall) of 0.66 and 0.44, respectively. Ralph then provocatively asks, if this is the ground truth on which we are basing our measures of effectiveness, whether in research or in quality assurance and validation of actual productions, then how meaningful are the figures we report? At the least, we need to normalize reported effectiveness scores to account for natural disagreement between human assessors (something which can hardly be done without task-specific experimentation, since it varies so greatly between tasks). But if our upper bound F1 is 0.66, then what are we to make of rules-of-thumb such as "75% recall is the threshold for an acceptable production"?
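
The arithmetic behind those ceilings is worth spelling out. If one assessor's relevant set A is taken as the gold standard and the other's set B is scored against it, then F1 = 2|A ∩ B| / (|A| + |B|); since |A| + |B| = |A ∪ B| + |A ∩ B|, this works out to F1 = 2J / (1 + J), where J is the Jaccard overlap. A minimal check (my own sketch):

```python
def f1_given_jaccard(j):
    """F1 of one assessor's relevant set scored against the other's,
    as a function of their Jaccard overlap J = |A n B| / |A u B|.
    F1 = 2|A n B| / (|A| + |B|), and |A| + |B| = |A u B| + |A n B|,
    hence F1 = 2J / (1 + J)."""
    return 2 * j / (1 + j)

print(f1_given_jaccard(0.5))   # 0.666..., the ~0.66 ceiling quoted above
print(f1_given_jaccard(0.28))  # 0.4375,   the ~0.44 ceiling quoted above
```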

These are sobering thoughts; but there are perhaps reasons not to surrender to such a gloomy conclusion. We need to remember that the goal of the production (immediately, at least) is to replicate the conception of relevance of a particular person, namely the supervising attorney (or topic authority), not to render a production that all possible assessors would agree upon. And some productions appear surprisingly good at replicating this conception. Consider the effectiveness scores reported for the top-performing systems (and the most reliable manual review team) in the above table; these systems are achieving F1 scores from the high 70s to the mid 80s, as evaluated against the TA's conception of relevance (as reflected in the adjudicated assessments). That is, while two assessors or producers might independently come to differing conceptions of relevance, it seems that if one producer or (attentive) assessor is asked to reproduce the conception of relevance of another, they are able to do a reasonably faithful job. (Mind you, we are eliding the possibility here that the TA is confused about their own conception of relevance. Certainly, their conception of relevance can shift over time; and the detailed criteria of relevance against which assessments and appeals were made were not formulated at the beginning of the production process, but after tens of hours of conception-clarifying interactions with the participating teams.)

A distinction also needs to be drawn between disagreement arising from differing conceptions of relevance, and disagreement arising from assessor error (that is, assessors making decisions through inattention or incompetence that they would not make if paying full attention and possessing the requisite skills). The distinction between disagreement and error is important to the current discussion, because error can be corrected (albeit at some expense), while disagreement is largely irreducible in the typical assessment setting.

Maura Grossman and Gord Cormack address just this question in another paper, "Inconsistent assessment of responsiveness in e-discovery: difference of opinion or human error?" (DESI, 2011). In that paper, they re-reviewed documents from TREC 2009 in which the topic authority and the first-pass assessor disagreed. Re-reviewed documents were rated inarguably responsive, arguable, or inarguably non-responsive, based upon the topic authority's responsiveness guidelines. They then compare these ratings with the first-pass and official assessments, to divide them into cases in which the TA's adjudication was inarguably correct (and the assessor has made an inarguable error); cases in which responsiveness was arguable; and cases in which the TA was inarguably incorrect (and the assessor inarguably correct). While rates vary between topics, on average they found the TA's adjudication to be inarguably correct in 88% to 89% of cases. In other words, Grossman and Cormack conclude that the great majority of assessor errors are due to inattention, or to an inability properly to comprehend either the relevance criteria or the document. And since some 70% of relevant assessments were not contested, a rough estimate of achievable overlap would set it at around 95%.

Meeting the supervising attorney's conception of relevance, though, is not the whole picture. The production ultimately has to be accepted as reasonable by the opposing side and by the judge, where there may be a genuine difference in conception of relevance, irreducible by careful weeding out of errors. Here, it is interesting to reflect that what Grossman and Cormack were able to do was to recreate the TA's conception of relevance from the criteria guidelines, even for contested documents. That is, the TA's conception of relevance was adequately externalizable, and in that sense objective (by which I mean that, given the guidelines, one no longer needed to refer to the TA's subjective reaction to individual documents in order to make reasonably accurate relevance assessments, though of course there will always be particular cases which the guidelines don't cover). (Unfortunately, as I've since learnt from Maura and Gord, they were aware at the time of their re-review what the TA's adjudication was. It would be interesting to rerun the experiment blind, without this knowledge, to see how accurately experienced, careful assessors are able to recreate the original conception.)

Now, such explicit criteria or guidelines are not (to my understanding) typically created where predictive coding (that is, text classification, usually with active learning) is employed; instead, the supervising attorney conveys their conception of relevance by directly labelling documents selected by the learning algorithm. As Gord Cormack has suggested to me, we can consider the supervising attorney's conception of relevance as a (developing) latent trait, which can be externalized on the one hand as a set of relevance labels, or on the other as a list of relevance criteria. And I'm assured by at least one prominent and technically-savvy e-discovery practitioner that the coherence of relevance and quality of error detection that such predictive coding systems offer is such as to make explicit relevance criteria (let alone full manual review) redundant. I'd be interested to see this last claim verified empirically (and I've got an idea of how one might do so).
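
For readers unfamiliar with the mechanics, here is a minimal sketch of the kind of active-learning loop described above, using scikit-learn with simple uncertainty sampling. The function, parameters, and choice of learner (logistic regression over tf-idf features) are my own illustrative assumptions, not a description of any actual predictive coding product:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def active_learning_review(documents, label_fn, seed_ids, rounds=10, batch=20):
    """Illustrative uncertainty-sampling loop. label_fn(i) returns 1 if the
    supervising attorney judges document i relevant, 0 otherwise; the seed
    set should contain examples of both classes."""
    X = TfidfVectorizer().fit_transform(documents)
    labelled = {i: label_fn(i) for i in seed_ids}   # attorney labels the seeds
    for _ in range(rounds):
        ids = list(labelled)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[ids], [labelled[i] for i in ids])
        probs = clf.predict_proba(X)[:, 1]          # P(relevant) for every doc
        # Ask the attorney to label the documents the model is least sure about.
        pool = [i for i in range(len(documents)) if i not in labelled]
        for i in sorted(pool, key=lambda i: abs(probs[i] - 0.5))[:batch]:
            labelled[i] = label_fn(i)
    return clf, probs   # final model and relevance scores for ranking or cut-off
```

The point of the sketch is simply that the attorney's conception of relevance enters the process only through the labels supplied on machine-selected documents; nothing like a written statement of relevance criteria need ever be produced.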

But even if predictive coding makes stated criteria of relevance functionally redundant, how is the conception of relevance employed in the production conveyed to the opposing side and justified to the court -- if not routinely (the presumption of good faith in such matters still pertaining), then at least when contested? And what is the basis upon which disputes over whether a certain document was unreasonably withheld or produced are resolved? And, finally, how are we to proceed if we doubt the competence or the good-will of the lawyer training the machine? Ralph Losey has argued to me that this assurance comes through transparency and cooperation between the requesting and producing party. Perhaps; but how much visibility does the requesting party have of the producing party's negative assessments? The vision of an externalizable and verifiable description of a conception of relevance still seems to me highly attractive.

Edit 2012-05-01: added link for Grossman and Cormack's DESI IV paper

14 Responses to “How accurate can manual review be?”

  1. Hello, I quoted the first two paragraphs but corrected the link to Grossman and Cormack to: http://jolt.richmond.edu/v17i3/article11.pdf.

    Fascinating work by all involved! Thanks for the post!

    Patrick

  2. "And since some 70% of relevant assessments were not contested, a rough estimate of achievable overlap would set it at around 95%."

    It has to be remembered that this was a qualitative study.

    That aside, of real significance was the stated overall recall accuracy rate of the first-pass reviewers after TA and post-hoc analysis. The paper doesn't break out how the 90% post-hoc agreement broke down (NR or RN), but that puts the recall error rate (coded NR but actually R) of first-pass reviewers at ~1.8% to 2.3%, and the recall error rate after TA (second-pass) review at ~0.5%. And recall rate is the one that counts; recall is defensibility.
    Of course real second-pass reviewers wouldn't re-code in the real world, but if they random-sampled the NR pile, the error rate could approach the lower percentage.

    It would really be interesting to see the recall error rates if graded relevance levels were used.

  3. william says:

    Gerard,

    Hi! I'm not fully understanding your analysis. Do you have a more fully worked version somewhere?

  4. William,

    Sorry,
    Just saw this.

    From my read of the qualitative study:
    Only ~1300 recall disagreements (Table 1) arose between assessors (1st level review) and TREC participants (~co-assessors) out of ~42000. That’s only a ~3% “recall error rate”, the recall error rate being the rate at which assessors incorrectly coded relevant documents as non-relevant.

    The study states that ~85% of the participant objections were successful, bringing the assessor error rate down to about 2.4%.

    The study then states (p.3,5,9) that based upon a sample of the appealed documents 10% of TA assessments were overturned by the author. The study, as far as I could determine, did not state the relative proportion of original NR to R v. original R to NR that made up the set of document codes that were overturned. Therefore, I calculated a rough % range based upon the two extreme outcomes of max(R) overturned, and max(NR) overturned. That put it at a rough 1.8%-2.3% first pass error rate.

    I then calculated that based upon the author’s assessment of the outcome of the combined assessor/TA NR assessment (TA overall error rate ~10%) the combined assessor/TA process recall error rate probably drops to under 1%.

    Moreover, the study-assessor/TREC-assessor non-relevant coding agreement rate appears to be ~97%.

    I caveated all of this by saying that the test was qualitative and not designed to measure recall error rate. However anecdotal, it does fly in the face of the outcomes of the sparse research conducted in this area.

  5. gvc says:

    Gerard,

    Your math is incorrect. The recall of the first-pass assessors is calculated in "Technology Assisted Review in E-Discovery can be More Effective and More Efficient than Manual Review," which is, unlike the paper you cite, a quantitative study. The recall rates of the TREC first-pass assessments range from 40% to 80%.

    The flaw in your calculation is to assume that recall is equal to 100% minus what you call "recall error rate." It simply isn't.

    Gordon

  6. Gordon,

    I was observing that assessors, on the first pass, did a very good, not bad, job, as part of manual review, in identifying a large part of a small collection of relevant documents. In reviews, the results of initial QCs give good insight into initial overall "competency" at identifying relevant items.

    It seems to have been overlooked that in competent manual reviews, these initial assessments represent only the first step in first-pass review. Their errors do not define the manual review recall rate, and I believe it’s problematic to denominate their component error rate as the overall recall rate of a reasonable manual review. I’d argue that it’s not.

    Clearly, “recall error rate”, in hindsight, was a less-than optimal choice to describe this metric.

    While I agree that your “Technology” paper described the recall rate for assessors under the rules of the project, I believe that the Technology paper explicitly conflates assessor recall rate with manual review recall rate to draw an unsupported conclusion about the comparative performance of technology-assisted reviews -- a conclusion that by now seems to have been accepted judicially as settled fact. I’d argue that it isn’t (and frankly shouldn’t be the litmus test).

    In the industry, competent manual reviews employ quality control procedures to mitigate error. Those that employ best practices include very good QA. This drives down error rates, as evidenced in the extreme form by the effect of participant re-reviews in the TREC project.

    Yet, Table 7* (Technology paper) submits the very worst manual rates (rates that would result from an incompetent manual review, devoid of any QC amelioration) and compares them to the recall rates generated by the best efforts of the technology-assisted approaches, replete with iterative refinement and QC processes, and benefitting from free rein to expend resources (the Technology report is devoid of any metrics about labor or machine usage, although one approach seems to have expended almost as much manual review resource as the manual review itself).

    I'd argue that it would’ve been more valuable to compare TREC manual review “best efforts” against technology-assisted best efforts. Meaning, for comparison purposes, calculate and use the manual recall rates as indicated after corrections were made by the TA. Table 1 of the “Inconsistent Assessment” paper establishes that, measured post participant input, the manual review recall rate exceeded 95%. (Note: it’s my assumption based on statements made in the papers that the Inconsistent Assessment paper and the Technology paper utilize the same sample set of documents, and that Table 1 of the former (with 205 and 206 added) is a more granular description of Table 7 of the latter. If not true, please advise). This was the main conclusion of my posts above.

    Therefore, the Technology paper describes a recall rate, but not the recall rate of a "manual review". In that sense, its conclusion about recall rates for manual reviews lacks research validity. It does not measure what it claims to measure.

    As an aside, the overall assessor/participant agreement levels as indicated in Table 1 of the Inconsistent Assessment paper also seem to contradict the prior findings cited in the paper. They range from ~83% to 100%. The NR agreement rate is even higher.
    -G

    *Also muddying the waters is the fact that the two topics (205 and 206) for which manual review recall rates were very high were not utilized for a technology review comparison.

  7. gvc says:

    Gerard,

    The TREC assessors identified between 40% and 80% of the relevant documents. The professional reviewers used their normal QC processes.

    No QC process, short of a complete redo, is going to improve substantially on the recall rate of the first-pass review. How else would one identify the missed documents among the sea of 800,000 truly non-relevant documents?

    The reason that TREC gold standard was able to improve on the first-pass review was that a handful of independent teams each reviewed the entire dataset. That is, there were several independent "redos" so those 800,000 documents were scoured for relevant ones by several TREC participating teams, and those that were found were brought to the attention of the topic authority.

    The TREC first-pass reviewers examined only a sample of the documents in the collection. Statistical inference was used to extrapolate how well they would have done, had they coded all 850,000 documents. The TREC participants, on the other hand, coded all 850,000 documents, but examined only between several thousand and thirty thousand of them -- an average of fifty times fewer documents than would have been examined in a full manual review. The number of hours expended by the Waterloo team is reported in the paper to be 30 hours per topic, on average.

    Your suggestion that the TREC gold standard represents a manual best effort is preposterous. There's nothing manual about it: it incorporates the results derived from several technology-assisted reviews.

    I am not at all sure what you mean by "the overall assessor/participant agreement levels as indicated in Table 1." That table indicates appeals and success rates and I don't see any numbers ranging from 83% to 100%.

    While I agree strongly that it is valuable to repeat scientific experiments so as to confirm their findings, I believe that all the evidence points to the truth of the proposition that human reviewers aren't that good, and that machine learning and other methods can help humans to achieve higher recall, much higher precision, and much much higher efficiency, than can be achieved without.

    Gordon

  8. Gordon,
    I’d note that according to Table 1 (Inconsistent Assessment), assessors retrieved 95% of relevant documents for Topic 205 and 100% for Topic 206.

    That aside, from your explanation: “The professional reviewers used their normal QC processes.” I’m not sure who the professional reviewers are that you refer to. Are they the assessors? More fundamentally, and this is part of my objection: what were “normal QC processes” defined as, who conducted them and who defined them? As I’m sure you know, there is great variation in review QC practices (this may be the real objection to considering manual reviews the “gold standard”.) If there were internal QC efforts involving human supervisors and human reviewers, were these procedures and results documented?

    From your description ( and thanks for laying it out in a bit more detail), I have to assume that participants were actually technology solution providers who applied TAR methods (machine and human?) to the 800k documents to gather up relevant documents for each topic. Seven subsets (of roughly 7k documents) of R/NR were then selected and given to assessors – roughly 49k documents. (Therefore “appeals” were actually the occasions where machine and assessor differed? No?)

    My “best effort” claim was hyperbole, and yes, preposterous. However, I’d argue that using the output of the assessors without any established QA processes as equivalent to a well-run manual review, one to be compared against a TAR review, is only slightly less. In well run manual reviews, there are QC processes, and they are important. There is also real-time feedback from supervisors to alert reviewers to both nuance and issues.

    I don’t accept your claim that QC would not improve human review accuracy. Certainly, the studies conducted did not address this issue; so, it stands as an assertion without proof. Moreover, there are techniques to detect coding errors that are significantly more effective than random sampling—admittedly these could be considered technology-assisted, but much less expensive than predictive coding processes.

    As for the agreement rate, I am basing it on Table 1. For example, of the 7377 documents that human assessors coded as non-responsive in T207, there were only 154 appeals, based upon “participants” coding those 154 documents as responsive. In only 123 of those cases did the human TA disagree with the original assessor. My interpretation is that this puts the assessor/participant disagreement rate for those 7377 documents coded by both at ~2% and the assessor/TA disagreement rate for those documents reviewed by both at ~1%. (For T207 R documents, it was ~4% and 2% respectively.) Unless we are to concede that both man and machine were concurrently wrong very often, and/or that the TA is error-prone (and the 90% TA/third-tier assessment agreement rate in “Inconsistent Assessment” argues against this), the agreement rate over the ~49k documents would appear to be much higher than expected, given prior claims.

    That being said, I commend the work of people who made the effort to put some numbers around this. And I wholeheartedly agree with your second conclusory point; machine learning methods will add great value to document review, while lowering cost. I do not agree with the claim that the jury has reached a verdict on human review/TAR review comparative performance based on TREC 2009. And I certainly worry when I see the assertion being treated as a settled scientific matter by the courts. What’s more, I think that using it as some sort of litmus test for acceptance is misdirected, and a bit of a misdirection.

    The (hopefully) coming criteria for determining the acceptability of (any, including human) review methodology will revolve around robust metrics on performance, given the central goals of discovery.

    There, I believe machine-assisted methods will fare well.

  9. gvc says:

    I think we've come full circle. You are confusing accuracy with recall. It is easy to achieve high accuracy when there are few relevant documents. For TREC 2009, an assessor who simply said "not relevant" for every document would have achieved an accuracy of over 99%, because over 99% of the documents were not relevant. What's important is recall, the fraction of relevant documents correctly identified by the human review. The recall of the human reviewers was between 40% and 80%.

    No, the T205 and T206 percentages I stated above represent standard recall (retrieved relevant items/total relevant items), not accuracy. They are based upon the information described in Table 1 of the “Inconsistent Assessment” paper, which provides what appear to be the complete raw coding-appeal-TA determination results for T201-T207.

    It’s my assumption that those numbers represent the same data used to generate the recall percentages stated in Table 7 of the “Technology Assisted” paper, which included the lower assessor recall percentages (~25% to ~79%) of T201-204, and T207, but not the assessor recall percentages for T205 or T206.

    To be clear, according to Table 1 of the “Inconsistent Assessment” paper, for T205 there were 1631+50 = 1681 R documents. 1631 were correctly identified by the assessors initially; 50 of the 4289 documents that assessors coded as NR were successfully appealed to the TA by participants as actually being R documents. The assessors' recall was therefore 1631/1681, or 97%, no?
    In T206, there were 235 relevant items retrieved by assessors. None of the assessors' 6860 NR codes were challenged, establishing assessor recall of 100%, no?

    The explanation given in the paper as to why T205 and T206 were disregarded:
    “In designing this study, the Authors considered only the results of two of the eleven teams participating in TREC 2009, because they were considered most likely to demonstrate that technology-assisted review can improve upon exhaustive manual review. The study considered all submissions by these two teams, which happened to be the most effective submissions for five of the seven topics. The study did not consider Topics 205 and 206, because neither H5 nor Waterloo submitted results for them. Furthermore, due to a dearth of appeals, there was no reliable gold standard for Topic 206. The Authors were aware before conducting their analysis that the H5 and Waterloo submissions were the most effective for their respective topics. To show that the results are significant in spite of this prior knowledge, the Authors applied Bonferroni correction, which multiplies P by 11, the number of participating teams. Even under Bonferroni correction, the results are overwhelmingly significant.”

    Q: how was it determined that the “dearth of appeals” in T206 was not the result of assessor accuracy? After all, T205, not chosen by either subject machine coder, did receive appeals.

    Q: did the two machine providers obtain results for T205 and T206, but choose not to submit them?

  11. Kate says:

    I am sorry; this is a pretty basic question: what is the definition of recall versus precision?

  12. william says:

    Kate,

    Hi! Recall is the proportion of relevant documents that are retrieved; precision is the proportion of retrieved documents that are relevant. Say there are 1,000 relevant documents in the collection; the review process produces 2,000 documents; and 800 of the produced documents are relevant; then we have 80% recall (800 of the 1,000 relevant documents are retrieved) and 40% precision (800 of the 2,000 retrieved documents are relevant).
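
    In code form, the same arithmetic (a trivial sketch):

    ```python
    relevant_in_collection = 1000
    produced = 2000
    relevant_and_produced = 800

    recall = relevant_and_produced / relevant_in_collection  # 0.8
    precision = relevant_and_produced / produced             # 0.4
    ```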

  13. [...] supra, and Grossman & Cormack, Technology Assisted Review, supra. Also See William Webber, How accurate can manual review be? Again, the Jaccard index is formally defined as the size of the intersection, here 211, divided by [...]
