Do document reviewers need legal training?

In my last post, I discussed an experiment in which we had two assessors re-assess TREC Legal documents with less and more detailed guidelines, and found that the more detailed guidelines did not make the assessors more reliable. Another natural question to ask of these results, though not one the experiment was directly designed to answer, is how well our assessors compared with the first-pass assessors employed for TREC, who for this particular topic (Topic 204 from the 2009 Interactive task) happened to be a review team from a vendor of professional legal review services. How well do our non-professional assessors compare to the professionals?

To answer this question, I'll take the official TREC qrels as gold standard, as Maura Grossman and Gord Cormack do in their paper comparing technology-assisted with manual review. These qrels are derived from the first-pass assessments after alleged errors have been appealed by participants and adjudicated by the topic authority (see the TREC 2009 Legal Track overview for more details). This topic authority is also the author of the detailed relevance criteria used by the TREC assessors and (in the second batch) by our experimental assessors in performing their assessments. We'll measure reliability using mutual F_1 score (also known as positive agreement) between the TREC or experimental assessors on the one hand, and the official assessments on the other. (Cohen's \kappa is an alternative measure, but I prefer mutual F_1 for the current purposes as it is more easily interpretable in terms of retrieval effectiveness. A discussion of the two measures can be found in Section 3.3.2 of our draft survey paper on Information Retrieval for E-Discovery.)
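For readers who prefer a concrete definition, mutual F_1 (positive agreement) can be sketched in a few lines of Python. This is an illustrative implementation of the measure as described above, not the actual evaluation code used in the experiment:

```python
# Mutual F_1 (positive agreement) between two binary relevance assessments.
# With a = documents both assessors call relevant, and b, c = documents
# only one or the other calls relevant, the score is 2a / (2a + b + c).
# It is symmetric, and equals the F_1 of either assessor scored against
# the other taken as gold standard.

def mutual_f1(assessments_a, assessments_b):
    a = sum(1 for x, y in zip(assessments_a, assessments_b) if x and y)
    b = sum(1 for x, y in zip(assessments_a, assessments_b) if x and not y)
    c = sum(1 for x, y in zip(assessments_a, assessments_b) if not x and y)
    if 2 * a + b + c == 0:
        return 0.0  # neither assessor marked anything relevant
    return 2 * a / (2 * a + b + c)

# Toy example: 3 shared relevant documents, one disagreement each way.
official = [1, 1, 1, 1, 0, 0, 0, 0]
assessor = [1, 1, 1, 0, 1, 0, 0, 0]
print(mutual_f1(official, assessor))  # 2*3 / (2*3 + 1 + 1) = 0.75
```

Note that, unlike Cohen's \kappa, the measure ignores joint negatives (documents both assessors call irrelevant), which is why it reads directly as a retrieval-effectiveness score.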

There's a slightly tricky question about sampling rates and gold standard reliability to consider also. (Readers not interested in statistical niceties can skip this and the next paragraph.) The sample for our experiment was designed to produce an even balance of officially relevant and officially irrelevant documents, since this provided the greatest statistical power for the comparison we were making (in that case using Cohen's \kappa). As it happens, this sampling also increases the concentration of appealed documents. So directly comparing results on only the experimental sample would be unfair to the TREC assessors. Instead, we perform a sample-based extrapolation from the experimental assessments to the sample drawn for assessment at TREC (though not to the full corpus, since this would increase estimate variance without fundamentally changing estimate accuracy).
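The extrapolation itself is a standard inverse-inclusion-probability (Horvitz-Thompson style) estimate: each experimentally assessed document is weighted by the ratio of its stratum's size in the TREC sample to the number drawn from that stratum into the experimental sample. The sketch below is illustrative only; the strata and weights are made up, not those of the actual design:

```python
# Stratum-weighted extrapolation of mutual F_1 from a deliberately
# unbalanced experimental sample back to the parent (TREC) sample.
# Each document carries weight = stratum size in parent sample /
# stratum draw in experimental sample, so over-sampled strata are
# down-weighted in the contingency counts.

def weighted_counts(docs):
    """docs: iterable of (official_relevant, assessor_relevant, weight).
    Returns the weighted contingency counts (a, b, c)."""
    a = sum(w for off, asr, w in docs if off and asr)
    b = sum(w for off, asr, w in docs if off and not asr)
    c = sum(w for off, asr, w in docs if not off and asr)
    return a, b, c

def extrapolated_f1(docs):
    a, b, c = weighted_counts(docs)
    denom = 2 * a + b + c
    return 2 * a / denom if denom else 0.0

# Illustrative: relevant documents were over-sampled (weight 2.0 each)
# relative to irrelevant ones (weight 10.0 each), so each irrelevant
# disagreement counts for more in the extrapolated score.
docs = [(1, 1, 2.0), (1, 0, 2.0), (0, 1, 10.0), (0, 0, 10.0)]
print(extrapolated_f1(docs))  # 4 / (4 + 2 + 10) = 0.25
```

This is why a single lightly-sampled document (as in the bottom stratum discussed next) can move the extrapolated score substantially: it enters the counts with a very large weight.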

A second nicety is in the handling of what I call the "bottom stratum", of documents retrieved by no TREC participant. This stratum was very lightly sampled from the TREC sample into the experimental sample (as it was from the corpus into the TREC sample), so (in extrapolation) each document here has a large impact upon comparisons of effectiveness. At the same time, no team had an incentive to appeal assessments of irrelevance in this stratum (and no team did), so any false negatives (actually relevant documents assessed as irrelevant) will have gone uncorrected. Thus, including the bottom stratum potentially biases the comparison in favour of the TREC assessors. I therefore report results both with and without the bottom stratum.

The statistical prologemena done with, here are our results. The original experiment involved three treatments: a first batch with the topic statement only; a second batch with the detailed criteria of relevance; and then both batches jointly re-assessed by both assessors in a (mostly successful) attempt to reach agreement on relevance. Below, we report mutual F_1 scores against the official qrels for each batch. The TREC assessments were not (and cannot be) divided into the same treatments, so only the one F_1 score is reported across all three treatments. (Put another way, each batch provides an estimate on the full evaluation sample under a different treatment.) The TREC assessors worked independently to the topic authority's detailed guidelines; the guidelines batch therefore offers the fairest comparison with our experimental assessors. I also show the unextrapolated F_1 score of the strongest participating team for this topic (the industry team included in Grossman and Cormack's analysis).

Here then, finally, are the results:

Document set             Batch        Exp-A   Exp-B   TREC    Team-I
With bottom stratum      Topic        0.83    0.22
                         Guidelines   0.73    0.26    0.33    0.89
                         Joint        0.68    0.71
Without bottom stratum   Topic        0.83    0.68
                         Guidelines   0.75    0.59    0.35    0.89
                         Joint        0.70    0.73

We can see that, if the bottom stratum is excluded, both our assessors (Exp-A and Exp-B) outperform the TREC assessor. For each of the two batches, Exp-B found 2 officially irrelevant documents to be relevant, so if this stratum is included in the comparison, Exp-B's score is depressed to below that of the TREC assessor (note the strong sampling effect in this stratum), though assessor Exp-A's and the joint-review scores are unaffected. It was noted above that including the bottom stratum potentially biases the comparison in favour of the TREC assessors, though it should be said that for the four documents in question, our assessors jointly concluded that they weren't in fact relevant (and, having viewed them myself, I concur).

I've not checked the above results for statistical significance. Given the sampling complexities involved, that would be a tricky thing to do. My suspicion is that the differences without the bottom stratum may be significant, but with the bottom stratum likely are not, due to the much greater sampling variability in the latter case. More to the point, though, this is only a single topic, in an experiment not directly set up to measure the comparison made here (and where even the official assessments might contain some remaining errors or ambiguities); even a finding of statistical significance would not show that our assessor A is "provably more reliable" than the TREC assessors. There are also differences of conditions involved; our assessors worked three-hour shifts, assessing around 60 documents (assessor A) to 30 documents (assessor B) an hour, which are less fatiguing conditions than perhaps are typical in large-scale manual review (though the conditions of the professional TREC review are not specified in the TREC overview document). The conclusion that can be reached, though, is that our assessors were able to achieve reliability (with or without detailed assessment guidelines) that is competitive with that of the professional reviewers -- and also competitive with that of a commercial e-discovery vendor.

Time to meet our crack reviewers:

Intern-reviewer Bryan

Intern-reviewer Marjorie


At the time of the experiment, Marjorie and Bryan were high school seniors working in the E-Discovery lab as interns four mornings a week. (Since then they have become ex-high-school students waiting for their summer holidays to end and their studies at the University of Maryland to commence.) They have no legal training, and no prior e-discovery experience, aside from assessing a few dozen documents for a different TREC topic as part of a trial experiment. They performed assessments on TIFF files, displayed on rather underpowered laptops that couldn't fit the full TIFF onto the screen and tended to crash every now and then. They worked independently and without supervision or correction, though one would be correct to describe them as careful and motivated.

All of this raises the question that is posed in the subject of this post: if (some) high school students are as reliable as (some) legally-trained, professional e-discovery reviewers, then is legal training a practical (as opposed to legal) requirement for reliable first-pass review for responsiveness? Or are care and general reading skills the more important factors?

As it happens, the same question has been addressed in a couple of other recent studies, which reached conclusions similar to our own, and thus lend some support to our tentative finding: that legal training does not confer expertise in document review. In A User Study of Relevance Judgments for E-Discovery (Proc. ASIST, 2010), Jianqiang Wang and Dagobert Soergel had four law school and four library and information studies (LIS) students re-assess documents from four TREC topics. In an exit interview, the law students stated that their legal training and experience was helpful in performing their assessments, whereas the LIS students felt such experience would not have helped much. The LIS students appear to have been correct: there was little or no difference between the two assessor groups in reliability or speed. In another study, Legal Discovery: Does Domain Expertise Matter? (Proc. ASIST, 2008), Efthimis N. Efthimiadis and Mary A. Hotchkiss had six groups of MLIS students build a run for the TREC 2007 interactive task. Two of these groups consisted of law librarianship students with JD degrees and professional experience as lawyers or legal searchers; but they achieved no greater reliability than the remaining four groups, who had no legal experience.

Some TREC 2009 topics were initially assessed by law students, others by professional reviewers. The difference in expertise level can be considered here, though with two caveats: first, we're comparing assessors on different topics, and variability in the assessment difficulty of topics appears to be high (see Table 3.5 of our draft survey); and second, we don't know how the professional reviewers conducted their reviews, and there could be differences in process as well as raw expertise. Of the five topics in the TREC 2009 Legal Interactive task that were appealed with sufficient thoroughness for us to be confident that most errors were found (see my SIRE 2011 paper, Re-examining the Effectiveness of Manual Review, for a fuller discussion), three were initially assessed by professional review teams, two by volunteer law students. The picture of the benefits of expertise is mixed for this dataset. If we consider only the documents actually assessed (without making a variance-increasing extrapolation to the corpus), then the law students seem roughly as reliable as the professional reviewers (Table 4 of Re-examining), though if we do extrapolate to the full population, one of the professional review teams does outperform the two student teams and the other two professional teams in both reliability (ibid., Table 2) and consistency (ibid., Figure 2) -- though perhaps this is explained by good process (and care in recruitment), rather than greater legal training.

Now, even if we take at face value the finding that legal training does little to improve reliability in first-pass document review, this does not mean that there are not systematic differences in the skills of reviewers (some will be more careful and attentive readers than others, for instance), or in the accuracy of professional review teams (depending upon the quality of their processes and their care in recruiting said skilled reviewers). I'd not recommend rounding up high school students on summer break to save review costs. And of course legal training and expertise are required to frame, monitor, and interpret the production. Our finding might suggest that first-pass document review is not practice of the law (since it apparently does not employ legal skills), and therefore that using non-lawyers to perform it should not be considered unauthorized practice of the law, but I'll offer a bottle of wine to the lawyer brave enough to argue that in court on the basis of this blog post. Perhaps in the construction of e-discovery test collections, we do not have to insist that all reviewers have legal training, though again this depends upon the acceptability of this to the practicing e-discovery community. The research implications are clearer: we don't know what makes a reliable reviewer; legal training doesn't appear to play a major part, which is a negative outcome; but in compensation this means that findings on the conception and perception of relevance from outside the e-discovery domain can be applied within it with more confidence.

11 Responses to “Do document reviewers need legal training?”

  1. Ryan Calef says:

    I found your article a bit disturbing. You suggest that Document Reviewers whom go to law school, pass the bar, are basically producing the same level of work as a bunch of High Schoolers?

    I know a number of these people. They were trained by law schools, and left out into a world that was overflowing with attorneys. This type of mockery will end up hurting them. Perhaps you should consider these consequences when doing these antics. Some were not as privileged as yourself.

  2. David Shalev says:

    The author clearly is not familiar with the scope of document review on different projects. The simple the project the more likely that said high school students might do better. Some reviewers are bound to make mistakes doing 10-12 hours of routine clicking. But there are other projects which have a more substantive component. I have been on projects where we had to read Federal case law with legalistic reasoning and language. I suggest the author expand his sample size and see if the high school students can tackle these judicial opinions.

    Lastly, MANY professions could be done in part or in whole by high school students. I am sure there are high school students that can do lower level computer programming. There are many high school students who can do simple accounting work on excel.

    Part of the licensing process addresses not just the intellectual capacity to do a certain task but also the professional's adherence to a certain code of conduct and confidentiality.

    It seems like the author would do well to go to law school as his reasoning skills seem to be lacking.

  3. william says:

    Certainly there will be cases in which legal skills are very necessary for review, most particularly where the documents to be reviewed are legal in nature (contracts, for instance, or other agreements typically drawn up by lawyers). But then there will be cases in which other professional skills than legal ones will be required. For instance, if the case involves the structuring of complex financial deals, then financial or accounting skills would be desirable. But even here, I imagine that what you want is not for each individual reviewer to be making their own, possibly divergent judgments based upon their personal expertise in a matter, but carefully following the directions of and engaging with a subject matter expert overseeing the process. And to do that, good reading skills, good concentration, and reviewing experience may be more valuable than legal training.

    I'm sensitive to the fact that I'm citing a handful of small studies to comment upon practice in a multi-billion dollar industry. Large review vendors have extensive experience at running review processes and assessing reviewers, plus the data and resources to do the latter at a scale that academic studies cannot hope to match. If you want to run a review properly, and more particularly defensibly, talk to an experienced practitioner, not me. But existing practice is not always based upon sound evidence; experience can serve to reinforce existing assumptions, even if they're not correct; and vendors aim at profitably providing a service, not objectively and openly assessing the service they provide. It is therefore valid for academic research to investigate assumptions upon which professional and industrial practice is based. The findings of the studies I've cited (including our own) are merely suggestive; but what they suggest is that the assumption that legal training is necessary in general for reliable linear document review may not be correct.

  4. Danny Calegari says:

    Care and general reading skills are important; but it takes years of legal training to come up with insights like "The simple the project the more likely that said high school students might do better." Why bother with statistical analysis at all? Leave all that to the "Document Reviewers whom go to law school."

  5. Danny Calegari says:

    Oh, and you misspelled "prolegomena" . . . (programmer's mistake?)

  6. David Shalev says:

    Danny,

    I am sure you relish the ability to point out my omittence of the letter "r" in "simpler". Might I suggest that the reason is the amount of free time you have to proofread anonymous fast-typing commenters on a blog? I am sure you are proud of yourself. The rest of your post is purely an ad hominem attack which I won't dignify with a response.

    William,
    I find it hard to believe that all things being equal, high school students beat lawyers in document review. It's fair to say that the pool of people who have done well on the LSAT and bar exam, have better reading (and probably reasoning skills) than the pool of people who merely attend high school. One important skill in document review and the abovementioned exams (LSAT and bar exam) is speed reading. The tests are deliberately designed to test how quickly one can read and complete a maximum number of questions in a minimal amount of time.
    What you are basically undermining is the whole area of standardized testing and its relation to intellectual capacity, speed of information processing, etc.... As I mentioned previously, the people who take the SAT in the first place are in the higher end of their academic pool, and so goes for those who take the LSAT and pass the bar exam.
    There are just too many variable to make an accurate assessment at this point - some document review centers admittedly have an overly laxed environment, low quota of documents per hour (low productivity, technical performance issues with the vendors, etc.... Even a high quota of documents per hour will inevitably lead to an emphasis on speed over accuracy.

    EVEN IF all your assumptions were true, which I am confident they are not, you failed to address the issue of licensing requirement. Of course you are entitled to believe that we should do away with licensing. But as of now, we use degrees and licensing requirements as a signaling mechanism that an individual has gone through a certain amount of vetting for intelligence, character, ability, adherence to a level of conduct, etc.. The idea that a random high school student woulf be able to do certain types of legal work is certainly possible. But the same could be said for unlicensed high school barber, accountant, plumber (there are literally thousands of certifications that I can mention).

  7. Danny Calegari says:

    Dear David - thank you for dignifying my post with a response. I understand that a fast-typing commenter like you doesn't have much time in your schedule to spare for ad hominem attacks, and I'm flattered that you were able to make some for me!

    William - I am pretty impressed to hear that you are basically undermining the whole area of standardized testing and its relation to intellectual capacity, speed of information processing, etc. Not bad for a boy from the antipodes!

  8. Danny Calegari says:

    Sorry, I can't help myself - "omittence" is priceless . . .

  9. william says:

    David,

    I don't doubt that you are correct that the average lawyer is better at document review than the average high school student. Indeed, I'd go further and hypothesize the following relationship in terms of review skill: average lawyer > average graduate-trained professional > average high school student. But I'd base this hypothesis primarily on the reasons you yourself ascribe: lawyers are (partly self-) selected for good reading skills; and good reading skills are important for good document review. This would be consonant with the findings of the studies that I cite in my post: the non-legal reviewers being compared with legally-trained reviewers are also likely to have good reading skills (advanced-placement high school students; library science graduate students). What I'm questioning is whether there is some specific technical skill that lawyers obtain through legal training (or through legal experience) that is necessary for reliable document review, above and beyond those of a literate and attentive person in general.

    Let me give an example from another domain. Surgeons will generally have good manual dexterity (a general skill), as well as training in anatomy and surgical technique (a technical skill). The average surgeon is probably better at making model aircraft than the average person. But you don't need to be a surgeon to be good at making model aircraft; other people with good manual dexterity will be as good or better. On the other hand, you wouldn't want anyone without surgical training to perform surgery, no matter how good their manual dexterity was; there is a definite technical skill one requires. If we organized an experiment in which we pitted highly dextrous high school students against run-of-the-mill surgeons at open heart surgery (admittedly a tricky experiment to get ethics clearance for), I hypothesize we'd see strong and statistically significant differences in the outcomes between the two groups. Therefore, it is a good idea that we only allow licensed surgeons to perform surgery.

    Now, there are forms of legal practice for which a definite legal technical skill is required, a skill obtained through legal training (broadly conceived). The legal technical skill here, I suppose we would say, includes familiarity with case law, understanding of legal processes and vocabulary, and so forth. These technical skills go beyond general skills in reading, writing, memory, and public speaking. So if we pitted highly literate, well-spoken high school students with good memories against trained and experienced run-of-the-mill trial lawyers in court cases, we'd expect the latter to win a significantly higher proportion of cases. So, it is a good idea only to allow licensed lawyers to represent clients in court (and to perform other legal roles that require this technical skill).

    What I'm suggesting in my post, however, is that there is no such technical legal skill required for first-pass document review (except, perhaps, in cases where particularly technical legal documents are being reviewed). This proposition seems to be supported by the experiments that have been performed so far, as limited as they are. Certainly, we have not identified all the variables that affect the quality of document review. But if there were a clear technical skill required, then we would have expected this to appear in the experiments that have been done. (Even anecdotal experiments would be enough to detect a clear technical skill in surgery.) So while a legal education may be a signal for good reviewer ability, it does not appear to be a requirement for it. If so, the general reading skills which (I hypothesize) are at the heart of reviewer ability can be determined by other testing means. Whether this means there should be a legal licensing requirement for document reviewers is not for me to say.

    William

  10. JP Carlson says:

    A couple of comments. There should really be no debate that law firms are looking for ways to decrease costs for their clients, and one way of doing this is to ditch paying attorneys to do doc review in large multi-million dollar lawsuits. If it can be shown that anyone with a high school education can perform doc review properly, then this becomes a minimum wage job. And that saves clients money.

    I have a problem with this on many levels and I see that others share my concerns.

    To begin with, I will wager that if I go to my local dry cleaner, fast food chain, nail salon, gas station, health club, convenience store, etc., etc. and pick four people randomly to do doc review, the results will be poor. These folks are friends and neighbors so I have no interest in denigrating their talents and abilities. But the fact is that when we make document review a minimum wage job, we open the door to anyone and everyone to become a document review-person since it apparently takes no special skill. Does this make sense in high stakes litigation involving millions of dollars? Is this what we really want? Does anyone really believe this?

    I must say that the concept of identifying competent document review-people with a test is interesting, but it will never happen. Market forces will insure that it never does regardless of the need for it. If this idea takes root, there will be no turning back. And it will take some significant malpractice lawsuits to reverse it.

    Next, this is a very slippery slope. There are businesses online that enable contractors to fill out and file mechanic's lien forms. When we in the profession take the position that the tasks we perform require no special ability, we harm our profession. We are already in an era where young people running businesses who cannot afford attorneys find the contracts and forms they need online and fill them out without the assistance of attorneys. This fuels the perception among laypeople that attorneys are just people with a form library.

    I believe that this perception - that anyone can do what we do and that we are not worth the money we charge - will ultimately be the undoing of the profession. And this is a shame for many reasons.

  11. [...] William Webber, author of the research blog IREvalEtAl, conducted a study to see how well high school seniors stacked up to legally trained document assessors. Statistically speaking, they are remarkably similar in productivity and efficiency says Webber. To read about the experiment and the detailed results, go to http://blog.codalism.com/?p=1609 [...]
