Assessor error in legal retrieval evaluation

Another year, another CIKM. This marks my first post-PhD publication (I finally submitted!), and it also marks a new sub-genre of retrieval evaluation for me: that of legal retrieval, or more specifically e-discovery. Discovery is a process in which party A produces for party B all documents in party A's possession that are responsive to a request or series of requests lodged by party B. These requests most saliently take place during civil litigation, where party B is gathering evidence for a suit against party A, but they can also occur as part of governmental regulatory oversight, and I guess more broadly whenever one party has the legal right to access information in the possession of a second, potentially adversarial party. E-discovery is the extension of the discovery process to electronically stored information, which brings with it, on the one hand, a greatly increased volume of documents, and, in compensation on the other, the potential to use automatic tools as part of the production process.

Retrieval in e-discovery is very different from retrieval through web search. Web searches are generally the work of a single person; they are frequently ephemeral, address an information need that is hardly ever explicated and is often inchoate in the user themselves, and end in the subjective satisfaction or frustration of the user. E-discovery, on the other hand, is a process involving dozens of experts working for several interested or supervisory parties, taking place over many weeks, frequently costing millions of dollars, and performed in an explicit, negotiated, and documented way. Additionally, while the web searcher is typically satisfied with one or a handful of documents providing the particular information they were after, e-discovery aims to find (within reason) all documents that are related to a case. In the traditional jargon of retrieval evaluation, web search is precision-centric, e-discovery recall-centric.

E-discovery, then, is like the mediated, formalized, "serious" information retrieval of the good old days writ large. And besides its attractions to the nostalgic, e-discovery is a very sympathetic field for investigations into retrieval evaluation. Whereas success in web search is subjective, neither formally measured by the user nor definitively observed by the search engine, in e-discovery, measuring the success of the retrieval process is an integral part of the process itself, one that is increasingly stressed by case law. Therefore, the techniques developed in the experimental, collaborative, or laboratory evaluation of e-discovery should inform the quality assurance and certification methods of e-discovery practice.

The contribution we are offering at CIKM this year concerns measuring assessor error, particularly in the Legal Track of TREC. An e-discovery process is supervised by a senior attorney, whose conception of relevance is authoritative. The role of the senior attorney is played in the track's interactive task by a topic authority (TA), themselves a practicing attorney in real life. Actual relevance assessment is performed by multiple volunteer assessors, but the assessors' role is to apply the TA's conception of relevance, not their own. Therefore, whereas other evaluation fields talk of assessor disagreement, in the Legal Track we can talk of assessor error. Moreover, this assessor error is directly measurable, by referring assessor judgments to the TA for adjudication, something which is currently done via a participant-instigated appeals process. And in an environment where we care about absolute, rather than merely relative, measures of system performance, and particularly of system recall, assessor error can seriously distort our evaluation outcomes -- much more so, in general, than the better-understood effects of sampling error.
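
To get a feel for how distorting assessor error can be, consider a toy calculation (the numbers here are invented purely for illustration; they are not drawn from the track or from our paper). Suppose a collection of a million documents contains ten thousand truly relevant ones, the system under evaluation retrieves exactly those ten thousand (true recall of 100%), and the assessors miss 20% of relevant documents while marking 2% of irrelevant ones as relevant:

    # Toy illustration (invented numbers): assessor error biases measured recall.
    N, R = 1_000_000, 10_000      # collection size, truly relevant documents
    fn, fp = 0.20, 0.02           # assessor false-negative and false-positive rates

    # Apparent relevant documents across the whole collection:
    apparent_total = R * (1 - fn) + (N - R) * fp        # 8,000 + 19,800 = 27,800
    # Apparent relevant documents among the 10,000 retrieved (all truly relevant):
    apparent_retrieved = R * (1 - fn)                   # 8,000

    print(apparent_retrieved / apparent_total)          # ~0.29, though true recall is 1.0

Measured recall comes out below 30% for a system whose true recall is perfect; and since this is a bias rather than a variance, judging a larger sample with the same assessors does nothing to remove it, whereas sampling error shrinks as the sample grows.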

The presence of the TA, the parallel assessment of randomly partitioned document bins by different assessors, and the outcomes of the appeals process, all produce what seems like a wealth of evidence for measuring and correcting assessor error. As we show in our paper, however, none of this evidence is in a form that is directly useful to us in correcting our measures of absolute performance (or at least we can't see how it is; you are more than welcome to try your hand at it yourself). Instead, we propose that a double sampling approach be used, the theory of which we describe, and indicative results of which we produce -- empirical work remains to be done on the application of double sampling to a complete retrieval evaluation.
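
For readers who haven't met double sampling (also called two-phase sampling) before, here is a minimal sketch of the general idea as it might apply here; the names and numbers are invented for illustration, and this is the textbook post-stratified estimator rather than the precise formulation in our paper. In the first phase, fallible assessors judge a large sample of documents; in the second, the TA adjudicates small random subsamples of the assessor-positive and assessor-negative documents, and those adjudications are used to correct the first-phase count of relevant documents:

    # Minimal sketch of a double-sampling (two-phase) correction for assessor
    # error.  Phase 1: fallible assessors judge a large sample.  Phase 2: the
    # topic authority (TA) adjudicates small random subsamples of the assessor-
    # positive and assessor-negative documents, and is treated as ground truth.
    # Names and numbers below are illustrative only.

    def corrected_relevant_count(assessor_labels, ta_on_positives, ta_on_negatives):
        """assessor_labels: phase-1 assessor judgments (True = judged relevant).
        ta_on_positives: TA judgments on a random subsample of assessor-positives.
        ta_on_negatives: TA judgments on a random subsample of assessor-negatives.
        Returns a corrected estimate of the number of truly relevant documents."""
        n_pos = sum(assessor_labels)
        n_neg = len(assessor_labels) - n_pos

        # Probability of true relevance within each assessor stratum, estimated
        # from the TA-adjudicated subsamples.
        p_rel_given_pos = sum(ta_on_positives) / len(ta_on_positives)
        p_rel_given_neg = sum(ta_on_negatives) / len(ta_on_negatives)

        # Post-stratified estimate: scale each stratum by its TA-derived rate.
        return n_pos * p_rel_given_pos + n_neg * p_rel_given_neg

    # For example: 10,000 assessor-positives and 90,000 assessor-negatives, with
    # TA subsamples suggesting 85% precision among the positives and a 2% miss
    # rate among the negatives, give 8,500 + 1,800 = 10,300 relevant documents.
    labels = [True] * 10_000 + [False] * 90_000
    print(corrected_relevant_count(labels,
                                   [True] * 85 + [False] * 15,
                                   [True] * 2 + [False] * 98))

Applied separately to the retrieved and unretrieved strata, the same correction yields a corrected recall estimate, and the TA's effort is confined to the small second-phase subsamples, which is what makes the approach attractive when TA time is the scarce resource.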

On a more general note, it has become customary for me when describing conference papers to animadvert on the bizarreness of conferences as a mechanism for transacting academic business in the internet age. The death-march grind of CIKM does not begin till tomorrow, so I should keep my powder dry. I will only say at this point that while Toronto seems on brief inspection a very attractive city, no-one should have to go through Los Angeles airport in order to have their research ideas heard.

4 Responses to “Assessor error in legal retrieval evaluation”

  1. This is a great idea and I can't wait to see the result.

    Ever since I started annotating data myself, I've been trying to convince others to accept that the "gold standards" out there are not only impure, but not necessarily refinable.

    The strategy you propose, adjudication by a "topic authority", assumes that there exists a well-defined notion of relevance. In my experience, real data often resists clean categorization, even by the person who's defining and driving the task.

    I've seen this for gene and protein names in text [what exactly counts as a gene reference], sentiment classification [can't tell if review is positive or negative], coreference [dangling pronoun with unclear antecedent], and other NLP-related tasks.

  2. william says:

    You're certainly right that gold standards in relevance and other human assessments are not fixed. In e-discovery, though, we do have a hierarchy of assessment authority, with the algorithm deferring to the reviewer, the reviewer to the responsible attorney, and the responsible attorney to the judge. So there are two distinct problems. The first is verifying that the reviewer has applied the responsible attorney's considered conception of relevance. The second is checking that the responsible attorney's conception of relevance is indeed considered and stable. For instance, we could try to detect where the responsible attorney gives two similar documents disparate assessments, or (more statistically) where the attorney's proportion relevant varies over time. The first problem is the one we've begun to tackle with our CIKM paper; the second, we've hardly started to think about.

  3. It was amusing to start this conversation in person yesterday, and to discover, only after going over the first moves, that we knew each other only by blog titles.

    For the blog record, I was referring to what are actually borderline/difficult/impossible cases to classify, not to inconsistency or concept drift, though I do believe the borderline cases are the ones most likely to be inconsistently labeled or to drift under coding standard revisions or adjudications.

    A related issue is that for the exact same counterparty request, different attorneys will have a different decision boundary, perhaps determined by many different document features. Judges (in the legal sense) do the same, and the attorney may even have knowledge of the judge's preferences in making their own decisions.

    In the TREC legal track case, my concern would be about documents where the responding attorney was truly uncertain whether the judge would consider them responsive, and about cases where five different choices of responding attorney and judge would result in five different opinions about relevance. The cases on which they disagree would be the truly fuzzy or uncertain ones; that is, ones you can't judge based solely on the counterparty request itself.

    I'm guessing that whether an attorney includes such a borderline document would depend on their assessment of the relative risks of non-response versus inadvertent disclosure.

  4. william says:

    Hmm, that is an interesting point. I'd say that for individual documents the risk equation is heavily weighted in favour of production: the potential penalty for missing a relevant document is so much greater than for including an irrelevant one that, if there were any doubt, you'd include the document. In practice, though, the overseeing attorney is dealing as much with classes of documents as with individual documents: they are giving their reviewers directions as to what to regard as relevant, and if they adjudicate an individual document, it is as an exemplar of a class of documents. Here, there is a real danger of classification drift; once you start shifting borderline classes of documents onto the relevant side of the border, you can end up with an explosion of relevance and gross over-production.

    Another thing to point out is that the real review process is an iterative one, although this is not well captured by the current setup of the Legal Track. The process will often start off with the exclusion of broad, "clearly" irrelevant swathes of documents from the data dump that winds up on the review team's desk (holiday snapshots, mass distribution emails, etc.); this stage is often overseen by cheap, inexpert reviewers. Then junior attorneys will work through the remaining material, pulling out document groups that are possibly relevant and categorizing them into case issues. Next, more senior attorneys will review these documents for actual case preparation. The point at which documents are handed over to the counterparty depends upon the legal system in operation: in the US system, one hands over relevant documents only; in the British/Commonwealth one (if I understand correctly), one hands over "everything" except privileged documents, and it is up to the other side to categorize things.

    Anyway, the point I'm labouring towards is that not only are different decisions made at each stage of the process, but teams want the ability at a later stage to modify decisions made at an earlier one; if the senior attorneys somehow discover that some of the holiday snapshots may be relevant, they want to be able to re-review the snapshot set for relevance without having to redo the entire review stage.
