Relevance density affects assessor judgment

It is somewhat surprising to me that, having gone to the University of Maryland with the intention of working primarily on the question of assessor variability in relevance judgment, I did in fact end up working (or at least publishing) primarily on the question of assessor variability in relevance judgment. The last of these publications, "The Effect of Threshold Priming and Need for Cognition on Relevance Calibration and Assessment" (Scholer, Kelly, Wu, Lee, and Webber, SIGIR 2013), was in some ways the most satisfying, both for the opportunity to collaborate with Falk Scholer and Diane Kelly (both luminaries in this field), and for the careful experimental design and analysis involved.

The question we set out to answer was, in a nutshell, this. If you show relevance assessors a sequence of highly relevant documents, will they tend to become pickier about which documents they consider relevant? Conversely, if you show them a sequence of irrelevant documents, will they become more liberal in their conception of relevance---willing to accept any hint of a responsive keyword as enough to push the document into the relevant class?

The short answer to the nutshell question is, yes: assessors (at least our assessors---more on that caveat later) shown a greater volume of highly relevant documents tended to raise their relevance threshold; those shown few or no relevant documents tended to lower it. More exactly, we divided documents into four levels of relevance: not (0), marginally (1), simply (2), and highly (3) relevant. We created three different prologues of documents, containing highly relevant (H), moderately relevant (M), or no relevant (L) documents, and randomly assigned assessors to prologues. All assessors assessed the same epilogue of documents (for three different topics). We then observed how assessors seeing a given prologue tended to assess documents with a given true level of relevance in the epilogue. The money figure is this one:

[Figure: Mean relevance assigned to documents of actual relevance levels (0, 1, 2, 3), depending upon prologue treatment (High, Medium, or Low density of relevant documents)]

Note that there's relatively little difference in the treatment of actually irrelevant documents (relevance level 0). However, assessors who saw no relevant documents in the prologue (treatment L), or only moderately relevant documents (treatment M), tended to up-weight the relevance of marginally or simply relevant documents they encountered in the epilogue. There's much more in the article---if this snippet has piqued your interest, then please read the whole thing.
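
For concreteness, the quantity plotted is simply the mean assigned level within each cell of prologue treatment by true relevance level. Below is a minimal sketch of that aggregation; the record layout and values are made up for illustration, and this is not the paper's data or analysis code.

```python
from collections import defaultdict
from statistics import mean

# One record per (assessor, epilogue document) judgment:
# (prologue treatment, document's true relevance level, assigned level).
assessments = [
    ("H", 2, 1),
    ("M", 2, 2),
    ("L", 1, 2),
    ("L", 2, 3),
    # ... one tuple for every assessor/document pair
]

# Group assigned levels into (treatment, true level) cells.
cells = defaultdict(list)
for treatment, true_level, assigned in assessments:
    cells[(treatment, true_level)].append(assigned)

# Mean assigned relevance per cell, as plotted in the figure.
for treatment in ("H", "M", "L"):
    row = {
        level: round(mean(vals), 2)
        for level in range(4)
        if (vals := cells[(treatment, level)])
    }
    print(treatment, row)
```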

As is often the case with human-factors experiments of this sort, the precise degree of effect should not be taken as fixed and definitive, since it will vary with conditions (documents, topic, assessors, and so forth). In fact, even the direction of the effect might differ under different circumstances. One prominent e-discovery practitioner I discussed the results with was surprised that the trend wasn't in the other direction: that assessors seeing an extended sequence of irrelevant documents would stop looking carefully and tend simply to mark everything irrelevant. Indeed, under different circumstances, that outcome might occur (our experimental subjects reviewed only 48 documents over a period of an hour, which is far from a fatigue-inducing amount of time and effort in e-discovery terms). The key point is the demonstration, under experimental conditions, that an effect does occur: how assessors judge a document for relevance depends upon the documents surrounding it.

The effect of document context on relevance assessment is a consequential one for e-discovery, and particularly for sampling and estimation of production completeness. Say, for instance, we wished to estimate the recall of a candidate production by sampling from the produced and the null sets, and having the sampled documents assessed for responsiveness. The above experimental result tells us that having the documents sampled from the production assessed in one batch, and those sampled from the null set in another, is a terrible idea: the greatly different densities of relevant documents in the two batches are likely to induce different relevance thresholds, and seriously bias our estimate of recall. Instead, the sampled documents should be randomly mixed and assessed together, to maintain a consistent document context. (The same goes for other assessment conditions, too, including the assessors used.)
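
To make the workflow concrete, here is a minimal sketch. The set sizes, sample sizes, and the simulated stand-in for human judgment are all invented for illustration; nothing here is from the article.

```python
import random

# A hypothetical collection: 50,000 produced, 450,000 null-set documents.
N_PRODUCED, N_NULL = 50_000, 450_000
produced_sample = [("produced", i) for i in random.sample(range(N_PRODUCED), 200)]
null_sample = [("null", i) for i in random.sample(range(N_NULL), 200)]

# Shuffle the two samples together, so that assessors see a single
# mixed stream with a consistent document context.
pooled = produced_sample + null_sample
random.shuffle(pooled)

def judge(doc):
    """Stand-in for the blind human assessment of one document,
    simulated here with arbitrary per-set relevance rates."""
    rate = 0.80 if doc[0] == "produced" else 0.02
    return random.random() < rate

judgments = {doc: judge(doc) for doc in pooled}

# Scale the sampled relevant counts back up to the full set sizes.
est_rel_produced = (
    N_PRODUCED * sum(judgments[d] for d in produced_sample) / len(produced_sample)
)
est_rel_null = N_NULL * sum(judgments[d] for d in null_sample) / len(null_sample)

recall = est_rel_produced / (est_rel_produced + est_rel_null)
print(f"Estimated recall: {recall:.2f}")
```

The point of the shuffle is that every assessor then faces roughly the same density of relevant documents throughout the review, so any priming effect bears equally on documents sampled from the production and from the null set.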

Even more fundamentally, we should not assume that just because a human has made a relevance assessment, that assessment is correct. Human assessors are subject to biases and errors; those biases and errors can be systematic; the conditions under which the assessments occur can influence the biases and errors; and many aspects of those conditions are under our control. E-discovery practice is becoming increasingly sophisticated in the quality control of the automated components of the process. A similar level of care needs to be taken in the quality control of the human components, too.

4 Responses to “Relevance density affects assessor judgment”

  1. Typo: "tend to me" (2nd paragraph) should (?) be "tend to be".

  2. william says:

    Thanks, fixed.

  3. SK says:

    Hi William,

What was the time delay between documents in the sequence of relevant documents shown? If there is only a small delay between relevant documents, then I think this effect could be temporary rather than permanent. If that is correct, then the rise in the relevance threshold is temporary, which is a great observation for web search engines dealing with users. But assessors can come back tomorrow, and the relevance threshold will be back at the same starting point as today's. If my understanding is correct, then is there any way to say how many documents (K) users can assess within a specific period, say daily, without the effect of assessor threshold flexibility?

  4. william says:

    All assessment was done within a one-hour session, so there was little delay between relevant documents. I agree that it is likely that the threshold would be to some degree "reset" if the assessors were to take an extended break in the middle of reviewing. But then you would also have other issues of consistency arising from such a break.
