Annotator error and predictive reliability

There has been some interesting recent research on the effect of using unreliable annotators to train a text classification or predictive coding system. Why would you want to do such a thing? Well, the unreliable annotators may be much cheaper than a reliable expert, and by paying for a few more annotations, you might be able to achieve equivalent effectiveness and still come out ahead, budget-wise. Moreover, even the experts are not entirely consistent, and we'd like to know what the effect of these inconsistencies might be.

Ralph Losey advocates the use of a single subject matter expert for training, while Esuli and Sebastiani report alarming falls in classifier effectiveness with even mild annotator error rates. In contrast, Jeremy Pickens of Catalyst finds only a marginal loss in effectiveness with non-expert annotators, correctable with moderate quality control, while Scholtes et al. report only slight degradation even at quite high annotator error rates. Elsewhere, Jeremy and I find the annotator error effect to be moderate but real.

Given these discrepant results, it might be interesting for us to run some experiments on this question here. What we would ideally like is a dataset with relevance assessments that are both complete and repeated, but unfortunately no such public dataset exists (at least that I'm aware of---please let me know if you know of one), so we're going to have to choose between completeness and repeated assessments. My work with Jeremy, and Jeremy's work at Catalyst, used datasets with repeated judgments, but only a small proportion of the complete collection judged. Here instead, I'm going to follow Scholtes et al. and Esuli and Sebastiani in taking a completely but singly assessed collection---the trusty RCV1v2 test collection---and simulate assessor errors. Simulation is obviously artificial, but it has the advantage that we can experiment with different types and degrees of (artificial) errors, and see what effect they have. The RCV1v2 collection of hierarchically categorized news articles is also cleaner and more homogeneous than the sort of data one normally encounters in e-discovery, but this in fact makes it a less noisy dataset to work with experimentally.

The classifier we'll use is the Vowpal Wabbit system, version 7.4, with the following learning parameters:

--learning_rate=2 --passes=1 --loss_function=logistic --l1=0 --l2=0

I've done a bit of parameter searching to determine that these parameters do reasonably well on this dataset. Note that you'll get quite different results if you use VW version 6.X, as the scoring function has changed between the versions.
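In case it helps anyone reproduce the setup, here's a rough sketch of driving VW from Python with those parameters. The file names (train.vw, test.vw, model.vw, preds.txt) are placeholders of my own, not part of the actual experiment scripts:

    # Sketch of training and scoring with Vowpal Wabbit 7.4 from Python.
    # File names are hypothetical placeholders.
    import subprocess

    train_cmd = [
        "vw", "-d", "train.vw", "-f", "model.vw",
        "--learning_rate=2", "--passes=1", "--loss_function=logistic",
        "--l1=0", "--l2=0",
    ]
    test_cmd = [
        "vw", "-d", "test.vw", "-i", "model.vw", "-t",  # -t: test only
        "-p", "preds.txt",
    ]
    subprocess.check_call(train_cmd)   # train the linear model
    subprocess.check_call(test_cmd)    # score the held-out documents

Note that with --loss_function=logistic, VW expects class labels of -1 and +1 in the input files.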

Let's start with a very simple model of annotator error, in which for each document the annotator assesses, they have a 1 - E probability of giving it the correct label, and an E probability of giving it the incorrect one. We'll call this the "fixed error rate E" model. The probability of error is independent of document contents and correct label, and also of assessor decisions on other documents. In essence, we have a perfectly expert assessor who, 100 * E% of the time, slips and presses the wrong button in registering their assessment. This is obviously an unrealistic model of how assessors actually make errors, but it is pretty much the simplest model one can think of, and so is a natural starting point.
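In code, the fixed error rate model is nothing more than an independent coin flip per assessed document; a minimal sketch (the function name and label encoding are my own):

    import random

    def flip_labels(labels, error_rate, seed=1):
        """Return a copy of labels (True = relevant) with each label
        independently flipped with probability error_rate."""
        rng = random.Random(seed)
        return [(not lab) if rng.random() < error_rate else lab
                for lab in labels]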

We train with 20,000 randomly selected documents, each of which has its label flipped with varying error rates E, and test against 50,000 other randomly selected documents (with error-free labels). (Results with 5,000 and 10,000 training documents are not markedly different.) For each error rate, we repeat the label flipping 5 times, and average across these randomizations; however, we use the same 20k training, 50k testing split throughout. We'll use depth for recall R as our metric; that is, how far we have to go down the predictive ranking produced by the classifier to achieve recall level R.
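Depth for recall is straightforward to compute from the classifier's scores on the test set. Here's a sketch, assuming boolean true labels and scores where higher means more likely relevant (again, the names are mine, not from any toolkit):

    def depth_for_recall(scores, true_labels, target_recall=0.8):
        """Fraction of the ranking that must be reviewed, going down in
        decreasing score order, to reach target_recall of the truly
        relevant documents."""
        total_relevant = sum(true_labels)
        ranked = sorted(zip(scores, true_labels), key=lambda x: -x[0])
        found = 0
        for depth, (_, relevant) in enumerate(ranked, start=1):
            found += relevant
            if found >= target_recall * total_relevant:
                return depth / len(ranked)
        return 1.0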

The RCV1v2 collection has 103 topical categories. Let's begin by looking at the change in effectiveness scores for the following three topics, selected to represent a range of different prevalences:

Label   Description         Prevalence
M14     Commodity Markets   10.6%
C171    Share Capital       2.3%
G158    Consumer Prices     0.5%

The depth-for-recall results with increasing error rates on these three topics look like so:

[Figure: Depth for recall with increasing annotator errors]

The degradation in effectiveness with assessor error varies considerably between the three topics. A 10% error rate increases the depth for 80% recall on Topic G158 from the error-free depth of 8.6% to 46.1%, and of Topic C171 from 4.2% to 19.6%, roughly five- and four-fold respectively. In contrast, the depth for recall on Topic M14 increases much more modestly from 8.7% to 9.9%, and it is only when assessor error exceeds 20% that the depth for 80% recall doubles.

These per-topic results suggest that higher-prevalence topics suffer less from (our simulation of) annotator error than low-prevalence ones do; and indeed this relationship generalizes across the 103 topics in the RCV1v2 dataset. Grouping the topics by prevalence and taking, at each error level, the median multiple of depth for recall over that achieved with error-free annotation, we get the following figure:

[Figure: Median multiple of depth for recall by error rate, grouped by topic prevalence]

Topics with prevalence above 5% don't generally see depth for recall double until the error rate is around 20%, whereas topics with prevalence below 5% see depth double with error rates of 5% or less. (The anomalous behaviour of topics with prevalence below 0.3% is because their error-free depth for recall 80% is already so deep---a median of 32.6%---that there's not room for it to degrade too far.)
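For concreteness, here is a sketch of the aggregation behind that figure; the data structures (per-topic depths keyed by error rate, a list of prevalence bands) are hypothetical stand-ins for however the per-topic results happen to be stored:

    from statistics import median

    def median_depth_multiples(depths, error_free_depth, prevalence,
                               bands, error_rates):
        """depths[topic][error_rate] is depth-for-recall under simulated
        errors, error_free_depth[topic] the depth with clean labels, and
        prevalence[topic] the topic's prevalence; bands is a list of
        (low, high) prevalence intervals.  Returns, per band and error
        rate, the median across topics of the multiple of errorful over
        error-free depth."""
        result = {}
        for low, high in bands:
            topics = [t for t in depths if low <= prevalence[t] < high]
            result[(low, high)] = {
                e: median(depths[t][e] / error_free_depth[t] for t in topics)
                for e in error_rates
            }
        return result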

Striking as these results seem, we have good reason to suspect that they are an artifact of the simplistic way we are simulating errors. An error rate of 10% for a topic with prevalence 10% will lead to a training set as follows:

  • 9% labelled and actually relevant
  • 9% labelled relevant but actually irrelevant
  • 1% labelled irrelevant but actually relevant
  • 81% labelled and actually irrelevant

whereas the same error rate for a topic with prevalence 1% will lead to the following training set:

  • 0.9% labelled and actually relevant
  • 9.9% labelled relevant but actually irrelevant
  • 0.1% labelled irrelevant but actually relevant
  • 89.1% labelled and actually irrelevant

That is, in the low prevalence case, the (true) positive examples get swamped by false positives. And the coupling of varying prevalence and fixed error rate is unrealistic. One would expect that for topics where relevant documents are rare, documents that falsely appear relevant (to an inexpert or careless assessor) will also be rare.
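As a check on the arithmetic, under the fixed error rate model each of these fractions is just a product of prevalence and error rate; a small sketch:

    def label_composition(prevalence, error_rate):
        """Expected training-set composition under the fixed error rate
        model, as fractions: (labelled and actually relevant,
        labelled relevant but actually irrelevant,
        labelled irrelevant but actually relevant,
        labelled and actually irrelevant)."""
        p, e = prevalence, error_rate
        return (p * (1 - e), (1 - p) * e, p * e, (1 - p) * (1 - e))

    # label_composition(0.10, 0.10) gives the 9% / 9% / 1% / 81% breakdown,
    # label_composition(0.01, 0.10) the 0.9% / 9.9% / 0.1% / 89.1% one.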

Thus, all we can hypothesize so far is that:

  1. The effect of a given level of assessor error depends (among other things) upon the prevalence of positive examples in the training set.
  2. At least for purely random errors, having up to around half of your positively labelled training examples be false positives generally leads to less than a 50% increase in depth for recall.

We'll probe both these hypotheses with more flexible, and realistic, error models in later posts.
