Can you train a useful model with incorrect labels?

On this blog, we are in the middle of a series of simulation experiments on the effect of assessor error on text classifier reliability. There's still some way to go with these experiments, but in the meantime the topic has attracted attention in the blogosphere. Ralph Losey has forcefully reiterated his characterization of using non-experts to train a predictive coding system as garbage in, garbage out, a position with which he regards Jeremy Pickens and me as disagreeing. Jeremy Pickens, meanwhile, has responded by citing Catalyst experiments on TREC data which show (remarkably) that even a model trained entirely with incorrect labels can be almost as useful as one trained by an expert.

The key point to bear in mind when considering this issue is the relative primitiveness of the text classification algorithms that underlie predictive coding systems. They work, essentially, by identifying vocabulary that tends to occur in responsive documents, and distinguishing it from vocabulary that tends to occur in non-responsive ones. The algorithms are not able to learn the fine-grained semantic distinctions that might separate an expert from a non-expert in their understanding of documents and their legal import. Now, documents that are truly responsive will often share much of their terminology with documents that look responsive to a non-expert but in fact are not. The degree to which this is true will depend on the case and the collection; if, for instance, there is a specialist vocabulary that identifies responsive documents, then the expert-trained system may indeed substantially outperform the novice-trained one.
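To make the vocabulary-based learning concrete, here is a minimal sketch using scikit-learn's CountVectorizer and LogisticRegression; the documents and labels are invented toy data, and real predictive coding systems of course work with far larger feature sets and training collections. The learned per-term weights are, essentially, all the "understanding" such a model has of responsiveness.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training documents; 1 = responsive, 0 = non-responsive (invented labels).
docs = [
    "merger agreement confidential term sheet",
    "draft merger term sheet attached for review",
    "lunch menu for friday",
    "fantasy football league standings",
]
labels = [1, 1, 0, 0]

# Bag-of-words representation: each document becomes a vector of term counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# A linear classifier learns one weight per term; positive weights mark
# vocabulary associated with responsive documents, negative weights mark
# vocabulary associated with non-responsive ones.
model = LogisticRegression().fit(X, labels)

# Inspect the per-term weights, strongest "responsive" terms first.
for term, weight in sorted(
        zip(vectorizer.get_feature_names_out(), model.coef_[0]),
        key=lambda tw: tw[1], reverse=True):
    print(f"{term:14s} {weight:+.3f}")
```

Nothing in this model looks past the surface vocabulary, which is why a non-expert who recognizes responsive-looking terminology can supply much of the same training signal as an expert.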

Another issue (and I owe this observation to conversation with Jeremy Pickens) is that non-expert annotators tend to be more liberal in their judgments of responsiveness than experts. This can help a predictive coding system, because the system tends to learn at the rate at which it sees minority-class examples, and responsive documents are the minority class. That is, even if the non-expert marks some actually non-responsive documents as responsive, this may help the system learn more quickly, provided those documents nevertheless bear the superficial characteristics of responsiveness. (One might even suppose that the fine distinctions of the expert could confuse the predictive coding system, by disqualifying responsive-keyword-rich documents on technicalities.)
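A toy simulation, along the lines of (though far simpler than) the experiments mentioned above, can illustrate this. The setup below is entirely my own invention, not the Catalyst methodology: a minority of documents are truly responsive; a further group of non-responsive "look-alikes" share the responsive vocabulary; and a liberal annotator marks the look-alikes responsive as well. Each labelling then trains a logistic regression ranker, scored by AUC against the ground truth. All rates and sizes are arbitrary illustration values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, v = 4000, 60                                # documents, vocabulary size

resp_vocab = np.zeros(v)
resp_vocab[:12] = 1.0                          # terms typical of responsive docs

truth = rng.random(n) < 0.10                   # ~10% truly responsive (minority)
lookalike = ~truth & (rng.random(n) < 0.10)    # non-responsive but keyword-rich

# Term counts: responsive docs and look-alikes share the surface vocabulary.
rate = np.full((n, v), 0.3)
rate[truth | lookalike] += 1.5 * resp_vocab
X = rng.poisson(rate)                          # bag-of-words count vectors

train = rng.random(n) < 0.05                   # small training sample

labellers = {
    "expert":  truth,                          # labels match ground truth
    "liberal": truth | lookalike,              # over-marks the look-alikes
}
for name, y in labellers.items():
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    scores = model.predict_proba(X[~train])[:, 1]
    print(name, "AUC against ground truth:",
          round(roc_auc_score(truth[~train], scores), 3))
```

In runs of this toy setup the two AUC scores tend to come out close, since the mislabelled look-alikes carry much the same vocabulary signal as the true positives; whether, and how far, that holds on real collections is precisely what the simulation experiments discussed above are meant to test.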

Is, then, the expert redundant? Certainly not! The predictive coding system gives only a rather broad-brush prioritization of potentially responsive documents; someone still needs to go through and read these candidates to determine which are responsive and which are not. This is the role of the expert; and it is a role they may be better placed to play if they have been actively involved in training the predictive coder, even if their expertise is not strictly required to come up with optimal training examples. It may also be, though this is speculation on my part, that a trainer who is not only a subject-matter expert but also an expert in training itself (an expert CAR driver, to adopt Ralph Losey's terminology) may be better at selecting training examples; for instance, in recognizing that a document, though responsive (or non-responsive), is not a good training example.
