Random vs active selection of training examples in e-discovery

The problem with agreeing to teach is that you have less time for blogging, and the problem with a hiatus in blogging is that the topic you were in the middle of discussing gets overtaken by questions of more immediate interest. I hope to return to the question of simulating assessor error in a later post, but first I want to talk about an issue that is attracting attention at the moment: how to select documents for training a predictive coding system.

The catalyst for this current interest is "Evaluation of Machine Learning Protocols for Technology Assisted Review in Electronic Discovery", recently presented at SIGIR by Gord Cormack and Maura Grossman. In fact, there are two questions, somewhat conflated in Maura and Gord's paper. The first is, should the initial seed set of training examples be selected by random or by judgmental sampling (for instance, based on a keyword query)? The second is, having seeded the training cycle, how should subsequent training examples be selected? I'll tackle the second issue here; I hope to return to the first in a later post.

Gord and Maura describe three ways that we could select post-seed training documents. These are:

  1. Random Draw a simple random sample of the unlabelled documents in the collection.
  2. Uncertainty Select those documents whose responsiveness the predictive coding system is most uncertain about.
  3. Relevance Select the unlabelled documents the predictive coding system thinks are most likely to be responsive.

Both Uncertainty and Relevance are forms of active learning, in which the output of the predictive coding system guides the choice of training examples. In contradistinction, random selection is sometimes referred to as passive learning. There are other possible methods (for instance, we could ask the subject matter expert to search for new training examples; or one could seek to diversify the selection), and variations and mixtures of the above. But here we will look at these three only, and each in its simplest form.

All three of the above methods are used and advocated in e-discovery practice. Uncertainty is perhaps the predominant mode, but Herb Roitblat (among others) is a supporter of random or passive selection. Relevance-based selection, meanwhile, is use in the "continuous active learning" advocated in Gord and Maura's paper, and is a component of continuous ranking method deployed by Catalyst (along with a mix of complementary selection methods).

There are also two different criteria to consider in evaluating selection methods:

  1. Which method improves classifier accuracy with the least training documents?
  2. Which method achieves production recall with the minimum total coding effort (both training and review)?

In protocols where no document is produced without human review, these criteria are distinct. Here, the goal of classifier improvement is reduced review effort; but at some point, the marginal reduction in review cost may be less than the additional cost of training. Moreover, each likely-responsive training document is one fewer document to be reviewed. This is the motivation of the continuous approach: training and review are the same (continuous) process, and likewise this process "continues" until sufficient relevant documents are found. Nevertheless, in this post, I am only going to consider which selection method improves classifier accuracy most efficiently. Besides avoiding the complexity of a fuller cost model, this choice allows us to analyze classifier effectiveness in isolation. I hope to examine total production cost in a later post.

(There is a third criterion by which we could evaluate selection methods: whether they lead to a biased classifier, one that covers certain sub-topics better than others. Random selection aims to avoid this bias. Once more, I hope to look at this in a later post.)

I'm also going to throw in a fourth document selection pseudo-method:

  1. Balanced Randomly sampling an equal number of responsive and non-responsive training examples at iteration.

Of course, this method cannot be implemented in practice, since we don't know at selection time which documents are responsive and which are non-responsive (if we did, we would not need a predictive coding system!). Introducing the "Balanced" method, however, allows us to examine the effect on classifier effectiveness of the distribution of positive and negative training examples. (We can implement "Balanced" in simulation studies, because all labels are oracularly known to us, even if they are withheld from the classifier for experimental purposes.)

The experimental data I'll use for this study is the RCV1v2 dataset, a standard dataset for text classification work. RCV1v2 consists of slightly over 800,000 Reuters newswire articles, annotated with 103 hierarchical news categories. The data is, therefore, not typical of e-discovery data; but in compensation, it is relatively free of noise, provides many topics to investigate, is fully annotated, and is freely available (so please try replicating my results!).

Experimental details: I use the tokenized version of the dataset, from Online Appendix 12. The classifier is Vowpal Wabbit version 7.4, with the flags --loss_function=logistic --learning_rate=2 --passes=1 --raw_predictions. All methods, including the "balanced" method, start off with an identical random sample of 100 documents. Subsequent iterations are of 100 documents each, up to 1500 documents. The target set from which training examples are drawn is a random sample of 200,000 documents; a disjoint random sample of 20,000 documents is used as the test set. The evaluation metric is Maximum F1; that is, the highest F1 achieved across all cutoffs of the ranking on the test set.

The following figure shows the Max-F1 learning curves (solid lines) for each of our 4 selection methods across 14 representative topics. (I include only topics with richness above 1%, to match the random seed set size of 100.) The figures also show the proportion of responsive documents in the training set (dotted lines):

Learning curves for different example selection methods

Learning curves for different example selection methods

(Click here for a scalable PDF version.)

Two things stand out from these results. First, the Uncertainty method (solid green line) outperforms the Random and Relevance methods (red and black lines) in almost every case. And second, the Balanced pseudo-method (blue line) performs as well as the Uncertainty method for larger training sets, and better for smaller sets on low richness topics. In fact, the richness of training data of the Uncertainty method follows the Balanced method quite closely in most cases; and it is when Uncertainty has fewer positive examples than Balanced that the former method underperforms the latter. Since the Uncertainty method selects training documents that it predicts are 50% likely to be responsive, it is not surprising that it creates a relatively balanced training set; perhaps balance is as important as the selection of uncertain documents per se for Uncertainty's superior performance.

A closer examination of the Random and Relevance methods confirms the importance of training example balance. Random sampling produces approximately the same richness in the training set as in the collection, and does very poorly with low richness topics, though not so badly with higher richness ones. Conversely, the Relevance method does respectably for low richness topics, but poorly for high richness ones. Indeed, for the latter topics, many selection iterations produce almost solely responsive examples. Of course, this is in part what the Relevance method sets out to achieve; but the effect on classifier effectiveness is marked: for several high-richness topics, classifier effectiveness under the Relevance method actually decreases with additional training data.

The previous figure gave results for 14 topics; now let's look across all 51 topics with richness of at least 1%. I'll divide them into topics with richness of 3.4% or lower (26 topics), and those with richness above 3.4% (25 topics). The Max F1 score at each training set size for each selection method, averaged across topics, is as follows:

Mean MaxF1 across topics for different selection types

Mean MaxF1 across topics for different selection types

The trends we observed previously are confirmed: Relevance outperforms Random selection for low richness topics; Random outperforms Relevance for high richness topics; but Uncertainty outperforms both methods in both conditions. Balanced selection quickly rises to its asymptotic effectiveness, suggesting that one of the main challenges early in training is to find sufficient responsive training documents (though examining the results of the Relevance method for individual topics in the first figure alerts us of the danger of getting "stuck" in a sub-topic by trying too directly to focus on responsive training examples). With longer training runs, though, the Uncertainty method achieves a higher asymptotic effectiveness than Balanced selection---an interesting finding, that suggests Uncertainty selection does work to exclude poor training documents.

These results should be compared with those of Dave Lewis and William Gale, "A Sequential Algorithm for Training Text Classifiers" (SIGIR, 1994), and particularly of the corrigendum. Lewis and Gale likewise compare the Uncertainty and Relevance methods, although with a different classifier and a different dataset. In particular, all of the topics considered had very low richness (from 0.07% to 0.49%); Lewis and Gale get around the bootstrapping problem on such low richness topics by starting with 3 randomly selected positive examples. On these low-richness topics, the Relevance method gave similar performance to Uncertainty selection, though with smaller training set sizes, Uncertainty was generally superior. The average performance of the different methods on RCV1v2 topics with richness below 1% (but starting with a random sample of 100 documents) is shown in the following figure:

Mean MaxF1 across topics with richness below 1%

Mean MaxF1 across topics with richness below 1%

These results are similar to Lewis and Gale: on such low richness topics, Uncertainty is only marginally better (on average) than Relevance selection. For such low richness topics, though, it is unlikely that the seed set would be created by the tiresome process of randomly sampling till enough positive examples occur. I hope to return to examine these low-prevalence cases with an alternative seed set selection method (such as querying).

What conclusions can we draw from this study? First, viewed solely from the proportion of relevant documents found (and ignoring questions of possible bias), the Random selection of training documents is a poor choice, unless richness is quite high (say, at least over 5%, preferably over 15%). In terms of achieving classifier performance, the Relevance selection method is also a poor choice, except for very low richness topics (below 1%). For classifier performance itself, the Uncertainty method is the clear winner amongst the methods considered, outperforming even the unimplementable Balanced method for larger training set sizes.

As mentioned above, though, what we generally care about in e-discovery is not classifier quality itself, but the cost to achieve a certain recall (that is, find a certain finite number of relevant documents). In protocols where every produced document must be reviewed, selection methods that pick more responsive documents for training are preferable (other things being equal), because they leave fewer documents for review. I hope to examine the efficiency of the different models under this cost model in a later post; for now, we can note that under such a model, the Random method will look even worse, the Relevance method somewhat better, since they produce low and high (respectively) richness in their training sets.

Finally, note that the Random, Uncertainty, and Relevance methods are implemented here in their simplest and "purest" versions. For the Uncertainty method, for instance, it would be possible (and perhaps preferable) to draw a sample of documents from around the point of maximum predictive uncertainty, rather than exactly the 100 most uncertain documents. Or one could simply combine all three methods, drawing a certain number of training documents by each method, and perhaps adjusting the mix as the training went on. Each method might then supply some of the wants of the other. For now, though, it appears that (at least for classifier quality), Uncertainty selection is the method to beat.

Thanks to Gord Cormack, Maura Grossman, Dave Lewis, Jeremy Pickens, and Herb Roitblat for comments which helped improve this post. The opinions expressed here, of course, are my own, and not theirs.

5 Responses to “Random vs active selection of training examples in e-discovery”

  1. Ethan A says:

    William, thanks much for the timely thoughts and work on classifier training input choice - itself more light on Lewis & Gale's work. (everything old is new again...)

    One criticism of the Cormack/Grossman work on this topic is that the datasets' annotations were in many cases overwhelmingly derived by prior classifier operation. See Herb Roitblat's cogent concerns under 'Internal Invalidity' at http://orcatec.com/2014/05/27/the-science-of-comparing-learning-protocols/ . In particular, all unannotated records were assumed non-responsive. (Also to reiterate Herb's point, using this 'tainted' data doesn't disprove the conclusions, it may well still be that 'un-tainted' data would also support them.)

    Your choice of RCV1v2, a fully (independently) annotated dataset, would seem to moot this criticism.

    -Was this also a reason for chosing the dataset? If not, doesn't it serve this purpose anyway?

  2. william says:


    Yes, it is unclear from Gord and Maura's paper how the gold-standard data was derived, at least for the legal cases. For the RCV1v2 collection, all topic labels were manually applied, and all documents were considered for labelling. This is one of the advantages of this data set, and a reason why it is widely used for this sort of text analytics work.


  3. gvc says:


    Herb's criticism is incorrect. Please read the paper to see that the passage he quotes pertains to how the system is trained, not to how it is evaluated. The gold standard used to evaluate the results is a statistical sample of the entire dataset: For the four TREC topics it was created by TREC, for the four legal matters it was created by the legal team.


  4. Ethan A says:

    Gordon, that is indeed the case - thanks for emphasizing the distinction for anyone reading the comments. (
    "In contrast to the training standard, the gold standard represents..." does indeed make that rather clear.)

    Any comment on the deviations between William's measured performance with (a Vowpal implementation of) uncertainty scores and your reported performance with (Sofia-ML-implemented) uncertainty scores is also most welcome?. Understood that William's emphasis on classifier effectiveness means he's measuring to max F1 and your paper is focusing on reported Recall (and Recall@.75)

  5. gvc says:


    William studied a different problem, in which the maximum number of training documents was limited to 1,500. Since the number of relevant documents far exceeds this number (1% of the collection is 8,000 documents, and William excluded all topics with less than 1% prevalence) the relevance-sampled classifier is not employed in anything like the CAL scenario we studied, where review and training continue in lock-step until substantially all relevant documents have been identified -- and used to train the classifier.

    William really compared two implementations of SAL, one using uncertainty sampling and one using relevance sampling. For SAL, uncertainty sampling is generally better.

    Although I have not done a side-by-side comparison, I would expect Vowpal Wabbit and Sofia-ML to give comparable results. The underlying methods are similar.


Leave a Reply