Does automatic text classification work for low-prevalence topics?

Readers of Ralph Losey's blog will know that he is an advocate of what he calls "multimodal" search in e-discovery; that is, a search methodology that involves a mix of human-directed keyword search, human-machine blended concept search, and machine-directed text classification (or predictive coding, in the e-discovery jargon). Meanwhile, he deprecates the alternative model of machine-driven review, in which only text classification technology is employed, and the human's sole function is to code machine-selected documents as responsive or non-responsive. The difference between the two modes can be seen most clearly in the creation of the seed set (that is, the initial training set created to bootstrap the text classification process). In multimodal review, the seed set might be taken from (a sample of) the results of active search by the human reviewer; in machine-driven review, the seed set is formed by a random sample from the collection.

Late last year, Ralph journalized an experiment in multimodal review on the Enron collection. Currently, he is journalizing a more fictionalized (but still empirically based) account of machine-driven review (colourfully titled "Journey into the Borg Hive"). In a recent post in the latter series, he raises what is perhaps the most important point in the dispute between the multimodal and machine-driven approaches: what happens when your collection has a low prevalence of responsive documents---when only half a percent of your documents are responsive, rather than ten percent? Prevalence is partly determined by the topic, but (as Ralph notes) is even more strongly driven by collection and culling decisions. The lighter the culling of the collection (by keyword filters, for instance), the lower the prevalence. But crude culling (and keyword culling counts as crude) risks excluding relevant material, and so should be avoided, especially now that automated text analysis technologies allow (in principle) arbitrarily large collections to be efficiently searched. Therefore, low-prevalence collections will become increasingly frequent in practice.

What difference does collection prevalence make to the choice between multimodal and computer-driven review? Well, text classification relies on having both positive and negative examples. The less balanced the example set is, the more examples you require to achieve a given level of effectiveness in the classifier. Indeed, a rough rule of thumb is that the cost of training is driven more by the number of positive examples required than by the total number of examples, at least where examples are selected by pure random sampling.

The need for positive examples creates a bootstrapping problem for a machine-driven classification approach when collection prevalence is low. If responsive documents are rare, then you'll need a large initial random sample (and therefore extensive review time) to locate them. It would likely be much faster to find an adequate number of relevant documents by a human-directed (keyword or concept) search. Moreover, learning is likely to be slower the fewer relevant documents there are in the initial sample set, even with active learning; again, it might be more efficient to have the human actively help locate relevant documents in the early stages of the search, and delay moving into computer-driven mode until the review process is well underway.

How serious is the effect of positive-example seed-set deprivation on the efficiency of text classification? The impact on seed-set creation is easy to calculate; it's a simple question of sampling. Let's say that a minimum of ten responsive documents are required to get the text classifier going; if one in a thousand documents in the collection is responsive, then on average a sample of ten thousand documents is required to make an adequate seed set (more if you want a reasonable degree of confidence that at least ten positive examples will be found). With a low enough prevalence, a randomly-sampled seed set may simply be a non-starter.
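The sampling arithmetic above can be sketched in a few lines of Python (the 0.1% prevalence and the ten-positive threshold are the illustrative figures from the paragraph; the 95% confidence level is my own choice of "reasonable degree of confidence"):

```python
from math import comb

def prob_at_least_k(n, p, k):
    """P(at least k positives in a simple random sample of n, at prevalence p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

prevalence = 0.001   # one responsive document per thousand
needed = 10          # minimum positive examples for the seed set

# On average, needed / prevalence = 10,000 documents must be sampled.
n = round(needed / prevalence)

# But an average is not a guarantee: at n = 10,000 there is only a
# slightly-better-than-even chance of drawing at least 10 positives.
# Grow the sample until we have 95% confidence of doing so.
while prob_at_least_k(n, prevalence, needed) < 0.95:
    n += 100

print(n)  # roughly 15,000 to 16,000 documents
```

Note that the required sample is more than half again as large as the naive average suggests; the lower the prevalence, the worse this gap gets in absolute terms.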

Say, though, we have a more intermediate case, where prevalence, while low, is still sufficient to give a handful of responsive documents in a moderately-sized sample; Ralph's own example produces 10 relevant documents in a sample of 2,401, for a sample prevalence of just under half a percent. The effect of such a low-prevalence seed set on speed of learning is complex, particularly when active learning is being used, as is the case with most classification systems these days. (In active learning, the computer selects for coding at each training iteration those documents that it is least certain about. This is generally much more efficient than selecting documents at random, but still requires an initial seed set to get started.) Active learning should in principle compensate for low prevalence, by exploring ambiguous regions of the document space. How strong this effect will be in practice is an empirical question.
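To make the active-learning loop concrete, here is a minimal pool-based uncertainty-sampling sketch (NumPy only; the two-feature representation, the 0.5% prevalence, the logistic-regression learner, and the seed and batch sizes are all invented for illustration, not a description of any commercial system):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic low-prevalence pool: ~0.5% "responsive" documents, with two
# features that separate the classes imperfectly.
n_pool = 20_000
labels = (rng.random(n_pool) < 0.005).astype(int)
X = rng.normal(size=(n_pool, 2)) + labels[:, None] * 2.0

def train(X, y, steps=500, lr=0.1):
    """Logistic regression fitted by plain gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # add a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1 / (1 + np.exp(-Xb @ w))

# Seed set: a simple random sample, coded by the (simulated) reviewer.
labelled = set(rng.choice(n_pool, size=500, replace=False))

for _ in range(5):  # five training iterations
    idx = np.array(sorted(labelled))
    w = train(X[idx], labels[idx])
    # Uncertainty sampling: code the pool documents whose predicted
    # probability of responsiveness is closest to 0.5.
    uncertainty = np.abs(predict(w, X) - 0.5)
    uncertainty[idx] = np.inf        # never re-select coded documents
    labelled.update(np.argsort(uncertainty)[:100].tolist())
```

Even in a sketch like this, the bootstrapping problem is visible: a 500-document random seed at 0.5% prevalence yields only a couple of positive examples on average, so the early rounds of uncertainty sampling are working from a very thin signal.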

What, then, is the argument in favour of solely computer-driven review, even in the face of low-prevalence tasks (apart from the fact that it is much simpler to build a user interface for batch annotation than for active search)? The main argument is the assertion that relying on human judgment to select seed documents will bias the results towards the human's initial conception of relevance; that responsive documents similar to those found in the initial keyword search are more likely to appear in the final production than responsive documents different from the seed set. Certainly, human choice in creating the seed set will have some impact on final classifier output. On the other hand, in the subsequent training iterations, active learning will select documents that are unlike the initial seed set, and so over time should drive the classifier away from dependence on the human's initial judgment. Again, the strength of this effect is an empirical question.

So, we have two empirical questions about the comparative impact of random and human-created seed sets for active learning on low-prevalence topics. Is active learning on a seed set with only a handful of responsive documents able to close the effectiveness gap with a higher-prevalence seed set populated by human search, and if so, how quickly? And how strongly does a human-selected seed set bias final classification results towards the sub-topic represented in that seed set?

2 Responses to “Does automatic text classification work for low-prevalence topics?”

  1. Ralph Losey says:

    In my experience the commercial software I have been using, Inview, works quite well to overcome the difficulties you describe, including initial bias (a term I don't like because it is contra the whole idea of SME), and the low number of positive examples as compared to high number of irrelevant documents. It takes several rounds of training. Sometimes a round can be quite fast and only involve a few corrections. Other times you need to feed in many corrections.

    To be successful at this you have to develop a knack for identifying the types of documents that may impede or further the active learning process. That is often case specific, but can also be general, at least for the software you are using. For instance, for the software I use I have learned to remove very large text files from the initial training. I have found they are too confusing to the classifier. I use other methods to search these files and/or add them in at the end of a training process.

    There are other types of files that should be removed as well, plus still others, usually case specific, that are especially helpful to train on. The whole process becomes more art than science, and has to do with learning the peculiarities of the software in general, and on particular projects.

    I guess what I am saying is that effective active machine learning is a two-way process requiring a searcher to pay close attention to the impact of their actions on the automated classifications.

    I would be interested to hear your reaction to this, and that of other readers of your blog. Are others having the same or similar experiences? Has there been any research on this? Literature on the subject? I am a lawyer, not a scientist, so please do not assume I know the basic writings available in the field. Thanks.

  2. william says:


    Hi! Yes, I think that the sort of human-directed search facilities that Inview offers are an important part of the solution to the low-prevalence case. The comparative experiment you ran on the EDRM dataset was an interesting exploration of that. What would have happened in that situation if you had been stuck in the pure-Borg approach, where even the seed set had to be a simple random sample? Prevalence there was, what, one in a thousand? With some Borg approaches you'd need to sample 60,000-odd documents just to get your control set!

    I'm interested too in your comment on the importance of the trainer knowing what documents do and do not make good training examples. In principle, the classifier should be able to manage this automatically -- but apparently not! For instance, if long documents _always_ cause problems, then the system should be pre-configured not to train on them. Or cross-validation experiments should be able to determine that certain classes of training documents are not helpful -- indeed, in principle it seems to me that the classifier should be able to note this directly, at least if given the correct features. Anyway, as always, interesting material for future research!
