Readers of Ralph Losey's blog will know that he is an advocate of what he calls "multimodal" search in e-discovery; that is, a search methodology that involves a mix of human-directed keyword search, human-machine blended concept search, and machine-directed text classification (or predictive coding, in the e-discovery jargon). Meanwhile, he deprecates the alternative model of machine-driven review, in which only text classification technology is employed, and the human's sole function is to code machine-selected documents as responsive or non-responsive. The difference between the two modes can be seen most clearly in the creation of the seed set (that is, the initial training set created to bootstrap the text classification process). In multimodal review, the seed set might be taken from (a sample of) the results of active search by the human reviewer; in machine-driven review, the seed set is formed by a random sample from the collection.
Late last year, Ralph chronicled an experiment in multimodal review on the Enron collection. Currently, he is chronicling a more fictionalized (but still empirically-based) account of machine-driven review (colourfully titled "Journey into the Borg Hive"). In a recent post in the latter series, he raises what is perhaps the most important point in the dispute between the multimodal and machine-driven approaches: what happens when your collection has a low prevalence of responsive documents---when only half a percent of your documents are responsive, rather than ten percent? Prevalence is partly determined by the topic, but (as Ralph notes) is even more strongly driven by collection and culling decisions. The lighter the culling of the collection (by keyword filters, for instance), the lower the prevalence. But crude culling (and keyword culling is crude) risks excluding relevant material, and so should be avoided, especially now that automated text analysis technologies allow (in principle) arbitrarily large collections to be searched efficiently. Low-prevalence collections will therefore become increasingly frequent in practice.
What difference does collection prevalence make to the choice between multimodal and machine-driven review? Well, text classification relies on having both positive and negative examples. The less balanced the example set is, the more examples are required to achieve a given level of effectiveness in the classifier. Indeed, a rough rule of thumb is that classifier effectiveness scales with the number of positive examples, rather than with the number of examples in total, at least where examples are selected by pure random sampling.
The need for positive examples creates a bootstrapping problem for a machine-driven classification approach when collection prevalence is low. If responsive documents are rare, then a large initial random sample (and therefore extensive review time) is needed to locate them. It would likely be much faster to find an adequate number of relevant documents by a human-directed (keyword or concept) search. Moreover, the fewer relevant documents there are in the initial sample set, the slower learning is likely to be, even with active learning; again, it might be more efficient to have the human actively help locate relevant documents in the early stages of the search, and delay moving into machine-driven mode until the review process is well underway.
How serious is the effect of positive-example deprivation in the seed set on the efficiency of text classification? The impact on seed-set creation is easy to calculate; it is a simple question of sampling. Say a minimum of ten responsive documents is required to get the text classifier going; if one in a thousand documents in the collection is responsive, then on average a sample of ten thousand documents is required to make an adequate seed set (and more if you want a reasonable degree of confidence that at least ten positive examples will be found). With a low enough prevalence, a randomly-sampled seed set may simply be a non-starter.
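The arithmetic here is easy to check with a short script. The sketch below adopts the figures from the example above (ten required positives, one-in-a-thousand prevalence) and adds an illustrative 95% confidence target of my own; it computes both the expected sample size and the larger sample needed for 95% confidence of at least ten positives, using the exact binomial tail.

```python
from math import comb

def prob_at_least(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of the lower tail."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

PREVALENCE = 0.001    # one responsive document per thousand
MIN_POSITIVES = 10    # minimum responsive documents needed in the seed set
TARGET_CONF = 0.95    # illustrative confidence target (my assumption)

# On average, a sample of MIN_POSITIVES / PREVALENCE documents is required:
expected_n = int(MIN_POSITIVES / PREVALENCE)   # 10,000 documents

# For 95% confidence of at least ten positives, the sample must be larger;
# search upwards in steps of 100 documents for the smallest sufficient size:
n = expected_n
while prob_at_least(n, PREVALENCE, MIN_POSITIVES) < TARGET_CONF:
    n += 100

print(f"expected sample size: {expected_n}")
print(f"sample size for {TARGET_CONF:.0%} confidence: {n}")
```

The confidence requirement is not a small correction: the 95%-confidence sample comes out at roughly fifteen to sixteen thousand documents, half as large again as the ten-thousand-document expectation.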
Suppose, though, that we have a more intermediate case, where prevalence, while low, is still sufficient to give a handful of responsive documents in a moderately-sized sample; Ralph's own example produces 10 relevant documents in a sample of 2,401, for a sample prevalence of just under half a percent. The effect of such a low-prevalence seed set on speed of learning is complex, particularly when active learning is being used, as is the case with most classification systems these days. (In active learning, the computer selects for coding at each training iteration those documents that it is least certain about. This is generally much more efficient than selecting documents at random, but still requires an initial seed set to get started.) Active learning should in principle compensate for low prevalence, by exploring ambiguous regions of the document space. How strong this effect will be in practice is an empirical question.
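The active-learning loop just described can be sketched in a few lines. This is a toy simulation under invented assumptions (one-dimensional document scores, a crude midpoint-threshold classifier, roughly 0.5% prevalence, batches of 20), not a description of any real predictive-coding product:

```python
import random

rng = random.Random(42)

# Toy collection: each "document" is a single score; non-responsive
# documents (label 0) cluster low, the few responsive ones (label 1)
# cluster high, at roughly 0.5% prevalence.
collection = [(rng.gauss(0.25, 0.1), 0) for _ in range(995)] + \
             [(rng.gauss(0.75, 0.1), 1) for _ in range(5)]
rng.shuffle(collection)

def fit_threshold(labeled):
    """A deliberately crude classifier: the midpoint of the class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    if not pos or not neg:
        return None          # cannot train without examples of both classes
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

# Seed set: a random sample of 100, "coded" by looking up the true label.
labeled, pool = collection[:100], collection[100:]

for _ in range(5):                           # five training iterations
    threshold = fit_threshold(labeled)
    if threshold is None:
        # No positives found yet: keep drawing in (shuffled) pool order.
        batch, pool = pool[:20], pool[20:]
    else:
        # Uncertainty sampling: code the 20 documents whose scores lie
        # closest to the current decision threshold.
        pool.sort(key=lambda d: abs(d[0] - threshold))
        batch, pool = pool[:20], pool[20:]
    labeled = labeled + batch
```

Note the fallback branch: at this prevalence, a 100-document seed set will often contain no positives at all, and the loop can do nothing better than continue random sampling; this is exactly the bootstrapping problem discussed above. Whether, and how quickly, the uncertainty-sampling branch then compensates once positives appear is the empirical question the text raises.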
What, then, is the argument in favour of solely computer-driven review, even in the face of low-prevalence tasks (apart from the fact that it is much simpler to build a user interface for batch annotation than for active search)? The main argument is the assertion that relying on human judgment to select seed documents will bias the results towards the human's initial conception of relevance; that responsive documents similar to those found in the initial keyword search are more likely to appear in the final production than responsive documents different from the seed set. Certainly, human choice in creating the seed set will have some impact on final classifier output. On the other hand, in the subsequent training iterations, active learning will select documents that are unlike the initial seed set, and so over time should drive the classifier away from dependence on the human's initial judgment. Again, the strength of this effect is an empirical question.
So, we have two empirical questions about the comparative impact of random and human-created seed sets for active learning on low-prevalence topics. Is active learning on a seed set with only a handful of responsive documents able to close the effectiveness gap with a higher-prevalence seed set populated by human search, and if so, how quickly? And how strongly does a human-selected seed set bias final classification results towards the sub-topic represented in that seed set?