Total review cost of training selection methods

My previous post described in some detail the conditions of finite population annotation that apply to e-discovery. To summarize, what we care about (or at least should care about) is not maximizing classifier accuracy in itself, but minimizing the total cost of achieving a target level of recall. The predominant cost in the review stage is that of having human experts train the classifier, and of having human reviewers review the documents that the classifier predicts as responsive. Each relevant document found in training is one fewer that must be looked at in review. Therefore, training example selection methods such as relevance selection that prioritize relevant documents are likely to have a lower total cost than the abstract measure of classifier effectiveness might suggest.

In this post, I am going to evaluate three selection methods (random, uncertainty, and relevance) by the total review cost required to achieve a target recall. For simplicity, I'll assume that every document costs the same amount to label, whether in training or review. I'll set the target recall level at 80%, and refer to the cost metric as cost for recall 80 (cfr80). (Cormack and Grossman refer to the same measure as "80% recall effort".) A run consists of a certain amount of training, followed by review to the depth that achieves 80% recall; the total cost of a run is the training set size plus the review depth. A sequence of (potential) training iterations defines a set of runs, each of which switches from training to review at a given iteration, and each of which has its own cost. The run with the lowest cost is the one with the optimal point of switching from training to review (that is, the optimal training set size). The costs of the runs in a set can be plotted as a curve against training set size; the minimum cost is then the lowest point on this curve. One could also continue training until target recall is achieved in the training set alone, as in continuous training; this strategy is particularly attractive for the relevance selection method. An interesting question, then, is whether the cfr80 achieved by relevance selection under continuous training is the minimum cfr80 for its run set.
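
To make the cost bookkeeping concrete, here is a minimal sketch (in Python, and not the code used for these experiments) of how a run set's costs and its minimum cfr80 might be computed. The inputs are hypothetical: a list of training set sizes at each candidate switching point, and, for each, the review depth needed to reach 80% recall under the ranking produced at that point.

    # Hypothetical inputs:
    #   training_sizes[i]     - training set size if we stop training at iteration i
    #   review_depth_at_80[i] - documents that must then be reviewed, in ranked order,
    #                           to bring the run as a whole to 80% recall
    def run_costs(training_sizes, review_depth_at_80):
        """Total cost (training + review) of each run in the run set."""
        return [t + r for t, r in zip(training_sizes, review_depth_at_80)]

    def minimum_cfr80(training_sizes, review_depth_at_80):
        """Lowest total cost over the run set, and the training set size that achieves it."""
        costs = run_costs(training_sizes, review_depth_at_80)
        best = min(range(len(costs)), key=costs.__getitem__)
        return costs[best], training_sizes[best]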

Each training sequence begins with a randomly-sampled seed set. For the uncertainty and relevance selection methods, random sampling continues until at least one relevant document enters the training set, at which point we switch to the appropriate selection method (for random selection, of course, random sampling continues throughout). As in previous experiments, the RCV1v2 dataset with its 103 topics is used. As implied by the finite population scenario, training and target sets are not held distinct; instead, I take 700,000 of the 804,414 RCV1v2 documents as the single, undifferentiated collection. (I'm holding the remaining 104,414 documents in reserve for experiments on sampling and estimation.) The classifier is Vowpal Wabbit 7.4, with logistic loss, a learning rate of 2.0, and a single pass. Training iterations proceed by steps of 100 up to a training set size of 2,000; then by steps of 200 up to 5,000; then by steps of 500 up to 20,000; and finally by steps of 2,000 to the end of the collection.
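
For concreteness, here is a small Python sketch of the iteration schedule just described. This is my reading of the step sizes; the seed set and the exact checkpoint boundaries in the actual experiments may differ slightly.

    # Sketch of the training-iteration schedule described above (step sizes taken from
    # the text; the seed set and exact boundaries are an assumption on my part).
    def training_checkpoints(collection_size=700_000):
        points, size = [], 0
        for step, upper in [(100, 2_000), (200, 5_000), (500, 20_000), (2_000, collection_size)]:
            while size + step <= upper:
                size += step
                points.append(size)
        return points

    # The classifier at each checkpoint was Vowpal Wabbit 7.4; an invocation along the
    # lines of "vw --loss_function logistic --learning_rate 2.0 --passes 1" (exact flags
    # assumed, not quoted from the post) matches the configuration described above.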

Let's begin with the cost curves for a particular topic, Topic E41.

[Figure: Relative cost for 80% recall curve for Topic E41]

There's a fair bit of information in this figure, so we'll take things bit by bit. The title provides the richness of the topic, along with the number of relevant documents that need to be located to achieve our target of 80% recall. (This target number is richness * 700,000 * 0.8, with 700,000 being the collection size.) The x axis states the training set size as a multiple of the target size, to facilitate comparison between topics of different richnesses. The y axis gives the cost to achieve 80% recall, again as a multiple of target size. (Note the log scale on the y axis, giving greater clarity to the curve minima.) The absolute minimum of relative cost for recall is 1.0, which would occur if all the documents seen, both in training and review, were relevant.
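
As a worked example of these units (values approximate, and inferred from the numbers quoted for Topic E41 below, so treat them as illustrative only):

    # Approximate worked example for Topic E41; the target size of 11,863 is quoted
    # below, and richness is inferred from it.
    collection_size = 700_000
    target_size = 11_863                              # relevant docs needed for 80% recall
    richness = target_size / (0.8 * collection_size)  # roughly 2.1%
    # A total cost (training plus review) of about 16,100 documents would be plotted
    # as a relative cost of about 16,100 / 11,863, or 1.36 (the uncertainty minimum
    # quoted below).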

The green, blue, and red curves provide the relative cost for recall of the uncertainty, random, and relevance selection methods, given the training set size on the x axis. Lower values on these curves are better (indicating less total review cost). Observe that the (green) curve for the uncertainty method falls sharply over the initial training runs, indicating that the quality of the predictive ranking rapidly improves, and so review depth quickly decreases. The improvement slows over time, however, and after a relative training set size of around 0.5 (that is, a training set of around 11,863 * 0.5 ≈ 6,000), the total cost curve begins to rise. This does not (necessarily) mean that the quality of the predictive ranking is actually getting worse; rather, it indicates that the cost in training exceeds the saving in review. In contrast, the cost curve for the relevance selection method decreases much more slowly; until relative training set size is almost 1.0 (≈ 11,000), relevance selection does not consistently beat random selection. After that, though, it drops quickly, and goes below the uncertainty selection curve.

The trajectories of the cost curves for the different methods illustrate how the methods behave; what we really care about, however, is the minimum point in each curve, as this gives the actual total review cost that would be achieved (assuming, as we do throughout this post, perfect knowledge). These minimum costs are marked as annotations along the y axis. Here, we can see that the minimum cfr80 values of the uncertainty and relevance methods are quite similar (the actual minimum relative cfr80 values are 1.36 and 1.41 respectively), and certainly both are much lower than the minimum for random selection (4.08). The protocol that achieves this minimum for uncertainty selection is to spend around 40% of the effort in training, and 60% in review; for relevance selection, to spend most or all of the effort in training; but total cost is much the same either way.

A question posed in my previous post was whether the relevance selection method did indeed achieve its optimal cost by following a continuous training protocol (that is, continuing to train until target recall is achieved in the training set, without a separate review step). This question can be answered (somewhat indirectly) by considering the line at which training set size and cfr80 are the same (shown as the dotted grey curve in the figure, which appears curved because of the log y axis). When the cost curve of a method touches or crosses this line, the training set by itself achieves target recall. Here, the relevance cost curve touches the train==cost line at or near the cost curve's minimum, indicating that continuous training achieves (near-)minimum cost on this set of runs.
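
In terms of the earlier sketch, the continuous training runs are simply those whose review depth is zero; a hypothetical check, using the same inputs as before:

    # Switch points at which the training set alone reaches 80% recall: here total cost
    # equals training set size, i.e. the cost curve meets the dotted train==cost line.
    def continuous_training_points(training_sizes, review_depth_at_80):
        return [t for t, r in zip(training_sizes, review_depth_at_80) if r == 0]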

Having understood the figure format for an example topic, let's look at a larger set of topics.

[Figure: Relative cost for 80% recall curves for select topics]

The above figure shows the cost curves for 14 randomly selected topics. (Selection was as follows: 3 randomly selected from topics with richness < 0.5% [top row]; 3 from topics with richness between 0.5% and 2% [second row]; 3 between 2% and 5% [third row]; 3 between 5% and 15% [fourth row]; and two above 15% richness [bottom row].) Cost curves for all 103 topics can be found here.

Considering the cost curves themselves, the general pattern is for uncertainty and relevance selection to track each other very closely for topics with richness below 1%. For richness above 1%, the uncertainty cost curve tends to fall faster and earlier than the relevance curve, with this tendency becoming more pronounced as richness increases; indeed, for topics with richness above 5%, relevance selection frequently leads to initially rising cost curves.

Considering minimum review cost, however, the difference in effort between relevance and uncertainty selection is only slight; only 3 topics show a difference of more than 13% one way or the other. That is, relevance and uncertainty selection have different trajectories, increasingly so for higher richness, calling for different protocols (for uncertainty, limited training followed by review; for relevance, continuous training throughout), but end up with very similar total costs.

Additionally, there are only three topics, all low richness ones (G159 [0.00% rich], E61 [0.05% rich], and C16 [0.24% rich]), in which the optimal stopping point for relevance selection is well before (by more than 10%) the stopping point for continuous training (that is, the point at which target recall is achieved by the training set alone). In other words, continuous training is indeed the correct protocol to use for relevance selection. (Moreover, particularly for higher prevalence topics, using relevance selection without following continuous training through to the end can lead to very poor results.) For around a third of the topics, continuous training also gives total cost within 10% of optimal for uncertainty selection, but these are almost exclusively low richness topics (richness < 1%), where, as previously observed, the uncertainty and relevance cost curves are very similar. (It may well be that the two methods in fact uncover very much the same relevant documents during training, though I have not yet verified this.)

Meanwhile, it is clear from these graphs that random selection is a very expensive method for topics with low richness. It is only for topics with richness above 2.5% that one can be reasonably confident that random selection will not lead to more than twice the total review cost of uncertainty selection, and only for topics with richness above 10% that the additional cost is consistently less than 20%. There are benefits to random selection beyond its total cost for a target recall: its simplicity, its lack of bias, and the ability (with care) to use training data for estimation and testing purposes. But I would suggest that once richness goes substantially below 5%, these benefits are outweighed by the additional cost.

Finally, let's summarize these minimum total cost results, grouped by topic richness. Again, total cost is expressed as a multiple of the number of relevant documents needed to achieve 80% recall:

Richness     Random    Uncertainty  Relevance
< 0.5%      2820.27        3003.53    3006.39
0.5% -- 2%    10.26           2.24       2.26
2% -- 5%       3.14           1.64       1.66
5% -- 15%      1.71           1.26       1.28
> 15%          1.06           1.04       1.03

Clearly, all methods struggle with extremely low richness topics. (For such topics, we really require a better seed set selection method than random sampling, a question that I hope to return to in a later post.) There is no discernible difference in total cost between relevance and uncertainty selection. And encouragingly, the total review cost for these methods is only a little above 2 times the number of relevant documents for low richness topics, and less for higher ones (albeit on what is, in e-discovery terms, a very clean collection with relatively straightforward topics). Random selection does almost as well as the active methods for high richness topics, but becomes increasingly inefficient as topic richness falls.

Although there is little difference in minimum total cost between uncertainty and relevance selection, the trajectories by which the two methods achieve these similar total costs are quite different. As mentioned in my previous post, the continuous training protocol that relevance selection supports has a simpler stopping decision than uncertainty selection (one-, rather than two-dimensional). On the other hand, the generally flat minimum of uncertainty selection, combined with the gentle rise in total cost once this minimum is passed, suggests that the first stopping decision in uncertainty selection (that is, when to stop training) may not be too sensitive, and that uncertainty selection may be more flexible in allowing the separation of training from review. The cost curves also suggest the attractiveness of a composite strategy, for instance uncertainty selection at the start, moving to relevance selection later on (perhaps joined with some random selection for high-richness topics).
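
To illustrate what such a composite strategy might look like in code, here is a minimal, hypothetical sketch (not something evaluated in these experiments): uncertainty selection early in training, switching to relevance selection once the training set has reached a chosen size. The switch point and batch size are tuning choices, not tested values.

    # Hypothetical composite selection policy: uncertainty selection first, relevance later.
    # `scores` maps document id -> the classifier's predicted probability of relevance
    # for the still-unlabelled documents.
    def select_batch(scores, training_size, switch_at, batch_size):
        if training_size < switch_at:
            key = lambda d: abs(scores[d] - 0.5)   # uncertainty: closest to the decision boundary
        else:
            key = lambda d: -scores[d]             # relevance: highest predicted probability
        return sorted(scores, key=key)[:batch_size]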

It is important to note, however, that the above discussion paints an overly clean picture, because of the assumption of perfect knowledge of cost for recall at any point, and of the trajectory of that cost with further training. In reality, we lack such perfect knowledge, and have to rely on estimation methods, rules of thumb, and experience. Actual decisions about strategy and stopping points are made in a fog of uncertainty; and this fog becomes thicker as richness drops and sampling methods lose traction. This, too, will be a subject of future posts.


Postscript

In the above table showing average costs by different selection methods for topics of different richness bands, the random method is shown as cheaper on average than the other two methods for topics with richness below 0.5%. In fact, random is considerably more expensive on almost all such topics. However, on the two topics with the lowest richness, random is slightly cheaper; and the costs of these two topics are so high that they outweigh all the others in the average. Now, these two topics have 5 and 31 relevant documents respectively out of the population of 700,000; that is, they have richness below 0.01%. Why the random method does better on these two topics is unclear; it may simply be due to chance. In any case, such extremely low richness topics are doubtful choices for predictive coding altogether, and certainly very poor choices for random seed sets. If we place topics with a richness lower than 0.05% (1 document in 2,000) in a separate category, we get the following table:

Richness         Random    Uncertainty  Relevance
< 0.05%        11349.83       12324.16   12312.65
0.05% -- 0.5%     78.62          11.41      11.32
0.5% -- 2%        10.26           2.24       2.26
2% -- 5%           3.14           1.64       1.66
5% -- 15%          1.71           1.26       1.28
> 15%              1.06           1.04       1.03

Thanks to Ralph Losey for pointing out this anomaly (and urging me to do something about it!).

9 Responses to “Total review cost of training selection methods”

  1. gvc says:

    Thanks, William, for confirming the results of our SIGIR paper (http://dl.acm.org/citation.cfm?doid=2600428.2609601): there is little to recommend random training; and there is little to choose (in terms of hitting a specific target recall, given clairvoyance to pick the best training-set size) between continuous active learning with relevance selection, and simple active learning with uncertainty selection.

    We have also confirmed our results on the RCV1-v2 collection, using a single synthetic seed document (derived from the topic description supplied with the collection) and a different learning algorithm (http://svmlight.joachims.org).

    We prefer to represent the results as gain curves, which you can find at http://cormack.uwaterloo.ca/cormack/rcv1results/. They tell the same story as your curves, but track recall as a function of total effort, instead of cfr80 (80% recall effort) as a function of training-set size. The main curves on our site are chosen (for comparison to our SIGIR paper) to minimize cfr75, but for the purpose of comparison to your results, we also plotted curves chosen to minimize cfr80: http://cormack.uwaterloo.ca/cormack/rcv1results/recall80effort.pdf

    We note that our use of a synthetic seed document appears to overcome the problem you noted with ultra-low prevalence topics, which appears to be an artifact of your use of random seeding, even for the active learning methods. The same problem was overcome in our SIGIR paper by the use of a simplistic "seed query."

  2. […] Read the complete article at: Total review cost of training selection methods […]

  3. I am trying hard not to be pulled into this debate and to remain neutral on different workflows. I applaud your effort to look at collections of varying richness, which mimics the real world. Some users of Predictive Coding encounter collections which are greater in richness than the .3% to 1.5% that are the predominant richness levels in the SIGIR paper. But speaking as an attorney, we have an obligation to do "discovery". By aiming, which is what seeding does, don't we first lose visibility into the richness of the collection, which may cause a user to lose sight of the fact that random could be effective on that collection? And secondly, don't we make it much harder to find anything else if we do limited random training AFTER we have pulled the cream, or most of what we know, out of the collection? The problem as I see it is that we are only looking for what we know when we seed, but we don't know what we don't know; and it is a lawyer's job to do investigation, a "reasonable inquiry". Further, I know of no studies which measure the differences in the types of documents found. Everyone is looking at hit counts to see how well we did in recall when compared to the baseline in the studies. If we find lots and lots of similar documents, we do well in recall. But as a lay person, it strikes me as intuitive that finding different types of information is a key part of discovery. That approach might give both sides of the litigation comfort that an "honest" and effective search is being done.

    I know this is a difficult problem to solve. Low richness may be endemic in some cases today. But I personally have hope that we will figure that out with different approaches, such as rolling productions or Early Case Assessment training runs where richness might be higher, different culling approaches such as visual clustering to increase deduplication and eliminate much higher percentages of non-responsive documents, or simply a high richness case like we had in Global Aerospace, where random training was highly effective and required only a 5,000 document training set to reveal 130,000 documents out of a collection of 2 million records. That was a highly efficient result and it happened in a real case as opposed to a laboratory. So while not a study, as part of the fabric of common law, which is where most of the experimenting happens for lawyers, it shows this approach can be a very effective, very efficient, and transparent solution to searching for responsive documents.

    I feel like I am writing a blog post here, but after returning last month from a trip to Greenwich, England, I am struck by how similar this debate is to the contest between astronomers and clock makers to solve the problem of longitude and steer ships across the oceans of the world. It took 50 years to figure out. With only a few participants such as yourself actively studying and publishing in this space, I am fearful that lawyers will blindly follow the first study they read and stop thinking about how it might make more sense to use all the tools at their disposal to experiment and improve their search. Keep up the good work. Personally I am trying to keep an open mind on this topic instead of saying we have the solution right now.

  4. william says:

    Karl,

    Hi! I agree that it is desirable that a predictive coding process is able to demonstrate that it has "good coverage of the information space". Think of the information space as a map, over which documents are distributed as points. (This is a reasonable model of the internal representation of many predictive coding systems, though the "map" is in thousands of dimensions.) Then you want to have training examples that provide good coverage of this map, to show that you have considered as much of the information space as possible. Simple random sampling gives you something of this effect, but not as well as you might think: clusters of similar documents will tend to be over-sampled, and even if the documents were evenly distributed through the information space, the random sample would tend to randomly bunch in places. Some forms of active learning are actually fairly good, albeit in an indirect way, at driving example selection away from parts of the space you "know about", and towards parts you "don't know about". Now, intuitively, I agree that relevance selection (or continuous active learning) may tend not to give you good coverage of the information space, though as you say there have not been empirical studies that demonstrate this one way or another (at least that I have seen).

    Vendors are becoming aware of the importance of diversification of training examples, and have started to introduce this feature into their products. Two that spring to mind are Relativity, with their "stratified sampling" technique (http://kcura.com/relativity/advice-blog/articletype/articleview/articleid/955/an-introduction-to-stratified-sampling-in-relativity-assisted-review) (I think "cluster sampling" would be a better term, though); and Catalyst, with their contextual diversity technology (http://www.catalystsecure.com/blog/2014/07/comparing-active-learning-to-random-sampling-using-zipfs-law-to-evaluate-which-is-more-effective-for-tar/). Note that such methods can also involve random sampling, just not simple random sampling; that is, we focus sampling effort on what we think will be "high value" parts of the collection, without adopting a fully deterministic selection method. In fact, this is akin to your own suggestions about using more intelligent clustering tools: one might cluster the collection, and then sample the clusters, rather than the raw documents.

    What I think would be very regrettable, however, is if practitioners became stuck on the idea of simple random sampling as the only defensible example selection method, and then concluded that collection richness must be kept high enough for random selection to be practicable. That would lead, I fear, to aggressive pre-review culling, performed without adequate statistical or other validation, and leading to a bias of "aiming towards what the lawyers already know about" much more serious than might be incurred by non-random selection methods within a predictive coding process run over an unculled (or at least more lightly culled) collection. (Indeed, I wonder whether the falling richness rates we seem to be encountering are due to an increasing, and welcome, reluctance amongst practitioners to perform this sort of arbitrary culling.)

    Finally, I whole-heartedly agree that predictive coding technology and process is still far from mature, and that it would be unwise for the community to think otherwise and simply "pick what seems best at the moment". The remedy against this, to my mind, is maintaining objective and rigorous validation practice (most particularly, through a certification sample of the production and discard sets).

  5. This is another fascinating read for those of us interested in this important subject. These studies are important and it is critical that we keep the variables to a minimum in order to draw reasonable conclusions about the data.

    In the real world, we use a mix of techniques to address the issues discussed above. Time will tell who has the right mix but here is how we approach things (in short form, without writing another blog post).

    1. I won't take a position on whether you cull in advance. What I can say is that when you receive a production, you get what you get. But anything else is for a different posting.

    2. We believe that you start by finding as many relevant documents as possible using any means at your disposal--from keyword search to witness interviews. Use all of these as initial training seeds.

    3. We then believe you let the review teams loose on the documents. There is no need to have a senior lawyer do the training. It is expensive, delays the process and gains little.

    4. We were CAL before CAL was cool (just riffing off an old country song). What I mean is that we feed reviewer judgments back into the system so it can keep learning. Training is review, review is training, as Dr. J would say.

    5. However we also worry about what you don't know. So, in every CAL batch we present a number of documents selected through an "uncertainty" process. The specific number in our mix is a trade secret.

    6. Rather than selecting those uncertainty documents randomly, we wrote a contextual diversity algorithm which analyzes all of the documents, identifies those which we know the least about (as in the reviewers have not yet tagged similar ones), and then chooses a representative sample from each group.

    7. By presenting the right mix of contextually diverse documents to the reviewers, we mix relevance feedback and uncertainty sampling as a key part of our process.

    8. Doing this eliminates or minimizes the "you don't know what you don't know" syndrome. We are actively sampling around uncertainty.

    9. Ultimately our goal dovetails with William's starting goal. We are trying to minimize review costs by making the review as efficient as possible. The mix of using judgment seeds rather than random to start, using a good mix of highly relevant docs for the review (since they need to look at them anyway), and supplementing that mix with uncertainty sampling seemed to us to be the best way to construct our system. The research seems to bear this out.

    10. Lastly, we measure results with a systematic sample, which means we sample the ranking from top to bottom. We felt this was at least a bit better than a final random sample. And every little bit helps reduce review costs.

    That's our thinking at the least. Thanks again.

  6. william says:

    John,

    Hi! A great summary of a great process. Of course what I'm describing in these experiments are the purest and simplest implementations of the distinct strategies; a mix of strategies and of human knowledge is likely to get you better results than any single strategy used in isolation. And I entirely agree that diversity is an important issue that my experiments do not directly address.

    William

  7. Joe Howie says:

    All discussions about text analysis ought to be tempered with the knowledge that in some collections significant percentages of documents do not have any related text or have only poor-quality text, e.g., documents saved as PDF from native applications, or scanned documents – sometimes well above 25-30%. Parties who base initial selections on keyword searches will miss collecting those documents. Recall as measured against what was handed off to the text analysis system may vary considerably from what true recall would have been had the non-text or poor-text documents been included and accurately evaluated.

    Even if no-text/poor-text documents are ingested into the review process, there's nothing there for text analysis to analyze. Having some metadata associated with the non-text/poor-text documents is far short of having text from the face of the documents.

    When visual classification is used during litigation, the review team can examine a few documents per cluster to determine the potential relevance of all the documents in a cluster. This is similar in some ways to the descriptions I’ve read of Catalyst’s contextual diversity algorithm and Relativity’s stratified sampling technique, except it is based on visual classifications of all documents, not on groupings from the subset of text-bearing documents.

  8. gvc says:

    Joe,

    You and John have commented in practically every edisco blog that your "visual clustering" tool will effectively find relevant documents, textual and non-textual alike.

    It is not clear to me from reading your marketing material what your process is or why it would work. I understand that you identify "glyphs" and do clustering, but that's not enough to code my documents.

    In contrast, I believe I understand the role of Catalyst's diversity sampling and Relativity's "stratified" sampling -- both are methods of identifying training documents for a learning algorithm. Perhaps yours is more similar to Relativity's, in which you select members of each cluster for training? Or do you manually review each cluster?

    Can you point me to a study that elaborates your process and validates your effectiveness claim?

    Gordon

  9. Joe Howie says:

    Here's a high-level look at the process: Documents are clustered based on visual similarity, not on an analysis of associated text. For example, if a Word document is later saved directly to PDF, and a copy is printed, scanned, and saved as TIF, visual classification will place all three copies in the same cluster whether or not the TIF and PDF copies have any text.
    Clustering can be scaled with hardware to process enterprise-size collections.

    These clusters are reviewed by the review team, with the largest clusters being reviewed first. Clusters that are clearly irrelevant can be discarded based on a knowledge of what is in the cluster, and documents in clusters that are clearly relevant can all be deemed relevant. Clusters that may contain some but not all relevant documents can be passed on to the review platform, where like documents can be batched to the same reviewers, speeding review and helping ensure consistency.

    The review process can begin as soon as documents have been collected and the clustering has begun. Relevance decisions are persistent and once made are applied to documents that are later added to the cluster.

    BeyondRecognition also has "Find" technology that provides search capability with operators based on absolute page coordinates (i.e., within a certain range of coordinates on a page) or on relative page coordinates (i.e., one range of coordinates with a specified position relative to another) as a way to further refine decision-making. Visual classification technology also has advantages beyond culling, e.g., redactions can be made based on the page locations of text values, or can be done zonally within clusters to redact even handwriting.

    If visual classification has been implemented as a core information governance tool, the organization may also have applied document-type labels to the clusters as a way to assist in applying retention schedules, setting user access permissions, detecting PII, and determining where the content should be stored, and those labels can be used when first gathering or collecting documents. For example, the various clusters that have been labelled as "invoices" will often not be remotely relevant to most litigation and need not be collected.

    Tying selection to ongoing information governance minimizes the ongoing e-discovery problem of continually reinventing the wheel. For example, if the files or documents of someone who is routinely provided copies of invoices are included in multiple litigation discovery collections, how many times does the company have to pay to have all those invoices processed and evaluated to determine that they are not relevant?
