Evaluating E-Discovery http://blog.codalism.com William Webber's E-Discovery Consulting Blog Wed, 14 Mar 2018 23:05:07 +0000 en-US hourly 1 https://wordpress.org/?v=6.1.1 Sampling with zero intent http://blog.codalism.com/index.php/sampling-with-zero-intent/ http://blog.codalism.com/index.php/sampling-with-zero-intent/#respond Wed, 14 Mar 2018 23:05:07 +0000 http://blog.codalism.com/?p=2391 A zero intent sample is a sample which will only satisfy our validation goal if no positive examples are found in it. If we have a population (in e-discovery, typically a document set) where one in R instances are positive (one in R documents relevant), and we only want a one in Q probability of sampling a positive instance, then our sample size can be no more than R / Q.

In e-discovery, we're often dealing with collections or collection segments that have very few relevant documents in them. And not infrequently we need to prove that there are indeed very few relevant documents in these segments. For instance, to validate that our production has achieved adequate recall, we need to demonstrate that the number of found relevant documents exceeds the unfound relevant documents in the discard pile by some margin. Now, absolute proof can only be achieved by reviewing all documents in the discard pile (and even then we have to consider reviewer error); but probabilistic proof is achievable using a random sample. And with very large segments and very low richnesses, the proof we require can sometimes be provided only if the sample finds no relevant documents at all. I refer to this as "sampling with zero intent".

In designing zero-intent samples, the most important relationship is that between the sample size s and the richness of the segment (proportion of relevant documents within it) r, under the constraint that we want to have p confidence that the sample will not turn up any relevant documents. Assume p is pre-determined, say at 95%. For a given richness r, what is the largest sample size s that we can (as it were) risk? For a given sample size, what is the highest richness that is safe?

The exact answer to these questions involves solving the equation

    \[  p = (1 - r) ^ s \]

for the desired variable. A simple and reasonably accurate approximation, however, can be found from the observation that, for small r and s \ll 1/r,

    \[ (1 - r) ^ s \approx 1 - r s \]

(by taking the first two terms of the binomial expansion). Therefore:

    \[ 1 - p \approx r s \quad ; \]

a simple formula to solve indeed!

To express this in handy-rule-of-thumb form, let's talk about richness as a rate r ^ {-1} rather than a proportion r. That is, instead of saying "0.01% richness", let's say "1 relevant document for each 10,000 documents". Similarly, let's talk not about the probability of successfully finding to relevant documents in the sample, but instead the rate of failure, q' = 1 / (1 - p). This gives us:

    \[ s = r' / q' \quad  .\]

In words, the maximum sample size is the ratio between the rate of richness and the accepted rate of failure. So, for instance, if we expect that 1 in 10,000 documents are relevant, and we only want a 1 in 20 probability of finding a relevant document in the sample, then our sample size must be no greater than 500 = 10,000 / 20. (Putting this back into the binomial formula, we find the exact probability of failure, ignoring the finite population adjustment, is 4.9%.)

By itself, creating a sample design to avoid finding relevant documents is not that interesting: simply make the sample as small as possible. We should, however, also be quoting some measure of confidence, such as a confidence interval, and that introduces the countervailing incentive to increase the sample size. I hope to discuss this constraint, particularly as it affects recall estimates, in my next post.

http://blog.codalism.com/index.php/sampling-with-zero-intent/feed/ 0
Back from the other side http://blog.codalism.com/index.php/back-from-the-other-side/ http://blog.codalism.com/index.php/back-from-the-other-side/#respond Wed, 14 Mar 2018 10:51:18 +0000 http://blog.codalism.com/?p=2388 Well, after a couple of years at FTI, and some, ahem, self-funded gardening leave, I'm back to consulting---and to blogging! More from me soon.

http://blog.codalism.com/index.php/back-from-the-other-side/feed/ 0
Off to FTI: see you on the other side http://blog.codalism.com/index.php/off-to-fti-see-you-on-the-other-side/ http://blog.codalism.com/index.php/off-to-fti-see-you-on-the-other-side/#comments Sun, 18 Jan 2015 09:49:38 +0000 http://blog.codalism.com/?p=2385 Tomorrow I'm starting a new, full-time position as data scientist at FTI's lab here in Melbourne. I'm excited to have the opportunity to contribute to the e-discovery community from another angle, as a builder-of-product. Unfortunately, this means the end of this blog, at least in its current form and at least for now. Thanks to all my readers, commenters, and draft-post-reviewers. It's been an entertaining experience!

http://blog.codalism.com/index.php/off-to-fti-see-you-on-the-other-side/feed/ 2
Confidence intervals on recall and eRecall http://blog.codalism.com/index.php/confidence-intervals-on-recall-and-erecall/ http://blog.codalism.com/index.php/confidence-intervals-on-recall-and-erecall/#comments Sun, 04 Jan 2015 10:07:48 +0000 http://blog.codalism.com/?p=2353 There is an ongoing discussion about methods of estimating the recall of a production, as well as estimating a confidence interval on that recall. One approach is to use the control set sample, drawn at the start of production to estimate collection richness and guide the predictive coding process, to also estimate the final confidence interval. This requires some care, however, to avoid contaminating the control with the training set. Using the control set for the final estimate is also open to the objection that the control set coding decisions, having been made before the subject-matter expert (SME) was familiar with the collection and the case, may be unreliable.

Another approach, first suggested by Herb Roitblat, is to calculate what Herb calls eRecall. This involves drawing a sample from the null set at the conclusion of the production, and comparing the ratio between richness on this null set sample, and the richness of the original control set. Since we are using the control set sample only to estimate overall richness and yield, not the richness of either the production or the null set, contamination with training is not an issue (though the method does not address concerns about the reliability of the coding decisions made on the control set at the start of the review process). A confidence interval can be calculated on eRecall using the Koopman ratio-of-binomial method; this appears to be the method that Herb himself recommends.

An apparent problem with the eRecall estimator is that collection and null samples overlap on the null set. We are in effect estimating null set yield twice and independently, and using the two estimates on different sides of the recall division. A little reflection suggests that this should increase the variability of our estimator, making our point estimate less reliable, and our confidence intervals wider. It is possible, for instance, for the null set sample to lead to a larger estimate of yield than the collection sample, in which case our estimate of recall will be negative---a nonsensical result. While this is perhaps unlikely to occur in practice, it does suggest problems with the estimator.

Let's examine the accuracy of the eRecall estimator with a simple production scenario. Assume that we have a collection of 1 million documents with 5% richness, and that we achieve 75% recall with 50% precision (though of course we don't know this in advance of sampling and estimation). Now consider three methods of estimating a confidence interval:

  • eRecall / Koopman: a collection sample of 1500 (for instance, the control set sample), and a null set sample of 1500, for a total sample size of 3000.
  • Direct method: a simple random sample of 1500 from the entire collection post-production, with an exact binomial confidence interval on the retrieved proportion of the relevant documents that turn up in the sample.
  • Segmented / Koopman: proportionally allocate a sample of 1500 documents between the production and the null sets (for our scenario, this works out to 113 for the production, and 1387 for the null set), then calculate a confidence interval using my recci package with the Koopman method. (The BetaBinomial-Half method is the default for my package, but I use the Koopman here for comparability; the results are very similar for this scenario.)

Note that the latter two methods ignore the control set sample, even if we have one, and just use the 1500 sample that in the eRecall method is allocated to the null set.

I repeatedly sample 10,000 times for the above scenario and sampling allocations, and calculate a confidence interval for each sample by the three methods described. Across these 10,000 repeated simulations, I calculate the average lower bound and upper bound on a 95% confidence interval on recall, along with the average width of the interval end-to-end, and the width from the true recall value (0.75) to the lower bound. (The code for this experiment is here: it relies upon Version 0.5.0 of my recci package.) Here are the results:

  Mean CI
Method Lower Upper Width True - Lower
eRecall 0.589 0.842 0.253 0.161
Direct 0.637 0.843 0.206 0.113
Segmented 0.650 0.829 0.179 0.100

As our intuition suggested, the confidence interval on eRecall is much wider than necessary: 40% wider than the segmented interval, and 23% wider than even the simple direct method. Moreover, the additional width is largely on the downside: the eRecall lower bound is 60% further from the true value than the segmented estimator. And this is despite using an effective sample of twice the size. Put another way, it is better to throw away the original control set sample and resample from scratch, than to use the control set sample within the eRecall estimator. And not using the control set sample also removes concerns about the reliability of the coding decisions made prior to review.

We can use the same experiments to consider the accuracy of the point estimates on recall that the three sampling methods produce. The results are as follows:

  Point estimate
Method Mean Std.dev.
eRecall 0.747 0.0635
Direct 0.751 0.0466
Segmented 0.751 0.0466

All methods are approximately unbiased, but the eRecall has considerably higher variance than the others. And again, eRecall has twice the effective sample size of the other methods, and brings with it concerns about the reliability of the control set labels.

These figures are for a single scenario only. I suspect that the performance of eRecall would be worse for lower collection richness, relatively somewhat better for higher. Nevertheless, in the light of these results, the use of the eRecall estimator cannot be recommended, either for point estimates or confidence intervals.

http://blog.codalism.com/index.php/confidence-intervals-on-recall-and-erecall/feed/ 3
Why training and review (partly) break control sets http://blog.codalism.com/index.php/why-training-and-review-partly-break-control-sets/ http://blog.codalism.com/index.php/why-training-and-review-partly-break-control-sets/#comments Mon, 20 Oct 2014 04:22:35 +0000 http://blog.codalism.com/?p=2344 A technology-assisted review (TAR) process frequently begins with the creation of a control set---a set of documents randomly sampled from the collection, and coded by a human expert for relevance. The control set can then be used to estimate the richness (proportion relevant) of the collection, and also to gauge the effectiveness of a predictive coding (PC) system as training is undertaken. We might also want to use the control set to estimate the completeness of the TAR process as a whole. However, we may run into problems if we attempt to do so.

The reason the control set can be used to estimate the effectiveness of the PC system on the collection is that it is a random sample of that collection. As training proceeds, however, the relevance of some of the documents in the collection will become known through human assessment---even more so if review begins before training is complete (as is often the case). Direct measures of process effectiveness on the control set will fail to take account of the relevant and irrelevant documents already found through human assessment.

A naïve solution to this problem to exclude the already-reviewed documents from the collection; to use the control set to estimate effectiveness only on the remaining documents (the remnant); and then to combine estimated remnant effectiveness with what has been found by manual means. This approach, however, is incorrect: as documents are non-randomly removed from the collection, the control set ceases to be randomly representative of the remnant. In particular, if training (through active learning) or review is prioritized towards easily-found relevant documents, then easily-found relevant documents will become rare in the remnant; the control set will overstate effectiveness on the remnant, and hence will overstate the recall of the TAR process overall. (If training has been performed purely by random sampling, though, and no other review has been undertaken, then this approach is roughly accurate.)

To restore the representativeness of the control set, we need to remove from it the equivalent documents to those that have been removed from the collection by review. There is, however, no general way that of doing this: we can’t in general say which of the control set documents would have been reviewed, had they been part of the collection. There are, it is true, particular cases in which this removal rule can be determined. For instance, if a set of documents were selected for training or review by a keyword query, then the same documents should be excluded from the control set. Even if all documents selections were by similar rules, however, keeping track of all these rules over the course of a TAR process quickly becomes infeasible.

One case in which control set exclusion can be performed is prospectively, based upon a review cutoff decision. If we undertake to review all documents with a certain relevance score or higher (and if all the documents already reviewed are above that relevance score), then the part of the control set with relevance scores below this threshold could be used to estimate the number of relevant documents that would be excluded from the review. We cannot, however, form a statistically valid estimate of the number of relevant documents above this threshold, unless we ignore the evidence of the above-cutoff documents that have already been reviewed. (To see this, consider that the estimated number of relevant documents above the threshold might be less than the number we had already seen in review.) We also have no principled way of handling any reviewed documents that happen to fall below the threshold, except to make worst-case assumptions (in particular, to exclude such documents in estimating recall).

The above points do not mean that control sets are not useful in guiding decisions about a TAR process, including the important decision of when to stop training and how much further review might be required. However, care must be taken in deriving guidance from control sets in the presence of training and review, and even then the resulting estimates should be recognized as not being statistically valid ones (that is, not strictly justified by sampling theory). In particular, practitioners should be wary about the use of control sets to certify the completeness of a production---besides the sequential testing bias inherent in repeated testing against the one control set, and the fact that control set relevance judgments are made in the relative ignorance of the beginning of the TAR process. A separate certification sample should be preferred for making final assessments of production completeness.

http://blog.codalism.com/index.php/why-training-and-review-partly-break-control-sets/feed/ 2
Total assessment cost with different cost models http://blog.codalism.com/index.php/total-assessment-cost-with-different-cost-models/ http://blog.codalism.com/index.php/total-assessment-cost-with-different-cost-models/#comments Thu, 16 Oct 2014 00:24:55 +0000 http://blog.codalism.com/?p=2305 In my previous post, I found that relevance and uncertainty selection needed similar numbers of document relevance assessments to achieve a given level of recall. I summarized this by saying the two methods had similar cost. The number of documents assessed, however, is only a very approximate measure of the cost of a review process, and richer cost models might lead to a different conclusion.

One distinction that is sometimes made is between the cost of training a document, and the cost of reviewing it. It is often assumed that training is performed by a subject-matter expert, whereas review is done by more junior reviewers. The subject-matter expert costs more than the junior reviewers---let's say, five times as much. Therefore, assessing a document for relevance during training will cost more than doing so during review.

With this differentiated cost model, we get the following average assessment costs (see my previous post for the interpretation of this table):


Richness Random Uncertainty Relevance Continuous
< 0.05% 19636.52 19010.29 19015.27 58589.09
0.05% -- 0.5% 211.56 50.72 52.14 61.69
0.5% -- 2% 23.65 7.34 9.64 11.68
2% -- 5% 5.04 3.76 6.55 8.50
5% -- 15% 2.22 1.88 4.34 6.54
> 15% 1.09 1.07 1.30 5.26


As one might expect, relevance selection (which aims to do much or all of the review effort during training) becomes more expensive than uncertainty selection. Moreover, whereas with a uniform cost model relevance selection was almost always best done in continuous mode (that is, all review is done by the trainer), it is frequently better with differentiated costs to stop relevance selection early and leave the tail of the documents for review.

This differentiated cost model assumes that (junior) reviewer decisions are final. In practice, however, that seems unlikely: if we trusted the cheaper junior reviewers so much, why not have them do the training as well? That suggests a third, QC step needs to be applied prior to production, giving us a three-step process of training, review, and production (or T-R-P). A simple form of pre-production QC would be that documents marked relevant by the junior reviewer are then sent for second-pass review by a more senior reviewer---perhaps the SME himself. In that case, we have three costs:

  1. Training documents: viewed once by SME, 5x cost
  2. Irrelevant reviewed documents: viewed once by junior reviewer, 1x cost
  3. Relevant reviewed documents: viewed first by junior reviewer, then by SME, 6x cost

Under this protocol, it is cheaper to handle a relevant document in training than in review (assuming that judgments made during training do not have to be checked during production). With this cost model, the total review costs of the different selection methods look like so:


Richness Random Uncertainty Relevance Continuous
< 0.05% 19641.38 19012.72 19018.25 58588.96
0.05% -- 0.5% 216.33 51.97 52.52 61.64
0.5% -- 2% 28.54 10.47 10.35 11.63
2% -- 5% 9.99 7.95 7.65 8.44
5% -- 15% 7.19 6.66 6.16 6.48
> 15% 6.08 6.05 5.17 5.18


Now, relevance selection is on average slightly cheaper than uncertainty selection, because there is less double-handling of responsive documents.

In the three-step, T-R-P protocol, the goal of the R step (of first-pass review) is to filter out non-responsive documents; that is, to increase precision. For the experiments described above, however, responsive documents make up the bulk of the assessment load. For topics with richness above 0.5%, average precision across the assessment process is around 65%. That seems to be substantially higher than what many vendors report for practical e-discovery cases. Perhaps the RCV1v2 collection is unrepresentatively "easy" as a text classification target. In these conditions, the R step has reduced value, as there are few irrelevant documents to filter out. If precision were lower---that is, if the number of non-responsive documents seen in assessment were higher---then the relative costs would be likely to change.

As noted above, however, the junior reviewers used in the R step will make errors. (The SME will make errors, too, but hopefully at a lower rate, and with more authority.) In particular, they will make errors of two types: false positives (judging irrelevant documents as relevant); and false negatives (judging relevant documents as irrelevant). The P-step approach described above, of only considering judged-relevant documents in second-pass review, will catch the false positives, but will miss the false negatives, lowering true recall. In practice, one would want to perform some form of QC of the judged-irrelevant documents, too; conversely, some proportion of the judged-relevant documents might be produced without QC.

The cost models presented in this post are simplistic, and other protocols are possible. Nevertheless, these results emphasize two points. First, conclusions about total assessment cost depend upon your cost model. But second, your cost model depends upon the details of the protocol you use (and other statistics of the assessment effort). Modeller beware!

Thanks to Rachi Messing, Jeremy Pickens, Gord Cormack, and Maura Grossman for discussions that helped inform this post. Naturally, the opinions expressed here are mine, not theirs.

http://blog.codalism.com/index.php/total-assessment-cost-with-different-cost-models/feed/ 1
Total review cost of training selection methods http://blog.codalism.com/index.php/total-review-cost-of-training-selection-methods/ http://blog.codalism.com/index.php/total-review-cost-of-training-selection-methods/#comments Fri, 26 Sep 2014 21:21:56 +0000 http://blog.codalism.com/?p=2218 My previous post described in some detail the conditions of finite population annotation that apply to e-discovery. To summarize, what we care about (or at least should care about) is not maximizing classifier accuracy in itself, but minimizing the total cost of achieving a target level of recall. The predominant cost in the review stage is that of having human experts train the classifier, and of having human reviewers review the documents that the classifier predicts as responsive. Each relevant document found in training is one fewer that must be looked at in review. Therefore, training example selection methods such as relevance selection that prioritize relevant documents are likely to have a lower total cost than the abstract measure of classifier effectiveness might suggest.

In this post, I am going to evaluate three selection methods—random, uncertainty, and relevance—by the total review cost required to achieve a target recall. For simplicity, I'll assume that every document costs the same amount to be labelled, whether in training or review. I'll set the target recall level at 80%, and refer to the cost metric as cost for recall 80 (cfr80). (Cormack and Grossman refer to the same measure as "80% recall effort".) A run consists of a certain amount of training, followed by review to the depth that achieves 80% recall. The total cost of a run is the training set size plus the review depth. A sequence of (potential) training iterations defines a set of runs, each of which switches from training to review at a given iteration, and each of which has its own cost. The run with the lowest cost is the one with the optimal point of switching from training to review (that is, the optimal training set size). The cost of the set of runs can be represented as a curve; the minimum cost is then the lowest point in this curve. One could also continue training until target recall was achieved in the training set alone, as in continuous training; this strategy is particularly attractive for the "relevance" selection method. An interesting question, then, is if the cfr80 achieved by relevance selection under continuous training is the minimum cfr80 for its run set.

Each training sequence begins with a randomly-sampled seed set. For the uncertainty and relevance selection methods, random sampling continues until at least one relevant document enters the training set, at which point we switch to the appropriate selection method (for random selection, of course, random sampling continues throughout). As with previous experiments, the RCV1v2 dataset with its 103 topics is used again. As implied by the finite population scenario, training and target sets are not held distinct; instead, I take 700,000 of the 804,414 RCV1v2 documents as the single, undifferentiated collection. (I'm holding the remaining 104,414 documents in reserve for experiments on sampling and estimation.) The classifier is Vowpal Wabbit 7.4, with logistic loss, a learning rate of 2.0, and a single pass. Training iterations are by steps of 100 to 2000; then of 200 to 5,000; then of 500 to 20,000; and finally of 2,000 to the end.

Let's begin with the cost curves for a particular topic, Topic E41.

Relative cost for 80% recall curve for Topic E41

Relative cost for 80% recall curve for Topic E41

There's a fair bit of information in this figure, so we'll take things bit by bit. The title provides the richness of the topic, along with the number of relevant documents that need to be located to achieve our target of 80% recall. (This target number is richness * 700,000 * 0.8, with 700,000 being the collection size.) The x axis states the training set size as a multiple of the target size, to facilitate comparison between topics of different richnesses. The y axis gives the cost to achieve 80% recall, again as a multiple of target size. (Note the log scale on the y axis, giving greater clarity to the curve minima.) The absolute minimum of relative cost for recall is 1.0, which would occur if all the documents seen, both in training and review, were relevant.

The green, blue, and red curves provide the relative cost for recall of the uncertainty, random, and relevance selection methods, given the training set size in the x axis. Lower values on this curve are better (indicating less total review cost). Observe that the (green) curve for the uncertainty method falls sharply with the initial training runs, indicating the quality of the predictive ranking rapidly improves, and so review depth quickly decreases. The improvement slows over time, however, and after a relative training set size around 0.5 (that is, a training set around 11,863 * 0.5 ≈ 6,000), the total cost curve begins to rise. This does not (necessarily) mean that the quality of the predictive ranking is actually getting worse; rather, it indicates that the cost in training exceeds the saving in review. In contrast, the cost curve for the relevance selection method decreases much more slowly; until relative training set size is almost 1.0 (≈ 11,000), relevance selection does not consistently beat random selection. After that, though, it drops quickly, and goes below the uncertainty sampling curve.

The trajectories of the cost curves for the different methods illustrate how the methods behave; what we really care about, however, is the minimum point in each curve, as this gives the actual total review cost that would be achieved (assuming, as we do throughout this post, perfect knowledge). These minimum costs are marked as notations along the y axis. Here, we can see that the minimum cfr80 values of the uncertainty and relevance methods are quite similar (the actual minimum relative cfr80 values are 1.36 and 1.41 respectively), and certainly both are much lower than the minimum for random selection (4.08). The protocol that achieves this minimum for uncertainty selection is to spend around 40% of the effort in training, and 60% in review; for relevance selection, to spend most of or all the effort in training; but total cost is much the same either way.

A question posed in my previous post was whether the relevance selection method did indeed achieve its optimal cost by following a continuous training protocol (that is, continuing to train until target recall is achieved in the training set, without a separate review step). This question can be answered (somewhat indirectly) by considering the line at which training set size and cfr80 are the same (shown as the dotted grey curve in the figure---curved because of the log y axis). When the cost curve of a method touches or cross this line, the training set by itself achieves target recall. Here, the relevance cost curve touches the train==cost curve at or near the cost curve's minimum, indicating that continuous training achieves
(near-)minimum cost on this set of runs.

Having understood the figure format for an example topic, let's look at a larger set of topics.

Relative cost for 80% recall curves for select topics.

Relative cost for 80% recall curves for select topics.

The above figure shows the cost curves for 14 randomly selected topics. (Selection was as follows: 3 randomly selected from topics with richness < 0.5% [top row]; 3 from topics with richness between 0.5% and 2% [second row]; 3 between 2% and 5% [third row]; 3 between 5% and 15% [fourth row]; and two above 15% richness [bottom row].) Cost curves for all 103 topics can be found here.

Considering the cost curves themselves, the general pattern is for uncertainty and relevance selection to track each other very closely for topics with richness below 1%. For richness above 1%, the uncertainty cost curve tends to fall faster and earlier than the relevance curve, with this tendency becoming more pronounced as richness increases; indeed, for topics with richness above 5%, relevance selection frequently leads to initially rising cost curves.

Considering minimum review cost, however, the difference in effort between relevance and uncertainty selection is only slight; only 3 topics show a difference of more than 13% one way or the other. That is, relevance and uncertainty selection have different trajectories, increasingly so for higher richness, calling for different protocols (for uncertainty, limited training followed by review; for relevance, continuous training throughout), but end up with very similar total costs.

Additionally, there are only three topics, all low richness ones (G159 [0.00% rich], E61 [0.05% rich], and C16 [0.24% rich]), in which the optimal stopping point for relevance selection is well before (> 10%) the stopping point for continuous training (that is, when recall is achieved by the training set alone). That is, continuous training is indeed the correct protocol to use for relevance selection. (Moreover, particularly for higher prevalence topics, using relevance selection without following continuous ranking through to the end can lead to very poor results.) For around a third of the topics, continuous training also gives total cost within 10% of optimal for uncertainty selection, too, but these are almost exclusively low richness topics (richness < 1%), where, as previously observed, the uncertainty and relevance cost curves are very similar. (It may well be that the two methods in fact uncover very much the same relevant documents during training, though I have not yet verified this.)

Meanwhile, it is clear from these graphs that random selection is a very expensive method for topics with low richness. It is only for topics with richness above 2.5% that one can be reasonably confident that random selection will not lead to more than twice the total review cost of uncertainty selection, and only for topics with richness above 10% that the additional cost is consistently less than 20%. There are benefits in random selection beyond total cost for binary recall—its simplicity, its lack of bias, the ability (with care) to use training data for estimation and testing purposes—but I would suggest that once richness goes substantially below 5%, these benefits are outweighed by the additional cost.

Finally, let's summarize these minimum total cost results, grouped by topic richness. Again, total cost is expressed as a multiple of the number of relevant documents needed to achieve 80% recall:

Richness Random Uncertainty Relevance
< 0.5% 2820.27 3003.53 3006.39
0.5% -- 2% 10.26 2.24 2.26
2% -- 5% 3.14 1.64 1.66
5% -- 15% 1.71 1.26 1.28
> 15% 1.06 1.04 1.03

Clearly, all methods struggle with extremely low richness topics. (For such topics, we really require a better seed set selection method than random sampling, a question that I hope to return to in a later post). There is no discernible difference in total cost between relevance and uncertainty selection. And encouragingly, the total review cost for these methods is only a little above 2 times the number of relevant documents for low richness topics, and less for higher ones (albeit on what is, in e-discovery terms, a very clean collection with relatively straightforward topics). Random selection does almost as well as the active methods for high richness topics, but becomes increasingly inefficient as topic richness falls.

Although there is little difference in minimum total cost between uncertainty and relevance selection, the trajectories by which the two methods achieve these similar total costs are quite different. As mentioned in my previous post, the continuous training method that relevance selection supports has a simpler stopping decision than uncertainty selection (one-, rather than two-dimensional). On the other hand, the generally flat minimum of uncertainty selection, combined with the gentle rise in total cost once this minimum is passed, suggests that the first stopping decision in uncertainty selection (that is, when to stop training) may not be too sensitive, while uncertainty selection may be more flexible in allowing the separation of training and review. The cost curves also suggest the attractiveness of a composite strategy, for instance uncertainty selection at the start, moving to relevance selection later on (perhaps joined with some random selection for high-richness topics).

It is important to note, however, that the above discussion paints an overly clean picture, because of the assumption of perfect knowledge of cost for recall at any point, and of the trajectory of that cost with further training. In reality, we lack such perfect knowledge, and have to rely on estimation methods, rules of thumb, and experience. Actual decisions about strategy and stopping points are made in a fog of uncertainty; and this fog becomes thicker as richness drops and sampling methods lose traction. This, too, will be a subject of future posts.


In the above table showing average costs by different selection methods for topics of different richness bands, the random method is shown as cheaper on average than the other two methods for topics with richness below 0.5%. In fact, random is considerably more expensive on almost all such topics. However, on the two topics with the lowest richness, random is slightly cheaper; and the costs of these two topics are so high, that these two topics outweigh all the others in the average. Now, these two topics have 5 and 31 relevant documents respectively out of the population of 700,000; that is, they have richness below 0.01%. Why the random method does better on these two topics is unclear; it may simply be due to chance. In any case, such extremely low richness topics are doubtful choices for predictive coding altogether, and certainly very poor choices for random seed sets. If place topics with a richness lower than 0.05% (1 document in 2,000) in a separate category, we get the following table:

Richness Random Uncertainty Relevance
< 0.05% 11349.83 12324.16 12312.65
0.05% -- 0.5% 78.62 11.41 11.32
0.5% -- 2% 10.26 2.24 2.26
2% -- 5% 3.14 1.64 1.66
5% -- 15% 1.71 1.26 1.28
> 15% 1.06 1.04 1.03

Thanks to Ralph Losey for pointing out this anomaly (and urging me to do something about it!).

http://blog.codalism.com/index.php/total-review-cost-of-training-selection-methods/feed/ 9
Finite population protocols and selection training methods http://blog.codalism.com/index.php/finite-population-protocols-and-selection-training-methods/ http://blog.codalism.com/index.php/finite-population-protocols-and-selection-training-methods/#respond Mon, 15 Sep 2014 01:17:00 +0000 http://blog.codalism.com/?p=2212 In a previous post, I compared three methods of selecting training examples for predictive coding—random, uncertainty and relevance. The methods were compared on their efficiency in improving the accuracy of a text classifier; that is, the number of training documents required to achieve a certain level of accuracy (or, conversely, the level of accuracy achieved for a given number of training documents). The study found that uncertainty selection was consistently the most efficient, though there was no great difference betweein it and relevance selection on very low richness topics. Random sampling, in contrast, performs very poorly on low richness topics.

In e-discovery, however, classifier accuracy is not an end in itself (though many widely-used protocols treat is as such). What we care about, rather, is the total amount of effort required to achieve an acceptable level of recall; that is, to find some proportion of the relevant documents in the collection. (We also care about determining to our satisfaction, and demonstrating to others, that that level of recall has been achieved—but that is beyond the scope of the current post.) A more accurate classifier means a higher precision in the candidate production for a given level of recall (or, equivalently, a lesser cutoff depth in the predictive ranking), which in turn saves cost in post-predictive first-pass review. But training the classifier itself takes effort, and after some point, the incremental saving in review effort may be outweighted by the incremental cost in training.

In fact, the real goal is not to find a certain proportion of the collection's relevant documents in (post-predictive) review, but to find that proportion by whatever means available. Put another way, the collection is of finite size, and so our recall target amounts to finding a finite number of relevant documents (though knowing what this number is is a tricky question in practice); a task Dave Lewis refers to as finite population annotation. Each relevant document found prior to review—in particular, in training—is one fewer that has to be found in the review stage. (Of course, the precise cost saving depends upon the protocol employed. If we have expensive experts train, and cheap reviewers do the first-pass review, then the saving is less than 1:1. If, though, every relevant document found in first-pass review is then checked by an expert reviewer in second-pass review, the saving may be greater than 1:1.)

Going further, the (near-)equivalence of finding relevant documents in training or review can be taken to its logical conclusion by entirely removing the separate review stage, and continuing the training until all relevant documents are found. Such a process is referred to as "continuous ranking" by Catalyst, or "continuous (active) learning" by Cormack and Grossman. Other vendors refer to it as "prioritization" (though frequently the prioritization stage follows a more regular training stage, and may be performed as first+second pass review rather than single-reviewer training). Relevance selection is primarily motivated by this mode of processing, though it might be that mixing in other selection types would give better efficiency (this certainly is Catalyst's approach), or conversely that periods of relevance selection would be optimal in a process with separate training and review steps. (Another way of looking at this process, when used with relevance selection, is that we skip training and start review immediately, but then feed back the reviewed documents to a predictive coding system to improve the review order.)

With the above discussion in mind, we can identify three possible (not necessarily mutually exclusive) protocols to achieve a target level or recall:

  1. Train until classifier accuracy "plateaus" (that is, stops improving noticeably); then review the subsequent predictive ranking to the depth required to achieve recall. I'll refer to this as "train–plateau–review", or TPR for short.
  2. After each round of training, look at the current predictive ranking, and try to determine whether it would be more efficient to do more training to improve this ranking, or to review the ranking as it is. I'll refer to this as "train–rank–review", or TRR for short.
  3. Continue "training" until you've achieved target recall in the positive training examples alone. I'll refer to this protocol as "continuous training", or CT for short.

(A fourth—or perhaps we should say zero'th—protocol is widely used, which is to train to the plateau, then review all and only the documents to which the classifier gives a positive prediction of relevance. This protocol, however, does not guarantee that target recall is achieved; one has to be satisfied with whatever level of recall is embodied in the classifier's set of positive prediction. The other protocols give us the option of sacrificing precision—that is, reviewing further down the predictive ranking—in order to achieve target recall.)

Note that these three protocols involved stopping decisions of differing complexities. The CT protocol has a single-dimensional stopping decision: have we achieved recall in the training set? The TRR protocol requires a two-dimensional stopping decision: how far down the current ranking do we have to go to achieve recall; and would additional training reduce this depth by more than the cost of the training itself? (I owe this distinction and its visualization in terms of dimensions to Gord Cormack). The complexity of the TPR protocol is intermediate, and can be likened to two sequential one-dimensional decisions: first, has the classifier stopped improving?; and second (and only then), what cutoff in the resultant ranking achieves recall?

The significance of TRR's greater stopping-rule complexity over CT depends on the sensitivity of the overall cost to an inaccurate choice in TRR's first dimension; that is, to training moderately too little or moderately too much. If the sensitivity is not great, then TRR's additional complexity is not of great concern; we might, for instance, look at total cost estimates in training iterations to date, and stop training when these estimates start trending up (an approach which would also assume that the historical cost curve is relatively smooth; that is, without sharp local minima that are distant from the global minimum). Furthermore, if it turns out that optimal training amount for TRR generally occurs near when classifier effectiveness plateaus, then the TPR and TRR protocols differ little in their practical results.

(Of course, the complexity also depends upon the method we use for estimating target recall. I'm avoiding the direct tackling of that problem here, though the general idea I have in mind is that we're using control-set sampling methods for estimation. In practice, all of these decisions will be made in a fog of uncertainty induced by our only seeing what happens on the sample control set, not on the full population. As a recent line of discussion has emphasized, this fog becomes increasingly thick as collection richness becomes increasingly low. Grossman and Cormack suggest in their recent law journal article that observed collection richness is dropping so low that reliable sampling becomes too expensive and we are forced to incorporate other sources of evidence to determine a defensible stopping point; if so, the question of stopping rule complexity becomes even more vexed.)

Hopefully, the above discussion has not been too abstract to follow. We'll be asking the above questions of real experimental data in subsequent posts, which should make the issues more concrete. This post, however, should not be regarded as a mere preliminary to the experimental work. Understanding the differences between these protocols, how they interact with selection methods, and their implications for practice is certainly fundamental to, but perhaps in itself more important than, the particular answers that experimental results produce. If we ask the wrong questions, we'll only end up with unhelpful answers.

Thanks to (in reverse alphabetical order) Jeremy Pickens, Dave Lewis, Paul Hunter, Maura Grossman, and Gord Cormack for their helpful comments on this post and general discussion of these issues. Of course, this post represents my opinions, not necessarily theirs.

http://blog.codalism.com/index.php/finite-population-protocols-and-selection-training-methods/feed/ 0
Research topics in e-discovery http://blog.codalism.com/index.php/research-topics-in-e-discovery/ http://blog.codalism.com/index.php/research-topics-in-e-discovery/#comments Fri, 08 Aug 2014 01:08:48 +0000 http://blog.codalism.com/?p=2171 Dr. Dave Lewis is visiting us in Melbourne on a short sabbatical, and yesterday he gave an interesting talk at RMIT University on research topics in e-discovery. We also had Dr. Paul Hunter, Principal Research Scientist at FTI Consulting, in the audience, as well as research academics from RMIT and the University of Melbourne, including Professor Mark Sanderson and Professor Tim Baldwin. The discussion amongst attendees was almost as interesting as the talk itself, and a number of suggestions for fruitful research were raised, many with fairly direct relevance to application development. I thought I'd capture some of these topics here.

1. Classification across heterogeneous document types

Most work on text classification to date has been performed upon homogeneous document collections such as RCV1v2---that is, collections in which all of the documents are of much the same type, format, and length (in the case of RCV1v2, all news reports). In e-discovery, however, we're faced with a very heterogeneous set of documents: emails, reports, spreadsheets, presentations, pictures, and more. And each of these types can be divided into subtypes: personal versus business emails; contracts versus business reports; numerical spreadsheets versus tabular arrangement of predominantly textual data; and so forth. There are also differences of format that cross these type boundaries: OCR'ed versus native-format versus plain text, for instance. And along with differences in type come differences in other documents features: 100-page reports mixed in with 100-character emails; Chinese documents alongside English ones. Some coarse-grained separation of types is often performed manually, based frequently upon eliminating document types that text classifiers are believed to perform poorly on (images are generally pulled out for manual review, and spreadsheets may join them). A heterogeneous mix of document types remains to be fed into the text classifier, however. We need more research on the use and effectiveness of classification technology in these environments. Can the documents be classified effectively, and without bias against certain document types, by treating them as if they were an undifferentiated set? Do our existing, rather ad-hoc length normalization techniques work when faced with such enormous differences in document length? Do we need to build separate classifiers for each document type---or, since that seems both wasteful and subject to categorization errors, can a committee of classifiers share some training data while specializing in different types?

2. Automatic detection of document types

To some extent, document types can be detected by file extensions or other signature data. But, as mentioned before, there are still subtypes within each of these file types---and subtypes that cross file type boundaries (contracts, for instance, could be in Word format, or PDF, or OCR'ed from images). There is also a certain degree of commonality between document types across different organizations (most corporate email repositories contain both business and personal emails), though there will also be categories that only occur in certain collections (source code for software firms; CAD drawings for engineering firms), and the degree of granularity will vary between firms and cases ("contracts" may be a sufficient category for a baked goods manufacturer, but different types of contracts should probably be differentiated for a law firm). Identifying these document types can help with early case analysis; with culling and collection decisions; and with classification itself (particularly, of course, if we find we need to specialize our classifiers to different document types).

3. Faceted categorization

Most text categorization (supervised or unsupervised) technology assumes that categories are all of the one type, or at most are hierarchical. But in e-discovery, as we've seen above, at least two different types of categorization are of interest: by content, and by type of document. Conflating these two category types can lead to unhelpful confusion in outputs: document clustering schemes, for instance, in which one cluster represents "on-site inspection reports", and another represents "documents that have been OCR'ed" (with OCR'ed inspection reports possibly fitting into the former, possibly into the latter category). A simple solution to this issue would be to perform two categorizations, one by content, the other by type, each presumably with different feature sets. A fuller solution, however, would be able to factor out differences correlated with one facet when making the categorization in the other (factor out differences caused by OCR'ing, for instance, when trying to identify all the on-site inspection reports).

4. Label propagation across related documents

Text classification and categorization work generally assumes that each document is an entirely distinct entity; that the only relations to be found between documents are those identified by the algorithm itself. In e-discovery, however, there are many different types of connection between documents, including duplicates and near-duplicates, document families, and threads. Different cases will have different formal rules about label propagation (for instance, that if any document in a family is responsive, then the whole family is responsive). These associations also need to be taken into account in designing classification algorithms. When learning from or predicting the label of an email, for instance, how should we account for the other emails in the thread? How should we treat quoted reply text---or the absence of such reply text? Dave Lewis points out that there has been work on collective classification (for instance, using conditional random field models), but this is yet to be applied to e-discovery.

5. Identifying unclassifiable documents

In binary classification systems, there are (by definition) only two categories: for e-discovery, typically "responsive" or "non-responsive". For predictive rankers, documents are assigned degrees of membership to these classes, but the ranking is still along the real number line, for different degrees of responsiveness or non-responsiveness. In e-discovery, though, we really require a third category, of uncategorizable: that is, of documents that the classification system not only can't categorize well now (as detected in active learning), but likely will never be able to categorize adequately. Such documents should then be removed from the classification system and sent for manual review (or perhaps routed to a more specialized classification sub-system). As mentioned before, this distinction is often made in a coarse, manual way at collection time by file type (images don't go to predictive coding, etc.). But many difficult-to-categorize documents still get through, and in any case it is preferable to have automated fail-safes in case manual processes fail.

6. Identifying poor training examples

Some documents make poor training examples. Many systems currently require humans to identify which documents make bad examples, but that is a brittle process, as humans cannot always tell when a document is a bad example or not. Identifying documents that are not helpful in training the classifier is likely related to identifying documents that the classifier is not going to be helpful in predicting responsiveness for, though the relationship may be complex: a labelled training example may seem to provide evidence to classify a set of similar documents, but this evidence may be misleading.

7. Identifying significant fragments in non-significant text

Sometimes, only a section of a document is significant (in particular, responsive), while the rest is unimportant or boilerplate. This distinction is sometimes made at training time, when the trainer is asked to highlight the section of a document that is highly responsive (which, as with identifying poor training examples, is a fraught process); but the distinction is also relevant at prediction time. An important sub-category of this problem is identifying repetitive boilerplate in a set of documents and dealing with it appropriately at classification time. There may be a standard format for contracts, for instance, in which only a small number of conditions are varied; or the organization may have standardized reports, where directions and field names are fixed, and only field values vary. If one version of (for instance) a contract is labelled as non-responsive, then a naive classifier is likely to regard all other versions of the contract as non-responsive, due to the weight of common text; but it may be the differing text that is crucial (perhaps the first contract was with an irrelevant company, but another is with a company related to the dispute). It is not sufficient, however, just to ignore the common text altogether, because then the human trainer is going to end up seeing the contract again and again, with no way of indicating that "no contract of this type is responsive". Near-duplicate systems commonly detect this condition at review time, and the better systems will ask the one reviewer to review all such similar documents at once, with the differing text highlighted. Possibly a similar system needs to be implemented during training.

8. Routing of documents to specialized trainers

It is common to rout documents by type to specific reviewers, as well as to cluster documents prior to review so that the one reviewer is reviewing similar documents at the one time. A similar routing could be performed on training documents, if there were multiple trainers. For instance, if we were able to detect more difficult documents, these could be routed to more senior experts. Such a routing could work particularly well in the continuous learning framework advocated by Gord Cormack and Maura Grossman and implemented by Catalyst, in which there is no separate training step, but all reviewed documents are using for training, and the trained classifier is used to prioritize review. An objection to the continuous learning approach is that junior reviewers make too many mistakes to be reliable trainers; if there were a method for routing difficult documents to more senior reviewers, that objection might be allayed.

9. Total cost of annotation

Existing work on the cost of text classification is limited, and is mostly focused on the tradeoff between number of training annotations and rate of increase in the learning curve. In e-discovery, however, the cost formula is more complex and more interesting. To begin with, we are building a classifier for a fixed, known population, not for general use on an unknown future population (in technical terms, we are building a transductive, not an inductive, classifier). Moreover, training and prediction sets are the same; each document annotated in training is one fewer documents whose responsiveness must be predicted (a condition that Dave Lewis refers to as "finite population annotation"). In addition, in most e-discovery practice, no document is produced without human review, so a document trained is a document that does not have to be reviewed. For these reasons, building the optimal classifier may not be the optimal solution to the e-discovery problem---something that the advocates of continuous learning assert. And to make the equation even more interesting, our annotation budget needs to be split between training and testing: allocating a greater proportion to training will increase the actual effectiveness of our classifier, but decrease our ability to accurately measure that effectiveness; conversely, allocating a greater proportion to testing will give us a more accurate measure of a less accurate classifier.

Please do let me know if you know of any existing predictive coding systems that offer solutions to these problems; or if there are other important research topics in e-discovery that I have missed.

Thanks to Dave Lewis and Paul Hunter for their comments on a draft version of this post.

http://blog.codalism.com/index.php/research-topics-in-e-discovery/feed/ 1
Random vs active selection of training examples in e-discovery http://blog.codalism.com/index.php/random-vs-active-selection-of-training-examples-in-e-discovery/ http://blog.codalism.com/index.php/random-vs-active-selection-of-training-examples-in-e-discovery/#comments Thu, 17 Jul 2014 06:02:38 +0000 http://blog.codalism.com/?p=2089 The problem with agreeing to teach is that you have less time for blogging, and the problem with a hiatus in blogging is that the topic you were in the middle of discussing gets overtaken by questions of more immediate interest. I hope to return to the question of simulating assessor error in a later post, but first I want to talk about an issue that is attracting attention at the moment: how to select documents for training a predictive coding system.

The catalyst for this current interest is "Evaluation of Machine Learning Protocols for Technology Assisted Review in Electronic Discovery", recently presented at SIGIR by Gord Cormack and Maura Grossman. In fact, there are two questions, somewhat conflated in Maura and Gord's paper. The first is, should the initial seed set of training examples be selected by random or by judgmental sampling (for instance, based on a keyword query)? The second is, having seeded the training cycle, how should subsequent training examples be selected? I'll tackle the second issue here; I hope to return to the first in a later post.

Gord and Maura describe three ways that we could select post-seed training documents. These are:

  1. Random Draw a simple random sample of the unlabelled documents in the collection.
  2. Uncertainty Select those documents whose responsiveness the predictive coding system is most uncertain about.
  3. Relevance Select the unlabelled documents the predictive coding system thinks are most likely to be responsive.

Both Uncertainty and Relevance are forms of active learning, in which the output of the predictive coding system guides the choice of training examples. In contradistinction, random selection is sometimes referred to as passive learning. There are other possible methods (for instance, we could ask the subject matter expert to search for new training examples; or one could seek to diversify the selection), and variations and mixtures of the above. But here we will look at these three only, and each in its simplest form.

All three of the above methods are used and advocated in e-discovery practice. Uncertainty is perhaps the predominant mode, but Herb Roitblat (among others) is a supporter of random or passive selection. Relevance-based selection, meanwhile, is use in the "continuous active learning" advocated in Gord and Maura's paper, and is a component of continuous ranking method deployed by Catalyst (along with a mix of complementary selection methods).

There are also two different criteria to consider in evaluating selection methods:

  1. Which method improves classifier accuracy with the least training documents?
  2. Which method achieves production recall with the minimum total coding effort (both training and review)?

In protocols where no document is produced without human review, these criteria are distinct. Here, the goal of classifier improvement is reduced review effort; but at some point, the marginal reduction in review cost may be less than the additional cost of training. Moreover, each likely-responsive training document is one fewer document to be reviewed. This is the motivation of the continuous approach: training and review are the same (continuous) process, and likewise this process "continues" until sufficient relevant documents are found. Nevertheless, in this post, I am only going to consider which selection method improves classifier accuracy most efficiently. Besides avoiding the complexity of a fuller cost model, this choice allows us to analyze classifier effectiveness in isolation. I hope to examine total production cost in a later post.

(There is a third criterion by which we could evaluate selection methods: whether they lead to a biased classifier, one that covers certain sub-topics better than others. Random selection aims to avoid this bias. Once more, I hope to look at this in a later post.)

I'm also going to throw in a fourth document selection pseudo-method:

  1. Balanced Randomly sampling an equal number of responsive and non-responsive training examples at iteration.

Of course, this method cannot be implemented in practice, since we don't know at selection time which documents are responsive and which are non-responsive (if we did, we would not need a predictive coding system!). Introducing the "Balanced" method, however, allows us to examine the effect on classifier effectiveness of the distribution of positive and negative training examples. (We can implement "Balanced" in simulation studies, because all labels are oracularly known to us, even if they are withheld from the classifier for experimental purposes.)

The experimental data I'll use for this study is the RCV1v2 dataset, a standard dataset for text classification work. RCV1v2 consists of slightly over 800,000 Reuters newswire articles, annotated with 103 hierarchical news categories. The data is, therefore, not typical of e-discovery data; but in compensation, it is relatively free of noise, provides many topics to investigate, is fully annotated, and is freely available (so please try replicating my results!).

Experimental details: I use the tokenized version of the dataset, from Online Appendix 12. The classifier is Vowpal Wabbit version 7.4, with the flags --loss_function=logistic --learning_rate=2 --passes=1 --raw_predictions. All methods, including the "balanced" method, start off with an identical random sample of 100 documents. Subsequent iterations are of 100 documents each, up to 1500 documents. The target set from which training examples are drawn is a random sample of 200,000 documents; a disjoint random sample of 20,000 documents is used as the test set. The evaluation metric is Maximum F1; that is, the highest F1 achieved across all cutoffs of the ranking on the test set.

The following figure shows the Max-F1 learning curves (solid lines) for each of our 4 selection methods across 14 representative topics. (I include only topics with richness above 1%, to match the random seed set size of 100.) The figures also show the proportion of responsive documents in the training set (dotted lines):

Learning curves for different example selection methods

Learning curves for different example selection methods

(Click here for a scalable PDF version.)

Two things stand out from these results. First, the Uncertainty method (solid green line) outperforms the Random and Relevance methods (red and black lines) in almost every case. And second, the Balanced pseudo-method (blue line) performs as well as the Uncertainty method for larger training sets, and better for smaller sets on low richness topics. In fact, the richness of training data of the Uncertainty method follows the Balanced method quite closely in most cases; and it is when Uncertainty has fewer positive examples than Balanced that the former method underperforms the latter. Since the Uncertainty method selects training documents that it predicts are 50% likely to be responsive, it is not surprising that it creates a relatively balanced training set; perhaps balance is as important as the selection of uncertain documents per se for Uncertainty's superior performance.

A closer examination of the Random and Relevance methods confirms the importance of training example balance. Random sampling produces approximately the same richness in the training set as in the collection, and does very poorly with low richness topics, though not so badly with higher richness ones. Conversely, the Relevance method does respectably for low richness topics, but poorly for high richness ones. Indeed, for the latter topics, many selection iterations produce almost solely responsive examples. Of course, this is in part what the Relevance method sets out to achieve; but the effect on classifier effectiveness is marked: for several high-richness topics, classifier effectiveness under the Relevance method actually decreases with additional training data.

The previous figure gave results for 14 topics; now let's look across all 51 topics with richness of at least 1%. I'll divide them into topics with richness of 3.4% or lower (26 topics), and those with richness above 3.4% (25 topics). The Max F1 score at each training set size for each selection method, averaged across topics, is as follows:

Mean MaxF1 across topics for different selection types

Mean MaxF1 across topics for different selection types

The trends we observed previously are confirmed: Relevance outperforms Random selection for low richness topics; Random outperforms Relevance for high richness topics; but Uncertainty outperforms both methods in both conditions. Balanced selection quickly rises to its asymptotic effectiveness, suggesting that one of the main challenges early in training is to find sufficient responsive training documents (though examining the results of the Relevance method for individual topics in the first figure alerts us of the danger of getting "stuck" in a sub-topic by trying too directly to focus on responsive training examples). With longer training runs, though, the Uncertainty method achieves a higher asymptotic effectiveness than Balanced selection---an interesting finding, that suggests Uncertainty selection does work to exclude poor training documents.

These results should be compared with those of Dave Lewis and William Gale, "A Sequential Algorithm for Training Text Classifiers" (SIGIR, 1994), and particularly of the corrigendum. Lewis and Gale likewise compare the Uncertainty and Relevance methods, although with a different classifier and a different dataset. In particular, all of the topics considered had very low richness (from 0.07% to 0.49%); Lewis and Gale get around the bootstrapping problem on such low richness topics by starting with 3 randomly selected positive examples. On these low-richness topics, the Relevance method gave similar performance to Uncertainty selection, though with smaller training set sizes, Uncertainty was generally superior. The average performance of the different methods on RCV1v2 topics with richness below 1% (but starting with a random sample of 100 documents) is shown in the following figure:

Mean MaxF1 across topics with richness below 1%

Mean MaxF1 across topics with richness below 1%

These results are similar to Lewis and Gale: on such low richness topics, Uncertainty is only marginally better (on average) than Relevance selection. For such low richness topics, though, it is unlikely that the seed set would be created by the tiresome process of randomly sampling till enough positive examples occur. I hope to return to examine these low-prevalence cases with an alternative seed set selection method (such as querying).

What conclusions can we draw from this study? First, viewed solely from the proportion of relevant documents found (and ignoring questions of possible bias), the Random selection of training documents is a poor choice, unless richness is quite high (say, at least over 5%, preferably over 15%). In terms of achieving classifier performance, the Relevance selection method is also a poor choice, except for very low richness topics (below 1%). For classifier performance itself, the Uncertainty method is the clear winner amongst the methods considered, outperforming even the unimplementable Balanced method for larger training set sizes.

As mentioned above, though, what we generally care about in e-discovery is not classifier quality itself, but the cost to achieve a certain recall (that is, find a certain finite number of relevant documents). In protocols where every produced document must be reviewed, selection methods that pick more responsive documents for training are preferable (other things being equal), because they leave fewer documents for review. I hope to examine the efficiency of the different models under this cost model in a later post; for now, we can note that under such a model, the Random method will look even worse, the Relevance method somewhat better, since they produce low and high (respectively) richness in their training sets.

Finally, note that the Random, Uncertainty, and Relevance methods are implemented here in their simplest and "purest" versions. For the Uncertainty method, for instance, it would be possible (and perhaps preferable) to draw a sample of documents from around the point of maximum predictive uncertainty, rather than exactly the 100 most uncertain documents. Or one could simply combine all three methods, drawing a certain number of training documents by each method, and perhaps adjusting the mix as the training went on. Each method might then supply some of the wants of the other. For now, though, it appears that (at least for classifier quality), Uncertainty selection is the method to beat.

Thanks to Gord Cormack, Maura Grossman, Dave Lewis, Jeremy Pickens, and Herb Roitblat for comments which helped improve this post. The opinions expressed here, of course, are my own, and not theirs.

http://blog.codalism.com/index.php/random-vs-active-selection-of-training-examples-in-e-discovery/feed/ 5