Comments on: Total review cost of training selection methods
William Webber's E-Discovery Consulting Blog

By: Joe Howie Fri, 19 Dec 2014 21:48:55 +0000 Here's a high-level look at the process: Documents are clustered based on visual similarity, not on an analysis of associated text. For example, if a Word document is later saved directly to PDF, and a copy is printed, scanned, and saved as TIF, visual classification will place all three copies in the same cluster whether or not the TIF and PDF copies have any text.
Clustering can be scaled with hardware to process enterprise-size collections.

These clusters are reviewed by the review team, with the largest clusters being reviewed first. Clusters that are clearly irrelevant can be discarded based on a knowledge of what is in the cluster, and documents in clusters that are clearly relevant can all be deemed relevant. Clusters that may contain some but not all relevant documents can be passed on to the review platform, where like documents can be batched to the same reviewers, speeding review and helping ensure consistency.

The review process can begin as soon as documents have been collected and the clustering has begun. Relevance decisions are persistent and once made are applied to documents that are later added to the cluster.
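As a rough illustration of the two points above (largest clusters first, and decisions that persist for documents added later), here is a minimal sketch. The class `ClusterCoding` and its method names are hypothetical and are not part of any vendor's product; the clustering itself is assumed to have been done elsewhere.

```python
from collections import defaultdict


class ClusterCoding:
    """Hypothetical store of cluster-level relevance decisions."""

    def __init__(self):
        self.members = defaultdict(list)  # cluster id -> list of document ids
        self.cluster_of = {}              # document id -> cluster id
        self.decisions = {}               # cluster id -> label chosen by the review team

    def add_document(self, doc_id, cluster_id):
        # Documents may arrive at any time, including after a decision was made.
        self.members[cluster_id].append(doc_id)
        self.cluster_of[doc_id] = cluster_id

    def decide(self, cluster_id, label):
        # The decision is stored per cluster, not per document.
        self.decisions[cluster_id] = label

    def review_order(self):
        # Undecided clusters, largest first.
        pending = [c for c in self.members if c not in self.decisions]
        return sorted(pending, key=lambda c: len(self.members[c]), reverse=True)

    def label_for(self, doc_id):
        # A document inherits its cluster's decision, or None if undecided.
        return self.decisions.get(self.cluster_of[doc_id])
```

Because the label lives on the cluster, a document added after `decide` was called picks up the decision through `label_for` without being looked at again.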

BeyondRecognition also has "Find" technology that provides search capability with operators based on absolute page coordinates (i.e., within a certain range of coordinates on a page) or on relative page coordinates (i.e., one range of coordinates with a specified position relative to another) as a way to further refine decision-making. Visual classification technology also has advantages beyond culling, e.g., redactions can be made based on the page locations of text values, or can be done zonally within clusters to redact even handwriting.

If visual classification has been implemented as a core information governance tool, the organization may also have applied document-type labels to the clusters as a way to assist in applying retention schedules, setting user access permissions, detecting PII, and determining where the content should be stored, and those labels can be used when first gathering or collecting documents. For example, the various clusters that have been labelled as "invoices" will often not be remotely relevant to most litigation and need not be collected.

Tying selection to ongoing information governance minimizes the ongoing e-discovery problem of continually reinventing the wheel. For example, if the files or documents of someone who is routinely provided copies of invoices are included in multiple litigation discovery collections, how many times does the company have to pay to have all those invoices processed and evaluated to determine that they are not relevant?

By: gvc Fri, 12 Dec 2014 20:52:00 +0000 Joe,

You and John have commented in practically every edisco blog that your "visual clustering" tool will effectively find relevant material in textual and non-textual documents alike.

It is not clear to me from reading your marketing material what your process is or why it would work. I understand that you identify "glyphs" and do clustering, but that's not enough to code my documents.

In contrast, I believe I understand the role of Catalyst's diversity sampling and Relativity's "stratified" sampling -- both are methods of identifying training documents for a learning algorithm. Perhaps yours is more similar to Relativity's, in which you select members of each cluster for training? Or do you manually review each cluster?

Can you point me to a study that elaborates your process and validates your effectiveness claim?


By: Joe Howie Thu, 11 Dec 2014 17:13:04 +0000 All discussions about text analysis ought to be tempered with the knowledge that in some collections significant percentages of documents, sometimes well above 25-30%, do not have any related text or have only poor-quality text, e.g., documents saved as PDF from native applications, or scanned documents. Parties who base initial selections on keyword searches will miss collecting those documents. Recall as measured against what was handed off to the text analysis system may vary considerably from what true recall would have been had the non-text or poor-text documents been included and accurately evaluated.
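The gap between measured and true recall can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the counts in the usage note are invented for the example, not taken from any real collection.

```python
def measured_recall(found_relevant, relevant_with_text):
    """Recall against only the documents handed to the text analysis system."""
    return found_relevant / relevant_with_text


def true_recall(found_relevant, relevant_with_text, relevant_without_text):
    """Recall against all relevant documents, including those with no usable text."""
    return found_relevant / (relevant_with_text + relevant_without_text)
```

For instance, with invented counts of 800 relevant documents found, 1,000 relevant documents in the text-bearing set, and 250 relevant documents with no usable text, `measured_recall(800, 1000)` is 800/1000 while `true_recall(800, 1000, 250)` is 800/1250, so the same review looks noticeably weaker once the no-text documents are counted.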

Even if no-text or poor-text documents are ingested into the review process, there’s nothing there for text analysis to analyze. Having some metadata associated with the no-text or poor-text documents is far short of having text from the face of the documents.

When visual classification is used during litigation, the review team can examine a few documents per cluster to determine the potential relevance of all the documents in a cluster. This is similar in some ways to the descriptions I’ve read of Catalyst’s contextual diversity algorithm and Relativity’s stratified sampling technique, except that it is based on visual classifications of all documents, not on groupings from the subset of text-bearing documents.

By: william Tue, 09 Dec 2014 23:39:27 +0000 In reply to John Tredennick.


Hi! A great summary of a great process. Of course what I'm describing in these experiments are the purest and simplest implementations of the distinct strategies; a mix of strategies and of human knowledge is likely to get you better results than any single strategy used in isolation. And I entirely agree that diversity is an important issue that my experiments do not directly address.


By: John Tredennick Tue, 09 Dec 2014 23:22:33 +0000 This is another fascinating read for those of us interested in this important subject. These studies are important and it is critical that we keep the variables to a minimum in order to draw reasonable conclusions about the data.

In the real world, we use a mix of techniques to address the issues discussed above. Time will tell who has the right mix, but here is how we approach things (in short form without writing another blog post).

1. I won't take a position on whether you cull in advance. What I can say is that when you receive a production, you get what you get. But anything else is for a different posting.

2. We believe that you start by finding as many relevant documents as possible using any means at your disposal--from keyword search to witness interviews. Use all of these as initial training seeds.

3. We then believe you let the review teams loose on the documents. There is no need to have a senior lawyer do the training. It is expensive, delays the process and gains little.

4. We were CAL before CAL was cool (just riffing off an old country song). What I mean is that we feed reviewer judgments back into the system so it can keep learning. Training is review, review is training, as Dr. J would say.

5. However, we also worry about what you don't know. So, in every CAL batch we present a number of documents selected through an "uncertainty" process. The specific number in our mix is a trade secret.

6. Rather than selecting those uncertainty documents randomly, we wrote a contextual diversity algorithm which analyzes all of the documents, identifies those which we know the least about (as in the reviewers have not yet tagged similar ones) and then chooses a representative sample from each group.

7. By presenting the right mix of contextually diverse documents to the reviewers, we mix relevance feedback and uncertainty sampling as a key part of our process.

8. Doing this eliminates or minimizes the "you don't know what you don't know" syndrome. We are actively sampling around uncertainty.

9. Ultimately our goal dovetails with William's starting goal. We are trying to minimize review costs by making the review as efficient as possible. Starting with judgmental seeds rather than random ones, using a good mix of highly relevant docs for the review (since reviewers need to look at them anyway), and supplementing that mix with uncertainty sampling seemed to us to be the best way to construct our system. The research seems to bear this out.

10. Lastly, we measure results with a systematic sample, which means we sample the ranking from top to bottom. We felt this was at least a bit better than a final random sample. And every little bit helps reduce review costs.
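The systematic sample mentioned in item 10 can be sketched as follows. This is a generic textbook version (random start, then evenly spaced positions down the ranking), not a description of Catalyst's implementation; the function name `systematic_sample` and its parameters are assumptions made for illustration.

```python
import random


def systematic_sample(ranking, sample_size, rng=None):
    """Pick evenly spaced documents from a ranked list, starting at a random offset.

    ranking: document ids ordered from most to least likely relevant.
    """
    rng = rng or random.Random()
    n = len(ranking)
    if n == 0 or sample_size <= 0:
        return []
    sample_size = min(sample_size, n)
    step = n / sample_size            # spacing between sampled positions
    start = rng.random() * step       # random offset within the first interval
    positions = (int(start + i * step) for i in range(sample_size))
    # Guard against floating-point rounding at the very end of the list.
    return [ranking[min(p, n - 1)] for p in positions]
```

Unlike a simple random sample of the same size, every stretch of the ranking from top to bottom contributes roughly the same number of sampled documents, so the sample cannot bunch up in one part of the list by chance.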

That's our thinking at the least. Thanks again.

By: william Wed, 22 Oct 2014 23:54:04 +0000 In reply to Karl Schieneman.


Hi! I agree that it is desirable that a predictive coding process is able to demonstrate that it has "good coverage of the information space". Think of the information space as a map, over which documents are distributed as points. (This is a reasonable model of the internal representation of many predictive coding systems, though the "map" is in thousands of dimensions.) Then you want to have training examples that provide good coverage of this map, to show that you have considered as much of the information space as possible. Simple random sampling gives you something of this effect, but not as well as you might think: clusters of similar documents will tend to be over-sampled, and even if the documents were evenly distributed through the information space, the random sample would tend to randomly bunch in places. Some forms of active learning are actually fairly good, albeit in an indirect way, at driving example selection away from parts of the space you "know about", and towards parts you "don't know about". Now, intuitively, I agree that relevance selection (or continuous active learning) may tend not to give you good coverage of the information space, though as you say there have not been empirical studies that demonstrate this one way or another (at least that I have seen).

Vendors are becoming aware of the importance of diversification of training examples, and have started to introduce this feature to their products. Two that spring to mind are Relativity, with their "stratified sampling" technique (I think "cluster sampling" would be a better term, though); and Catalyst, with their contextual diversity technology. Note that such methods can also involve random sampling, just not simple random sampling; that is, we focus sampling effort on what we think will be "high value" parts of the collection, without adopting a fully deterministic selection method. In fact, this is akin to your own suggestions about using more intelligent clustering tools: one might cluster the collection, and then sample the clusters, rather than the raw documents.
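To make the "cluster the collection, then sample the clusters" idea concrete, here is a minimal sketch. It assumes the clustering has already been done by some other tool and is supplied as a mapping from document id to cluster id; the function name `sample_per_cluster` is hypothetical, and this is not a description of Relativity's or Catalyst's actual methods.

```python
import random
from collections import defaultdict


def sample_per_cluster(cluster_of, per_cluster=1, rng=None):
    """Draw up to `per_cluster` random training documents from every cluster.

    cluster_of: dict mapping document id -> cluster id (assumed precomputed).
    """
    rng = rng or random.Random()
    groups = defaultdict(list)
    for doc_id, cluster_id in cluster_of.items():
        groups[cluster_id].append(doc_id)

    chosen = []
    for cluster_id in sorted(groups):
        docs = sorted(groups[cluster_id])
        k = min(per_cluster, len(docs))
        # Selection within a cluster is still random, so the method is
        # random sampling, just not simple random sampling.
        chosen.extend(rng.sample(docs, k))
    return chosen
```

With a simple random sample, a cluster holding 90% of the documents would supply roughly 90% of the training examples; here every cluster contributes the same number, which spreads the examples across the map at the cost of no longer mirroring the collection's proportions.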

What I think would be very regrettable, however, is if practitioners became stuck on the idea of simple random sampling as the only defensible example selection method, and then concluded that collection richness must be kept high enough for random selection to be practicable. That would lead, I fear, to aggressive pre-review culling, performed without adequate statistical or other validation, and leading to a bias of "aiming towards what the lawyers already know about" much more serious than might be incurred by non-random selection methods within a predictive coding process run over an unculled (or at least more lightly culled) collection. (Indeed, I wonder whether the falling richness rates we seem to be encountering are due to an increasing, and welcome, reluctance amongst practitioners to perform this sort of arbitrary culling.)

Finally, I whole-heartedly agree that predictive coding technology and process is still far from mature, and that it would be unwise for the community to think otherwise and simply "pick what seems best at the moment". The remedy against this, to my mind, is maintaining objective and rigorous validation practice (most particularly, through a certification sample of production and discard set).

By: Karl Schieneman Wed, 22 Oct 2014 14:21:01 +0000 I am trying hard not to be pulled into this debate and to remain neutral on different workflows. I applaud your effort to look at collections of varying richness, which mimics the real world. Some users of Predictive Coding encounter collections with richness greater than the 0.3% to 1.5% levels that predominate in the SIGIR paper. But speaking as an attorney, we have an obligation to do "discovery". By aiming, which is what seeding does, don't we first lose visibility into the richness of the collection to begin with, which may cause a user to lose sight of the fact that random could be effective on that collection? And secondly, don't we make it much harder to find anything else if we do limited random training AFTER we have pulled the cream, or most of what we know, out of the collection? The problem as I see it is that we are only looking for what we know when we seed, but we don't know what we don't know, and finding that out through investigation or a "reasonable inquiry" is a lawyer's job. Also, I know of no studies which measure the differences in the types of documents found. Everyone is looking at hit counts to see how well we did in Recall when compared to the baseline in the studies. If we find lots and lots of similar documents, we do well in Recall. But as a lay person, it strikes me as intuitive that finding different types of information is a key part of discovery. That approach might give both sides of the litigation comfort that an "honest" and effective search is being done.

I know this is a difficult problem to solve. Low richness may be endemic in some cases today. But I personally have hope that we will figure that out with different approaches such as rolling productions or Early Case Assessment training runs where richness might be higher, different culling approaches such as visual clustering to increase deduplication and eliminate much higher percentages of non-responsive documents, or have a high richness case like we did in Global Aerospace where random training was highly effective and required only a 5,000 document training set to reveal 130,000 documents out of a collection of 2 million records. That was a highly efficient result and it happened in a real case as opposed to a laboratory. So while not a study, as part of the fabric of common law which is where most of the experimenting happens for lawyers, it shows this approach can be a very effective, very efficient, and transparent solution to searching for responsive documents.

I feel like I am writing a blog post here, but after returning last month from a trip to Greenwich, England, I am struck by how similar this debate is to the search for Longitude, contested between astronomers and clock makers, to steer ships across the oceans of the world. It took 50 years to figure out. With only a few participants such as yourself actively studying and publishing in this space, I am fearful that lawyers will blindly follow the first study they read and stop thinking about how it might make more sense to use all the tools at their disposal to experiment and improve their search. Keep up the good work. Personally I am trying to keep an open mind on this topic instead of saying we have the solution right now.

By: gvc Fri, 26 Sep 2014 23:00:27 +0000 Thanks, William, for confirming the results of our SIGIR paper: there is little to recommend random training; and there is little to choose (in terms of hitting a specific target recall, given clairvoyance to pick the best training-set size) between continuous active learning with relevance selection, and simple active learning with uncertainty selection.

We have also confirmed our results on the RCV1-v2 collection, using a single synthetic seed document (derived from the topic description supplied with the collection) and a different learning algorithm.

We prefer to represent the results as gain curves, which you can find on our site. They tell the same story as your curves, but track recall as a function of total effort, instead of cfr80 (80% recall effort) as a function of training-set size. The main curves on our site are chosen (for comparison to our SIGIR paper) to minimize cfr75, but for the purpose of comparison to your results, we also plotted curves chosen to minimize cfr80.
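For readers unfamiliar with the two views, the relationship between a gain curve and an "effort to reach 80% recall" figure can be sketched as follows. This is a simplified illustration for a single review order with known relevance labels; it is not the exact cfr80 computation used in either set of experiments, and the function names are hypothetical.

```python
def gain_curve(labels_in_review_order):
    """Recall after each reviewed document.

    labels_in_review_order: booleans, True where the reviewed document is relevant,
    covering the whole collection in the order it would be reviewed.
    """
    total_relevant = sum(labels_in_review_order)
    if total_relevant == 0:
        return [0.0] * len(labels_in_review_order)
    found = 0
    curve = []
    for is_relevant in labels_in_review_order:
        found += is_relevant
        curve.append(found / total_relevant)
    return curve


def effort_for_recall(labels_in_review_order, target=0.8):
    """Number of documents reviewed when recall first reaches `target`, or None."""
    for effort, recall in enumerate(gain_curve(labels_in_review_order), start=1):
        if recall >= target:
            return effort
    return None
```

The gain curve shows recall at every level of effort, whereas a figure such as `effort_for_recall(labels, 0.8)` reads off a single point from that curve, which is why the two presentations can tell the same story.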

We note that our use of a synthetic seed document appears to overcome the problem you noted with ultra-low prevalence topics, which appears to be an artifact of your use of random seeding, even for the active learning methods. The same problem was overcome in our SIGIR paper by the use of a simplistic "seed query."