Comments for Evaluating E-Discovery: William Webber's E-Discovery Consulting Blog Wed, 04 Mar 2015 16:21:49 +0000

Comment on Off to FTI: see you on the other side by Ethan A Wed, 04 Mar 2015 16:21:49 +0000 FTI's strong gain, and this blog's (and the internet's) loss. Please do consider leaving it up.

Comment on Off to FTI: see you on the other side by Chandler Burgess Mon, 19 Jan 2015 22:33:07 +0000 Congratulations and best of luck. Your blog (and research) has been full of tremendous insight for me and, I hope, for the eDiscovery community at large.

Comment on Why confidence intervals in e-discovery validation? by Introducing “ei-Recall” – A New Gold Standard for Recall Calculations in Legal Search – Part Two | e-Discovery Team ® Mon, 12 Jan 2015 00:34:08 +0000 […] (free version in arXiv), and his many less formal and easier to understand blogs on the topic: Why confidence intervals in e-discovery validation? (12/9/12); Why training and review (partly) break control sets, (10/20/14);  Why 95% +/- 2% […]

Comment on Confidence intervals on recall and eRecall by gvc Wed, 07 Jan 2015 23:18:11 +0000 Yes, definition 2 is incorrect, but I believe it is the one Roitblat currently uses.

Comment on Confidence intervals on recall and eRecall by william Wed, 07 Jan 2015 01:20:53 +0000 Gord,

Hi! The formula I used was Definition 1. Definition 2 is incorrect, isn't it? (Or at least "elusion" and "prevalence" are being used loosely to mean "discard yield" and "collection yield".)


Comment on Confidence intervals on recall and eRecall by gvc Mon, 05 Jan 2015 21:50:11 +0000 Roitblat gives two inconsistent definitions for eRecall. I'm wondering which you used.

Definition 1. In his earlier work (cited above), he uses prevalence to estimate TP+FN (the total number of relevant documents), and he uses elusion to estimate FN (the total number of missed relevant documents). He then plugs these estimates into a contingency table with N (the total number of documents) and D (the total number of discarded documents). If I am not mistaken, the resulting formula is

eRecall = 1 - (elusion/prevalence * D/N)

Definition 2. In his most recent work, he defines

eRecall = 1 - elusion/prevalence
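The numerical difference between the two definitions is easy to see by plugging in the same inputs. A minimal Python sketch of the two formulas as stated above (function and variable names, and the example figures, are my own illustration, not Roitblat's):

```python
def erecall_def1(elusion, prevalence, discarded, total):
    # Definition 1: FN estimated as elusion * discarded,
    # TP + FN estimated as prevalence * total.
    return 1 - (elusion / prevalence) * (discarded / total)

def erecall_def2(elusion, prevalence):
    # Definition 2: ratio of the two rates alone, ignoring set sizes.
    return 1 - elusion / prevalence

# Hypothetical example: 100,000 documents, 80,000 discarded,
# collection prevalence 10%, elusion 2% in the discard pile.
d1 = erecall_def1(0.02, 0.10, 80_000, 100_000)
d2 = erecall_def2(0.02, 0.10)
print(round(d1, 3), round(d2, 3))  # 0.84 0.8
```

Note that the two definitions coincide only when every document is discarded (D = N); otherwise Definition 2 understates recall relative to Definition 1.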

Comment on Why training and review (partly) break control sets by Confidence intervals on recall and eRecall « Evaluating E-Discovery Sun, 04 Jan 2015 10:07:59 +0000 […] William Webber's E-Discovery Consulting Blog « Why training and review (partly) break control sets […]

Comment on Total review cost of training selection methods by Joe Howie Fri, 19 Dec 2014 21:48:55 +0000 Here's a high-level look at the process: Documents are clustered based on visual similarity, not on an analysis of associated text. For example, if a Word document is later saved directly to PDF, and a copy is printed, scanned, and saved as TIF, visual classification will place all three copies in the same cluster whether or not the TIF and PDF copies have any text.
Clustering can be scaled with hardware to process enterprise-size collections.

These clusters are reviewed by the review team, with the largest clusters being reviewed first. Clusters that are clearly irrelevant can be discarded based on a knowledge of what is in the cluster, and documents in clusters that are clearly relevant can all be deemed relevant. Clusters that may contain some but not all relevant documents can be passed on to the review platform, where like documents can be batched to the same reviewers, speeding review and helping to ensure consistency.

The review process can begin as soon as documents have been collected and the clustering has begun. Relevance decisions are persistent and once made are applied to documents that are later added to the cluster.

BeyondRecognition also has "Find" technology that provides search capability with operators based on absolute page coordinates (i.e., within certain range of coordinates on a page) or on relative page coordinates (i.e., one range of coordinates with a specified position relative to another) as a way to further refine decision-making. Visual classification technology also has advantages beyond culling, e.g., redactions can be made based on the page locations of text values, or can be done zonally within clusters to redact even handwriting.

If visual classification has been implemented as a core information governance tool, the organization may also have applied document-type labels to the clusters as a way to assist in applying retention schedules, setting user access permissions, detecting PII, and determining where the content should be stored, and those labels can be used when first gathering or collecting documents. For example, the various clusters that have been labelled as "invoices" will often not be remotely relevant to most litigation and need not be collected.

Tying selection to ongoing information governance minimizes the ongoing e-discovery problem of continually reinventing the wheel. For example, if the files or documents of someone who is routinely provided copies of invoices are included in multiple litigation discovery collections, how many times does the company have to pay to have all those invoices processed and evaluated to determine that they are not relevant?

Comment on Total review cost of training selection methods by gvc Fri, 12 Dec 2014 20:52:00 +0000 Joe,

You and John have commented in practically every edisco blog that your "visual clustering" tool will effectively find relevant material in textual and non-textual documents alike.

It is not clear to me from reading your marketing material what your process is or why it would work. I understand that you identify "glyphs" and do clustering, but that's not enough to code my documents.

In contrast, I believe I understand the role of Catalyst's diversity sampling and Relativity's "stratified" sampling -- both are methods of identifying training documents for a learning algorithm. Perhaps yours is more similar to Relativity's, in which you select members of each cluster for training? Or do you manually review each cluster?

Can you point me to a study that elaborates your process and validates your effectiveness claim?


Comment on Computer science is not real science by Mark Thu, 11 Dec 2014 20:45:15 +0000 VERY well written and some valid points, but I respectfully disagree with the conclusion. As I earned my degree in computer science, I was taught to use the scientific method in problem solving. To say a "problem" isn't "natural", and therefore not real science, seems rather arbitrary.