Corpus characterization in e-discovery

In e-discovery (document retrieval for civil litigation), one side has the documents and the other side proposes the query. This creates an information asymmetry: the requesting side cannot view the corpus to decide what keywords to use and what queries to propose, and opportunities for query iteration are limited, expensive, and liable to be contested.

What is needed are methods by which the requester can require the responder to characterize the corpus, and so guide the requester in constructing queries. These methods need to be reasonable, both in expense and in the chance of revealing sensitive but case-irrelevant information; otherwise the judge will not agree to force the responder to comply with them.

A simple but useful method, proposed by Cross and Kerksiek [1], is that the responder produce a vocabulary list of their corpus, with term frequency information. Careful analysis of such a frequency vocabulary should uncover unusual terminology, jargon, or abbreviations used in the corpus, as well as more prosaic information, such as variant spellings of keywords.
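As a rough illustration of what such a frequency vocabulary might contain, here is a minimal Python sketch. The tokenization rule, the function name `frequency_vocabulary`, and the three toy documents (including the code name "XQ7") are all invented for this example; they are not taken from Cross and Kerksiek, and a real production would need to agree on tokenization, case folding, and the handling of attachments and metadata.

```python
import re
from collections import Counter

# Assumed tokenization: lowercase alphanumeric runs, keeping internal
# hyphens and apostrophes so that variants like "re-priced" stay whole.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")


def frequency_vocabulary(documents):
    """Return (term_freq, doc_freq) Counters for an iterable of strings.

    term_freq counts every occurrence of a term in the corpus;
    doc_freq counts the number of documents containing the term.
    """
    term_freq = Counter()
    doc_freq = Counter()
    for text in documents:
        tokens = TOKEN_RE.findall(text.lower())
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    return term_freq, doc_freq


# Toy corpus, invented for illustration only.
docs = [
    "Please review the XQ7 hedge before the board meeting.",
    "The hedge with XQ7 was re-priced; see the re-pricing memo.",
    "Board meeting moved to Friday.",
]

tf, df = frequency_vocabulary(docs)
for term, count in tf.most_common(5):
    print(f"{term}\t{count}\t{df[term]}")
```

Even on this toy input, an unfamiliar code name ("xq7") and a spelling variant pair ("re-priced", "re-pricing") show up as separate vocabulary entries, which is the kind of signal a requester would be scanning the list for.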

What other corpus characterizations would be useful in informing the requester in their choice of query keywords? And what tools can IR and NLP researchers give lawyers and e-discovery vendors to help them analyze and understand this information?


1. Cross, David D. and Kerksiek, Sanya Sarich, "Using Electronic Search Tools and Search-Methodology Experts in E-Discovery: A Discussion of Recent Case Law and Other Authorities". In Michael D. Berman, Courtney Ingraffia Barton, and Paul W. Grimm (eds.), Managing E-Discovery and ESI: From Pre-Litigation Through Trial, American Bar Association, 2011.

5 Responses to “Corpus characterization in e-discovery”

  1. [...] IREvalEtAl William Webber’s Research Blog « Corpus characterization in e-discovery [...]

  2. Hi William,

    I'd like to propose a tool that might help with this: Netspeak.

    Netspeak is a search engine that helps you find the right words or expressions; a word search engine, so to speak. This is achieved by implementing wildcard search on an index of n-grams that come along with their occurrence frequencies.

    The tool's original use case is to assist writers with searching for commonly used language. For instance, when writing in a foreign language, writers often lack the vocabulary and usage skills of a native speaker and hence make errors. With Netspeak, writers can avoid this to some extent since they can check whether something they intend to write is actually commonly written by native speakers. Moreover, wildcards can be inserted at positions of doubt in a query. This way, Netspeak's users can search for words and expressions others have written, assess how commonly something is written, and hence improve their writing.

    In e-discovery, a tool like Netspeak might help the requester as follows: the requester might ask the responder to produce a list of n-grams up to a certain length n (e.g., n <= 5) along with their occurrence frequencies, and have them indexed by Netspeak. Next, the requester might search the n-grams using wildcard queries to get an idea of the vocabulary commonly used by the responder in certain contexts, which may then be used to come up with meaningful queries.

    What do you think?

    Best,
    Martin

  3. william says:

    First of all, that's a really cool tool! What's the backing corpus for the version on the web? To see what analogies were most common, I searched for "? is like a ?", and the top hit by a long way is "which is like a keyword"; a more prosaic result than I'd been expecting! (My favourite was "life is like a river".)

    N-grams and wildcard patterns based on them are a promising form of corpus characterization. In an e-discovery or similar adversarial situation, you wouldn't be able to back them with surrounding text from the original corpus. The data going "over the wall" from responder to requester would just be the n-grams and their frequency counts. I wonder, how high can the n in n-gram be, and how low can the minimum frequency go, before you start giving away sensitive information?

    Thanks for the suggestion; it raises a lot of interesting ideas.

  4. Thanks, William! The corpus underlying Netspeak is the "Web 1T 5-gram Version 1" or simply the Google n-grams. We're also working on integrating the Google books n-grams.

    If the frequencies of certain n-grams are very low, which is particularly likely at high n, it might be too easy to reconstruct documents. This is certainly worth studying. Another option might be that the requester does not get direct access to the data, but only to the search interface, which might alleviate the problem a little bit.

    What kinds of evaluation corpora do you use? Maybe we can set up a version of Netspeak for study...

  5. william says:

    The most widely used corpus for public e-discovery research is the Enron corpus. There are many versions of the Enron corpus; the easiest to work with is the TREC de-duplicated text-only version, stored and described in the TREC legal track data area. I guess the test would be something like "can searchers come up with better Boolean queries with access to FOO than without", where FOO would be a vocabulary list, a list of n-grams, Netspeak as an interface to a list of n-grams, etc.
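
The n-gram characterization discussed in the comments above could be prototyped along the following lines. This is only a sketch under stated assumptions: the function names `ngram_counts` and `wildcard_search`, the simple tokenizer, and the toy sentences are invented here, and the wildcard rule (a "?" stands for exactly one token) is an assumption of this example rather than a description of how Netspeak is implemented. The `min_freq` cutoff stands in for the open question of how rare an n-gram can be before releasing it gives away sensitive text.

```python
import re
from collections import Counter


def ngram_counts(documents, max_n=5, min_freq=2):
    """Count all n-grams of length 1..max_n, keeping those seen >= min_freq times.

    Dropping rare n-grams is a crude guard against reconstructing
    documents from long, nearly unique token sequences.
    """
    counts = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def wildcard_search(counts, pattern):
    """Return (n-gram, frequency) pairs matching the pattern, most frequent first.

    Assumed semantics: "?" matches exactly one token, any other
    pattern token must match literally.
    """
    query = pattern.lower().split()
    hits = [
        (gram, c)
        for gram, c in counts.items()
        if len(gram) == len(query)
        and all(q == "?" or q == t for q, t in zip(query, gram))
    ]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Toy corpus, invented for illustration only.
docs = [
    "The deal is like a gift",
    "Life is like a river",
    "Life is like a river they said",
    "The memo is like a gift",
]

released = ngram_counts(docs, max_n=5, min_freq=2)
for gram, freq in wildcard_search(released, "? is like a ?"):
    print(" ".join(gram), freq)
```

Lowering `min_freq` to 1 on the same input also releases the one-off 5-grams, which is exactly the trade-off between usefulness to the requester and exposure for the responder that would need to be studied on a corpus such as Enron.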
