In e-discovery (document retrieval for civil litigation), one side holds the documents and the other side proposes the query. This creates an information asymmetry: the requesting side cannot view the corpus to decide which keywords to use and which queries to propose, and opportunities for query iteration are limited, expensive, and liable to be contested.
What is needed are methods by which the requester can require the responder to characterize the corpus, so as to guide the requester in constructing queries. These methods must be reasonable, both in expense and in the risk of revealing sensitive but case-irrelevant information; otherwise, the judge will not agree to compel the responder to comply with them.
A simple but useful method, proposed by Cross and Kerksiek [1], is for the responder to produce a vocabulary list of their corpus, with term frequency information. Careful analysis of such a frequency vocabulary should uncover unusual terminology, jargon, or abbreviations used in the corpus, as well as more prosaic information, such as variant spellings of keywords.
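To make the idea of a frequency vocabulary concrete, here is a minimal Python sketch of how a responder might compute one. It is an illustration under stated assumptions, not the procedure described by Cross and Kerksiek: the function name `frequency_vocabulary`, the tokenization rule, and the choice to report both total occurrences and the number of documents containing each term are all hypothetical.

```python
import re
from collections import Counter

# Assumed tokenization: lowercase runs of ASCII letters and digits, allowing
# internal apostrophes and hyphens. A real corpus would need rules agreed
# between the parties (and handling for non-ASCII text).
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")


def frequency_vocabulary(documents):
    """Return (term, collection_frequency, document_frequency) rows.

    Rows are sorted by descending collection frequency, then alphabetically.
    """
    collection_freq = Counter()  # total occurrences of each term
    document_freq = Counter()    # number of documents containing each term
    for text in documents:
        tokens = TOKEN_RE.findall(text.lower())
        collection_freq.update(tokens)
        document_freq.update(set(tokens))
    return sorted(
        ((term, collection_freq[term], document_freq[term])
         for term in collection_freq),
        key=lambda row: (-row[1], row[0]),
    )


if __name__ == "__main__":
    # Toy corpus, invented purely for illustration.
    corpus = [
        "The widget shipment was delayed.",
        "Widget colour: grey. Widget color: gray?",
        "Re: wdgt shipment",
    ]
    for term, cf, df in frequency_vocabulary(corpus):
        print(f"{term}\t{cf}\t{df}")
```

Even on this toy corpus the list surfaces the kinds of clues mentioned above: an abbreviation ("wdgt") and variant spellings ("colour"/"color", "grey"/"gray") appear as separate low-frequency entries that a requester could fold into a query.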
What other corpus characterizations would be useful in informing the requester in their choice of query keywords? And what tools can IR and NLP researchers give lawyers and e-discovery vendors to help them analyze and understand this information?
1. Cross, David D. and Kerksiek, Sanya Sarich, "Using Electronic Search Tools and Search-Methodology Experts in E-Discovery: A Discussion of Recent Case Law and Other Authorities". In Michael D. Berman, Courtney Ingraffia Barton, and Paul W. Grimm (eds.), Managing E-Discovery and ESI: From Pre-Litigation Through Trial, American Bar Association, 2011.