Dr. Dave Lewis is visiting us in Melbourne on a short sabbatical, and yesterday he gave an interesting talk at RMIT University on research topics in e-discovery. We also had Dr. Paul Hunter, Principal Research Scientist at FTI Consulting, in the audience, as well as research academics from RMIT and the University of Melbourne, including Professor Mark Sanderson and Professor Tim Baldwin. The discussion amongst attendees was almost as interesting as the talk itself, and a number of suggestions for fruitful research were raised, many with fairly direct relevance to application development. I thought I'd capture some of these topics here.
1. Classification across heterogeneous document types
Most work on text classification to date has been performed upon homogeneous document collections such as RCV1v2---that is, collections in which all of the documents are of much the same type, format, and length (in the case of RCV1v2, all news reports). In e-discovery, however, we're faced with a very heterogeneous set of documents: emails, reports, spreadsheets, presentations, pictures, and more. And each of these types can be divided into subtypes: personal versus business emails; contracts versus business reports; numerical spreadsheets versus tabular arrangements of predominantly textual data; and so forth. There are also differences of format that cross these type boundaries: OCR'ed versus native-format versus plain text, for instance. And along with differences in type come differences in other document features: 100-page reports mixed in with 100-character emails; Chinese documents alongside English ones. Some coarse-grained separation of types is often performed manually, frequently based upon eliminating document types that text classifiers are believed to perform poorly on (images are generally pulled out for manual review, and spreadsheets may join them). Even after this culling, however, a heterogeneous mix of document types remains to be fed into the text classifier. We need more research on the use and effectiveness of classification technology in these environments. Can the documents be classified effectively, and without bias against certain document types, by treating them as if they were an undifferentiated set? Do our existing, rather ad-hoc length normalization techniques work when faced with such enormous differences in document length? Do we need to build separate classifiers for each document type---or, since that seems both wasteful and subject to categorization errors, can a committee of classifiers share some training data while specializing in different types?
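To make the committee idea concrete, here is a minimal sketch of the dispatch logic: per-type classifiers where available, with a shared generic classifier as fallback. The keyword scorers are toy stand-ins for real trained classifiers, and all document types, vocabularies, and thresholds are invented for illustration.

```python
# Sketch: a committee of per-type classifiers with a shared fallback.
# The keyword scorers are stand-ins for real trained classifiers;
# types, vocabularies, and the threshold are illustrative assumptions.

def make_scorer(keywords):
    """Return a toy classifier: fraction of responsive keywords present."""
    def score(text):
        words = set(text.lower().split())
        return len(words & keywords) / len(keywords)
    return score

# Specialized classifiers per document type, plus a generic fallback
# for types we have not specialized (or cannot identify).
classifiers = {
    "email": make_scorer({"meeting", "contract", "acquisition"}),
    "report": make_scorer({"inspection", "incident", "violation"}),
}
fallback = make_scorer({"contract", "acquisition", "inspection"})

def classify(doc_type, text, threshold=0.3):
    scorer = classifiers.get(doc_type, fallback)
    return "responsive" if scorer(text) >= threshold else "non-responsive"

print(classify("email", "Re: meeting about the acquisition"))
print(classify("spreadsheet", "holiday photos attached"))
```

In a real system the per-type models would share some training signal (for instance, through shared features or a hierarchical prior) rather than being fully independent; the open research question is how much sharing helps and how much specialization is needed.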
2. Automatic detection of document types
To some extent, document types can be detected by file extensions or other signature data. But, as mentioned before, there are still subtypes within each of these file types---and subtypes that cross file type boundaries (contracts, for instance, could be in Word format, or PDF, or OCR'ed from images). There is also a certain degree of commonality between document types across different organizations (most corporate email repositories contain both business and personal emails), though there will also be categories that only occur in certain collections (source code for software firms; CAD drawings for engineering firms), and the degree of granularity will vary between firms and cases ("contracts" may be a sufficient category for a baked goods manufacturer, but different types of contracts should probably be differentiated for a law firm). Identifying these document types can help with early case analysis; with culling and collection decisions; and with classification itself (particularly, of course, if we find we need to specialize our classifiers to different document types).
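A two-stage detector is one plausible shape for this: file signatures or extensions give a coarse type, then simple content heuristics assign subtypes that cross format boundaries. The extension map and the heuristics below are purely illustrative assumptions, not a proposal for production rules.

```python
# Sketch: two-stage document type detection. Extensions give a coarse
# type; content heuristics then assign a cross-format subtype (a
# contract may be Word, PDF, or OCR'ed text). All rules are toy examples.

COARSE_TYPES = {".doc": "word", ".docx": "word", ".pdf": "pdf",
                ".xls": "spreadsheet", ".eml": "email"}

def coarse_type(filename):
    dot = filename.rfind(".")
    ext = filename[dot:].lower() if dot != -1 else ""
    return COARSE_TYPES.get(ext, "unknown")

def subtype(text):
    """Assign a subtype from content, regardless of file format."""
    lowered = text.lower()
    if "whereas" in lowered and "hereinafter" in lowered:
        return "contract"
    if "dear" in lowered or "regards" in lowered:
        return "correspondence"
    return "other"

doc = ("agreement.pdf",
       "WHEREAS the parties, hereinafter the Vendors, agree ...")
print(coarse_type(doc[0]), subtype(doc[1]))
```

In practice the second stage would itself be a learned classifier, and the granularity of subtypes would be tuned per collection and per case, as discussed above.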
3. Faceted categorization
Most text categorization (supervised or unsupervised) technology assumes that categories are all of the one type, or at most are hierarchical. But in e-discovery, as we've seen above, at least two different types of categorization are of interest: by content, and by type of document. Conflating these two category types can lead to unhelpful confusion in outputs: document clustering schemes, for instance, in which one cluster represents "on-site inspection reports", and another represents "documents that have been OCR'ed" (with OCR'ed inspection reports possibly fitting into the former, possibly into the latter category). A simple solution to this issue would be to perform two categorizations, one by content, the other by type, each presumably with different feature sets. A fuller solution, however, would be able to factor out differences correlated with one facet when making the categorization in the other (factor out differences caused by OCR'ing, for instance, when trying to identify all the on-site inspection reports).
4. Label propagation across related documents
Text classification and categorization work generally assumes that each document is an entirely distinct entity; that the only relations to be found between documents are those identified by the algorithm itself. In e-discovery, however, there are many different types of connection between documents, including duplicates and near-duplicates, document families, and threads. Different cases will have different formal rules about label propagation (for instance, that if any document in a family is responsive, then the whole family is responsive). These associations also need to be taken into account in designing classification algorithms. When learning from or predicting the label of an email, for instance, how should we account for the other emails in the thread? How should we treat quoted reply text---or the absence of such reply text? Dave Lewis points out that there has been work on collective classification (for instance, using conditional random field models), but this is yet to be applied to e-discovery.
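The formal propagation rules, at least, are straightforward to state in code. Here is a minimal sketch of the family rule mentioned above (if any document in a family is responsive, the whole family is responsive); the document IDs, labels, and family groupings are invented for illustration.

```python
# Sketch of a formal label propagation rule: if any document in a
# family is responsive, the whole family is treated as responsive.
# Document IDs, labels, and families are illustrative.

def propagate_family_labels(labels, families):
    """labels: doc_id -> 'responsive' / 'non-responsive' (classifier output).
    families: family_id -> list of member doc_ids.
    Returns the labels after family-level propagation."""
    out = dict(labels)
    for members in families.values():
        if any(labels.get(d) == "responsive" for d in members):
            for d in members:
                out[d] = "responsive"
    return out

labels = {"msg1": "non-responsive", "att1": "responsive",
          "msg2": "non-responsive"}
families = {"f1": ["msg1", "att1"], "f2": ["msg2"]}
print(propagate_family_labels(labels, families))
# msg1 inherits 'responsive' from its attachment; msg2 is untouched
```

The harder, open problem is the one in the paragraph above: exploiting these associations inside the learner itself (as collective classification does), rather than as a post-processing step.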
5. Identifying unclassifiable documents
In binary classification systems, there are (by definition) only two categories: for e-discovery, typically "responsive" or "non-responsive". For predictive rankers, documents are assigned degrees of membership to these classes, but the ranking is still along the real number line, for different degrees of responsiveness or non-responsiveness. In e-discovery, though, we really require a third category, of uncategorizable: that is, of documents that the classification system not only can't categorize well now (as detected in active learning), but likely will never be able to categorize adequately. Such documents should then be removed from the classification system and sent for manual review (or perhaps routed to a more specialized classification sub-system). As mentioned before, this distinction is often made in a coarse, manual way at collection time by file type (images don't go to predictive coding, etc.). But many difficult-to-categorize documents still get through, and in any case it is preferable to have automated fail-safes in case manual processes fail.
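One simple heuristic in this direction: track each document's classifier score across training rounds, and flag documents that stay near the decision boundary despite further training as candidates for removal to manual review. This is only a sketch of the idea; the score histories, margin, and round counts are illustrative assumptions.

```python
# Sketch: flag documents whose classifier scores remain near the 0.5
# decision boundary across training rounds as likely unclassifiable.
# Score histories, margin, and min_rounds are illustrative assumptions.

def is_unclassifiable(score_history, margin=0.1, min_rounds=3):
    """True if the last min_rounds scores all sit within `margin` of the
    decision boundary, i.e. the classifier stayed uncertain despite
    additional training."""
    recent = score_history[-min_rounds:]
    return (len(recent) >= min_rounds and
            all(abs(s - 0.5) <= margin for s in recent))

histories = {
    "ocr_scan_17": [0.48, 0.52, 0.49, 0.51],  # persistently uncertain
    "email_203":   [0.45, 0.70, 0.88, 0.93],  # resolved with training
}
for doc, hist in histories.items():
    route = "manual review" if is_unclassifiable(hist) else "predictive coding"
    print(doc, "->", route)
```

Note this differs from plain active-learning uncertainty sampling: the point is not that the document is informative to label now, but that its score has failed to move, suggesting the classifier may never handle it adequately.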
6. Identifying poor training examples
Some documents make poor training examples. Many systems currently require humans to identify which documents make bad examples, but that is a brittle process, as humans cannot always tell whether a document is a bad example. Identifying documents that are not helpful in training the classifier is likely related to identifying documents for which the classifier's predictions of responsiveness will themselves be unhelpful, though the relationship may be complex: a labelled training example may seem to provide evidence to classify a set of similar documents, but this evidence may be misleading.
7. Identifying significant fragments in non-significant text
Sometimes, only a section of a document is significant (in particular, responsive), while the rest is unimportant or boilerplate. This distinction is sometimes made at training time, when the trainer is asked to highlight the section of a document that is highly responsive (which, as with identifying poor training examples, is a fraught process); but the distinction is also relevant at prediction time. An important sub-category of this problem is identifying repetitive boilerplate in a set of documents and dealing with it appropriately at classification time. There may be a standard format for contracts, for instance, in which only a small number of conditions are varied; or the organization may have standardized reports, where directions and field names are fixed, and only field values vary. If one version of (for instance) a contract is labelled as non-responsive, then a naive classifier is likely to regard all other versions of the contract as non-responsive, due to the weight of common text; but it may be the differing text that is crucial (perhaps the first contract was with an irrelevant company, but another is with a company related to the dispute). It is not sufficient, however, just to ignore the common text altogether, because then the human trainer is going to end up seeing the contract again and again, with no way of indicating that "no contract of this type is responsive". Near-duplicate systems commonly detect this condition at review time, and the better systems will ask the one reviewer to review all such similar documents at once, with the differing text highlighted. Possibly a similar system needs to be implemented during training.
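As a small illustration of isolating the differing text, the standard library's `difflib` can align a document against a known boilerplate template and return only the words the document adds or changes, which could then be weighted by the classifier or highlighted for the reviewer. The template and contract text below are invented; this is a sketch of the idea, not of a production near-duplicate system.

```python
import difflib

# Sketch: isolate the text that differs between a document and a known
# boilerplate template, so the variable portion can be weighted or
# highlighted rather than drowned out by the shared text.
# Template and contract strings are illustrative.

def varying_text(template, document):
    """Return only the words the document adds or changes
    relative to the template."""
    doc_words = document.split()
    matcher = difflib.SequenceMatcher(None, template.split(), doc_words)
    added = []
    for op, _, _, j1, j2 in matcher.get_opcodes():
        if op in ("replace", "insert"):
            added.extend(doc_words[j1:j2])
    return " ".join(added)

template = "This supply contract is made between VENDOR and the Company"
contract = "This supply contract is made between Enron and the Company"
print(varying_text(template, contract))  # only the varying party name
```

A classifier trained on just the varying portions would not let the weight of common text swamp the crucial difference (the counterparty's name, in this toy example), while the full template could still carry a single "no contract of this type is responsive" judgement.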
8. Routing of documents to specialized trainers
It is common to route documents by type to specific reviewers, as well as to cluster documents prior to review so that the one reviewer is reviewing similar documents at the one time. A similar routing could be performed on training documents, if there were multiple trainers. For instance, if we were able to detect more difficult documents, these could be routed to more senior experts. Such a routing could work particularly well in the continuous learning framework advocated by Gord Cormack and Maura Grossman and implemented by Catalyst, in which there is no separate training step, but all reviewed documents are used for training, and the trained classifier is used to prioritize review. An objection to the continuous learning approach is that junior reviewers make too many mistakes to be reliable trainers; if there were a method for routing difficult documents to more senior reviewers, that objection might be allayed.
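The routing rule itself could be as simple as a band around the classifier's decision boundary: borderline documents go to senior reviewers, confidently scored ones to junior reviewers. A minimal sketch, with invented scores and an arbitrary margin:

```python
# Sketch: route documents in a continuous learning loop by estimated
# difficulty. Documents scored near the decision boundary go to senior
# reviewers; confident predictions go to junior reviewers.
# The margin and the score values are illustrative assumptions.

def route_reviewer(score, margin=0.15):
    """Route by distance from the 0.5 decision boundary."""
    return "senior" if abs(score - 0.5) <= margin else "junior"

queue = {"doc_a": 0.93, "doc_b": 0.52, "doc_c": 0.08, "doc_d": 0.41}
assignments = {doc: route_reviewer(s) for doc, s in queue.items()}
print(assignments)
```

A fuller treatment would use a proper difficulty estimate rather than raw score distance (the previous topics on unclassifiable documents and poor training examples are relevant here), but even this crude banding would keep the noisiest labels away from the borderline region where they do the most damage to training.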
9. Total cost of annotation
Existing work on the cost of text classification is limited, and is mostly focused on the tradeoff between number of training annotations and rate of increase in the learning curve. In e-discovery, however, the cost formula is more complex and more interesting. To begin with, we are building a classifier for a fixed, known population, not for general use on an unknown future population (in technical terms, we are building a transductive, not an inductive, classifier). Moreover, training and prediction sets are the same; each document annotated in training is one fewer document whose responsiveness must be predicted (a condition that Dave Lewis refers to as "finite population annotation"). In addition, in most e-discovery practice, no document is produced without human review, so a document trained is a document that does not have to be reviewed. For these reasons, building the optimal classifier may not be the optimal solution to the e-discovery problem---something that the advocates of continuous learning assert. And to make the equation even more interesting, our annotation budget needs to be split between training and testing: allocating a greater proportion to training will increase the actual effectiveness of our classifier, but decrease our ability to accurately measure that effectiveness; conversely, allocating a greater proportion to testing will give us a more accurate measure of a less accurate classifier.
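The training/testing side of the tradeoff can be quantified with elementary sampling theory: under the normal approximation, the 95% confidence interval on a measured proportion (say, recall) has half-width about 1.96 * sqrt(p(1-p)/n) for a test sample of size n. The budget figure and the assumed recall of 0.75 below are illustrative.

```python
import math

# Sketch of the training/testing budget tradeoff: a larger test sample
# narrows the confidence interval on measured effectiveness, but leaves
# fewer annotations for training. Budget and assumed recall (0.75)
# are illustrative; the CI uses the 95% normal approximation.

def ci_half_width(p, n):
    """Half-width of the 95% normal-approximation CI for proportion p
    measured on a sample of size n."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

budget = 5000  # total annotation budget (illustrative)
for test_n in (500, 1000, 2000):
    train_n = budget - test_n
    hw = ci_half_width(0.75, test_n)
    print(f"train={train_n}, test={test_n}, "
          f"measured recall = 0.75 +/- {hw:.3f}")
```

Quadrupling the test sample only halves the interval width, so precision of measurement is bought at a steep price in lost training (and, under finite population annotation, lost review) annotations; this is exactly the "more accurate measure of a less accurate classifier" tension described above.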
Please do let me know if you know of any existing predictive coding systems that offer solutions to these problems; or if there are other important research topics in e-discovery that I have missed.
Thanks to Dave Lewis and Paul Hunter for their comments on a draft version of this post.