In my previous post, I found that relevance and uncertainty selection needed similar numbers of document relevance assessments to achieve a given level of recall. I summarized this by saying the two methods had similar cost. The number of documents assessed, however, is only a very approximate measure of the cost of a review process, and richer cost models might lead to a different conclusion.
One distinction that is sometimes made is between the cost of assessing a document during training and the cost of assessing it during review. It is often assumed that training is performed by a subject-matter expert (SME), whereas review is done by more junior reviewers. The subject-matter expert costs more than the junior reviewers---let's say, five times as much. Therefore, assessing a document for relevance during training will cost more than doing so during review.
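To make the arithmetic concrete, here is a minimal sketch of this differentiated cost model in Python. The 5x/1x unit costs follow the assumption above; the document counts are invented for illustration, not figures from the experiments.

```python
# Differentiated cost model: training assessments are made by the SME,
# review assessments by junior reviewers.
SME_COST = 5.0     # relative cost of one SME (training) assessment
JUNIOR_COST = 1.0  # relative cost of one junior (review) assessment

def differentiated_cost(n_training, n_review):
    """Total assessment cost under the differentiated cost model."""
    return n_training * SME_COST + n_review * JUNIOR_COST

# Relevance selection shifts effort into (expensive) training...
print(differentiated_cost(n_training=2000, n_review=8000))  # -> 18000.0
# ...while uncertainty selection leaves more for (cheap) review.
print(differentiated_cost(n_training=500, n_review=9500))   # -> 12000.0
```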
With this differentiated cost model, we get the following average assessment costs (see my previous post for the interpretation of this table):
| Richness | | | | |
|---|---|---|---|---|
| 0.05% -- 0.5% | 211.56 | 50.72 | 52.14 | 61.69 |
| 0.5% -- 2% | 23.65 | 7.34 | 9.64 | 11.68 |
| 2% -- 5% | 5.04 | 3.76 | 6.55 | 8.50 |
| 5% -- 15% | 2.22 | 1.88 | 4.34 | 6.54 |
As one might expect, relevance selection (which aims to do much or all of the review effort during training) becomes more expensive than uncertainty selection. Moreover, whereas with a uniform cost model relevance selection was almost always best done in continuous mode (that is, all review is done by the trainer), it is frequently better with differentiated costs to stop relevance selection early and leave the tail of the documents for review.
This differentiated cost model assumes that (junior) reviewer decisions are final. In practice, however, that seems unlikely: if we trusted the cheaper junior reviewers so much, why not have them do the training as well? That suggests a third, QC step needs to be applied prior to production, giving us a three-step process of training, review, and production (or T-R-P). A simple form of pre-production QC would be that documents marked relevant by the junior reviewer are then sent for second-pass review by a more senior reviewer---perhaps the SME himself. In that case, we have three costs:
- Training documents: viewed once by SME, 5x cost
- Irrelevant reviewed documents: viewed once by junior reviewer, 1x cost
- Relevant reviewed documents: viewed first by junior reviewer, then by SME, 6x cost
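These three per-document costs combine into a total as in the following sketch (same 5x/1x multipliers as before; the document counts are again invented for illustration):

```python
def trp_cost(n_training, n_irrelevant_reviewed, n_relevant_reviewed,
             sme_cost=5.0, junior_cost=1.0):
    """Total cost under the T-R-P protocol: training docs are seen once
    by the SME; reviewed docs are seen once by a junior reviewer, and
    those judged relevant are then re-checked by the SME."""
    return (n_training * sme_cost
            + n_irrelevant_reviewed * junior_cost
            + n_relevant_reviewed * (junior_cost + sme_cost))

# A relevant document costs 6x if found in review (junior pass + SME
# pass) but only 5x if handled during training.
print(trp_cost(n_training=1000, n_irrelevant_reviewed=3000,
               n_relevant_reviewed=6000))  # -> 44000.0
```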
Under this protocol, it is cheaper to handle a relevant document in training (5x) than in review (6x), assuming that judgments made during training do not have to be checked before production. With this cost model, the total review costs of the different selection methods are as follows:
| Richness | | | | |
|---|---|---|---|---|
| 0.05% -- 0.5% | 216.33 | 51.97 | 52.52 | 61.64 |
| 0.5% -- 2% | 28.54 | 10.47 | 10.35 | 11.63 |
| 2% -- 5% | 9.99 | 7.95 | 7.65 | 8.44 |
| 5% -- 15% | 7.19 | 6.66 | 6.16 | 6.48 |
Now, relevance selection is on average slightly cheaper than uncertainty selection, because there is less double-handling of responsive documents.
In the three-step, T-R-P protocol, the goal of the R step (of first-pass review) is to filter out non-responsive documents; that is, to increase precision. For the experiments described above, however, responsive documents make up the bulk of the assessment load. For topics with richness above 0.5%, average precision across the assessment process is around 65%. That seems to be substantially higher than what many vendors report for practical e-discovery cases. Perhaps the RCV1v2 collection is unrepresentatively "easy" as a text classification target. In these conditions, the R step has reduced value, as there are few irrelevant documents to filter out. If precision were lower---that is, if the number of non-responsive documents seen in assessment were higher---then the relative costs would be likely to change.
As noted above, however, the junior reviewers used in the R step will make errors. (The SME will make errors, too, but hopefully at a lower rate, and with more authority.) In particular, they will make errors of two types: false positives (judging irrelevant documents as relevant); and false negatives (judging relevant documents as irrelevant). The P-step approach described above, of only considering judged-relevant documents in second-pass review, will catch the false positives, but will miss the false negatives, lowering true recall. In practice, one would want to perform some form of QC of the judged-irrelevant documents, too; conversely, some proportion of the judged-relevant documents might be produced without QC.
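To illustrate the effect of uncaught false negatives, here is a sketch with invented (not measured) error rates: if QC covers only the judged-relevant documents, a junior false negative passes through unchecked, and end-to-end recall shrinks accordingly.

```python
def effective_recall(retrieval_recall, junior_fn_rate):
    """Recall after first-pass review, when second-pass QC sees only
    judged-relevant documents: a relevant doc survives to production
    only if it is retrieved AND not wrongly marked irrelevant by the
    junior reviewer."""
    return retrieval_recall * (1.0 - junior_fn_rate)

# E.g. 80% retrieval recall with a 10% junior false-negative rate:
print(round(effective_recall(0.80, 0.10), 2))  # -> 0.72
```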
The cost models presented in this post are simplistic, and other protocols are possible. Nevertheless, these results emphasize two points. First, conclusions about total assessment cost depend upon your cost model. But second, your cost model depends upon the details of the protocol you use (and other statistics of the assessment effort). Modeller beware!
Thanks to Rachi Messing, Jeremy Pickens, Gord Cormack, and Maura Grossman for discussions that helped inform this post. Naturally, the opinions expressed here are mine, not theirs.