A technology-assisted review (TAR) process frequently begins with the creation of a control set---a set of documents randomly sampled from the collection, and coded by a human expert for relevance. The control set can then be used to estimate the richness (proportion relevant) of the collection, and also to gauge the effectiveness of a predictive coding (PC) system as training is undertaken. We might also want to use the control set to estimate the completeness of the TAR process as a whole. However, we may run into problems if we attempt to do so.
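As a minimal sketch of the richness estimate described above, the following computes the proportion relevant in a control set together with a Wilson score interval. The function name `estimate_richness`, its parameters, and the choice of the Wilson interval are illustrative assumptions rather than anything prescribed here; any standard binomial interval could be substituted.

```python
import math


def estimate_richness(n_control, n_relevant, z=1.96):
    """Return (point estimate, lower bound, upper bound) for collection richness.

    Assumes the control set is a simple random sample of the collection.
    The interval is a Wilson score interval; z=1.96 gives roughly 95% confidence.
    """
    if n_control <= 0 or not 0 <= n_relevant <= n_control:
        raise ValueError("need n_control > 0 and 0 <= n_relevant <= n_control")
    p = n_relevant / n_control
    denom = 1 + z ** 2 / n_control
    centre = (p + z ** 2 / (2 * n_control)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / n_control + z ** 2 / (4 * n_control ** 2)
    ) / denom
    return p, max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical control set: 1,500 documents, of which 150 were coded relevant.
    point, low, high = estimate_richness(1500, 150)
    print(f"richness {point:.3f}, interval [{low:.3f}, {high:.3f}]")
```

The estimate is only as good as the assumption that the control set is a random sample of the population it is applied to.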
The reason the control set can be used to estimate the effectiveness of the PC system on the collection is that it is a random sample of that collection. As training proceeds, however, the relevance of some of the documents in the collection will become known through human assessment---even more so if review begins before training is complete (as is often the case). Direct measures of process effectiveness on the control set will fail to take account of the relevant and irrelevant documents already found through human assessment.
A naïve solution to this problem is to exclude the already-reviewed documents from the collection; to use the control set to estimate effectiveness only on the remaining documents (the remnant); and then to combine estimated remnant effectiveness with what has been found by manual means. This approach, however, is incorrect: as documents are non-randomly removed from the collection, the control set ceases to be randomly representative of the remnant. In particular, if training (through active learning) or review is prioritized towards easily-found relevant documents, then easily-found relevant documents will become rare in the remnant; the control set will overstate effectiveness on the remnant, and hence will overstate the recall of the TAR process overall. (If training has been performed purely by random sampling, though, and no other review has been undertaken, then this approach is roughly accurate.)
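The overstatement can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the synthetic score distributions, the collection and sample sizes, the function name `simulate_naive_estimate`, and the simplification that the control set is held apart from the reviewed pool. To isolate the bias, the naïve estimator is even given the true number of relevant documents in the remnant.

```python
import random


def simulate_naive_estimate(prioritized=True, n_docs=20000, richness=0.10,
                            n_control=2000, n_reviewed=1500, cutoff=0.5, seed=0):
    """Compare control-set recall at a score cutoff with true recall on the remnant."""
    rng = random.Random(seed)

    # Synthetic documents as (score, is_relevant); relevant ones tend to score higher.
    docs = []
    for _ in range(n_docs):
        relevant = rng.random() < richness
        score = rng.betavariate(4, 2) if relevant else rng.betavariate(2, 4)
        docs.append((score, relevant))

    # Control set: a simple random sample, never selected for review.
    rng.shuffle(docs)
    control, pool = docs[:n_control], docs[n_control:]

    # Prioritized review takes the highest-scoring documents first;
    # otherwise the (already shuffled) pool is reviewed in random order.
    if prioritized:
        pool.sort(key=lambda d: d[0], reverse=True)
    reviewed, remnant = pool[:n_reviewed], pool[n_reviewed:]

    found = sum(1 for _, rel in reviewed if rel)
    remnant_rel = [score for score, rel in remnant if rel]
    control_rel = [score for score, rel in control if rel]

    def recall_at_cutoff(scores):
        return sum(1 for s in scores if s >= cutoff) / len(scores)

    control_recall = recall_at_cutoff(control_rel)   # what the control set reports
    remnant_recall = recall_at_cutoff(remnant_rel)   # what actually holds on the remnant

    total_rel = found + len(remnant_rel)
    return {
        "control_recall": control_recall,
        "remnant_recall": remnant_recall,
        "naive_overall_recall": (found + control_recall * len(remnant_rel)) / total_rel,
        "true_overall_recall": (found + remnant_recall * len(remnant_rel)) / total_rel,
    }


if __name__ == "__main__":
    for flag in (True, False):
        print("prioritized" if flag else "random", simulate_naive_estimate(prioritized=flag))
```

Under these assumptions, the gap between control-set recall and remnant recall is large when review is prioritized by score and small when the reviewed documents are chosen at random, which matches the parenthetical caveat above.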
To restore the representativeness of the control set, we would need to remove from it the documents equivalent to those that have been removed from the collection by review. There is, however, no general way of doing this: we cannot in general say which of the control set documents would have been reviewed, had they been part of the collection. There are, it is true, particular cases in which this removal rule can be determined. For instance, if a set of documents were selected for training or review by a keyword query, then the documents matching that query should also be excluded from the control set. Even if all document selections were made by similar rules, however, keeping track of all these rules over the course of a TAR process quickly becomes infeasible.
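The keyword case can be sketched as applying one selection rule to both the collection and the control set. The names `matches_query` and `split_by_rule`, the whole-word matching rule, and the sample documents are all hypothetical; a real review platform's query semantics would have to be reproduced exactly for the exclusion to be valid.

```python
import re


def matches_query(text, keywords):
    """Hypothetical rule: true if any keyword occurs as a whole word, ignoring case."""
    words = set(re.findall(r"\w+", text.lower()))
    return any(keyword.lower() in words for keyword in keywords)


def split_by_rule(docs, keywords):
    """Split {doc_id: text} into (ids selected by the rule, ids left over)."""
    selected = {doc_id for doc_id, text in docs.items() if matches_query(text, keywords)}
    return selected, set(docs) - selected


if __name__ == "__main__":
    keywords = ["merger"]
    collection = {"c1": "Merger agreement draft", "c2": "Lunch plans"}
    control = {"k1": "Re: merger timeline", "k2": "Holiday schedule"}

    # The same rule decides what goes to review and what leaves the control set.
    sent_to_review, remnant = split_by_rule(collection, keywords)
    excluded_control, remaining_control = split_by_rule(control, keywords)
    print(sent_to_review, remnant, excluded_control, remaining_control)
```

This only works because the selection rule is explicit and repeatable; selections made by active learning or reviewer judgment have no such rule to replay.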
One case in which control set exclusion can be performed is prospectively, based upon a review cutoff decision. If we undertake to review all documents with a certain relevance score or higher (and if all the documents already reviewed are above that relevance score), then the part of the control set with relevance scores below this threshold could be used to estimate the number of relevant documents that would be excluded from the review. We cannot, however, form a statistically valid estimate of the number of relevant documents above this threshold, unless we ignore the evidence of the above-cutoff documents that have already been reviewed. (To see this, consider that the estimated number of relevant documents above the threshold might be less than the number we had already seen in review.) We also have no principled way of handling any reviewed documents that happen to fall below the threshold, except to make worst-case assumptions (in particular, to exclude such documents in estimating recall).
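A minimal sketch of the below-cutoff estimate follows. It assumes the control set is a simple random sample of a collection of known size, that each control document carries a relevance score and a human relevance coding, and that no reviewed documents fall below the cutoff. The function name `estimate_relevant_below_cutoff` and the sample data are hypothetical.

```python
def estimate_relevant_below_cutoff(control, collection_size, cutoff):
    """Estimate how many relevant documents in the collection score below the cutoff.

    control is a list of (score, is_relevant) pairs for the control set.
    The proportion of control documents that are both relevant and below the
    cutoff is scaled up to the size of the collection.
    """
    if not control:
        raise ValueError("control set must not be empty")
    relevant_below = sum(1 for score, relevant in control if relevant and score < cutoff)
    return relevant_below / len(control) * collection_size


if __name__ == "__main__":
    # Hypothetical four-document control set and a collection of 1,000 documents.
    control = [(0.9, True), (0.2, True), (0.1, False), (0.8, False)]
    print(estimate_relevant_below_cutoff(control, collection_size=1000, cutoff=0.5))
```

Applying the same scaling to the control documents above the cutoff is what the paragraph above warns against, since the result can be smaller than the number of relevant documents already found in review.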
The above points do not mean that control sets are not useful in guiding decisions about a TAR process, including the important decisions of when to stop training and how much further review might be required. However, care must be taken in deriving guidance from control sets in the presence of training and review, and even then the resulting estimates should be recognized as not being statistically valid ones (that is, not strictly justified by sampling theory). In particular, practitioners should be wary about the use of control sets to certify the completeness of a production. Quite apart from the representativeness problems described above, repeated testing against the one control set introduces a sequential testing bias, and control set relevance judgments are made in the relative ignorance of the beginning of the TAR process. A separate certification sample should be preferred for making final assessments of production completeness.