Comments on: Why training and review (partly) break control sets
William Webber's E-Discovery Consulting Blog
http://blog.codalism.com/index.php/why-training-and-review-partly-break-control-sets/

Pingback (Sun, 04 Jan 2015): Confidence intervals on recall and eRecall « Evaluating E-Discovery
http://blog.codalism.com/index.php/why-training-and-review-partly-break-control-sets/comment-page-1/#comment-3571062

By: gvc (Sun, 02 Nov 2014)
http://blog.codalism.com/index.php/why-training-and-review-partly-break-control-sets/comment-page-1/#comment-3063285

I've always been a bit confused by what particular vendors mean by "control set." My naive impression would be that a control set would be an independent sample of the collection, as opposed to a holdout set (which is necessarily not independent in the statistical sense). From the various descriptions of control sets, it appears that some double as training sets (in which case, I presume, cross-validation is used), while others are used as straight evaluation sets. In the latter role, I would think an independent sample would avoid the issues you describe in this post. But I cannot tell from the vendors' literature which design they use.
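To make the distinction concrete, here is a minimal sketch in Python with scikit-learn; the synthetic collection, classifier, and sample sizes are stand-in assumptions, not a description of any vendor's product. Design 1 draws an independent simple random sample as the control set and reserves it for evaluation; design 2 lets the same labelled sample double as the training set, so its evaluation has to fall back on cross-validation.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Stand-in for the collection: X are document features, y the relevance
    # labels that human review would supply (prevalence roughly 10%).
    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.9], random_state=0)
    rng = np.random.default_rng(0)

    # Design 1: independent control set -- a simple random sample of the
    # collection, reviewed once and reserved purely for evaluation.
    control = rng.choice(len(X), size=500, replace=False)
    pool = np.setdiff1d(np.arange(len(X)), control)
    train = rng.choice(pool, size=500, replace=False)  # labelled training sample

    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    print("independent control-set accuracy:",
          round(clf.score(X[control], y[control]), 3))

    # Design 2: the same labelled sample doubles as the training set, so
    # evaluation falls back on cross-validation within that one sample.
    cv = cross_val_score(LogisticRegression(max_iter=1000),
                         X[control], y[control], cv=5)
    print("cross-validated accuracy, dual-use sample:", round(cv.mean(), 3))

Either way, the estimate stands or falls on the sample staying independent of the training process, which is where the next point comes in.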

There would still be the issue of interaction between the control set and any subsequent judgmental selection of training examples. To maintain independence, you'd have to isolate the control-set assessors from those selecting the training examples, or select the training examples before reviewing the control set.
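A small simulation can make that interaction visible. This is a sketch under assumptions: the collection is synthetic, and the "leaky" rule below, which picks training documents resembling the control set's positives, is an illustrative stand-in for an assessor who has reviewed the control set, not a claim about any actual workflow.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import pairwise_distances

    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[0.9], random_state=1)
    rng = np.random.default_rng(1)

    control = rng.choice(len(X), size=500, replace=False)
    rest = np.setdiff1d(np.arange(len(X)), control)

    # Blind protocol: training examples chosen with no knowledge of the
    # control set (here, a simple random draw from the remainder).
    blind = rng.choice(rest, size=200, replace=False)

    # Leaky protocol: the selector has seen the control set and favours
    # documents resembling its positives (plus random filler so both
    # classes are still represented in training).
    near = pairwise_distances(X[rest],
                              X[control][y[control] == 1]).min(axis=1)
    leaky = np.concatenate([rest[np.argsort(near)[:100]],
                            rng.choice(rest, size=100, replace=False)])

    for name, train in (("blind", blind), ("leaky", leaky)):
        clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        other = np.setdiff1d(rest, train)  # proxy for the wider collection
        print(name,
              "| control-set accuracy:",
              round(clf.score(X[control], y[control]), 3),
              "| collection accuracy:",
              round(clf.score(X[other], y[other]), 3))

Under the blind protocol the two figures should agree to within sampling error; under the leaky protocol the control-set figure can drift away from the collection-wide one, which is precisely the loss of independence described above.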
