<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>IREvalEtAl</title>
	<atom:link href="http://blog.codalism.com/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://blog.codalism.com</link>
	<description>William Webber's Research Blog</description>
	<lastBuildDate>Fri, 10 May 2013 21:43:22 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>What is the maximum recall in re Biomet?</title>
		<link>http://blog.codalism.com/?p=1808</link>
		<comments>http://blog.codalism.com/?p=1808#comments</comments>
		<pubDate>Wed, 24 Apr 2013 00:00:39 +0000</pubDate>
		<dc:creator>william</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.codalism.com/?p=1808</guid>
		<description><![CDATA[There has been a flurry of interest the past couple of days over Judge Miller's order in re Biomet. In their e-discovery process, the defendants employed a keyword filter to reduce the size of the collection, and input only the post-filtering documents to their vendor's predictive coding system, which seems to be a frequent practice [...]]]></description>
				<content:encoded><![CDATA[<p>There has been a flurry of interest the past couple of days over <a href="http://it-lex.org/wp-content/uploads/2013/04/Multimodal-eDiscovery.pdf">Judge Miller's order</a> in re Biomet. In their e-discovery process, the defendants employed a keyword filter to reduce the size of the collection, and input only the post-filtering documents to their vendor's predictive coding system, which seems to be a frequent practice at the current stage of adoption of predictive coding technology.  The plaintiffs demanded that instead the defendants apply predictive coding to the full collection.  Judge Miller <a href="http://it-lex.org/indiana-district-court-approves-multimodal-computer-assisted-review/">found in favour of the defendants</a>.<br />
<span id="more-1808"></span><br />
In support of their position, the defendants prepared a detailed affidavit <a href="http://codalism.com/~wew/docserver/biomet-ESI-defendant-affidavit.pdf">setting out their sample-based evaluation</a> (as well as providing costings and other interesting details concerning their computer-assisted review).  This affidavit is somewhat confusing in that it expresses the inclusiveness of the keyword filter in terms of prevalence rates in different parts of the collection (2% in the collection as a whole; 1% in the null set; 16% in the filtered-in set).  What we really care about is recall; that is, the proportion of responsive documents that are included in the production---or rather, here, the upper bound on recall, since the filtered-in set is the input to, not the output from, review.</p>
<p>The sampling procedure used to estimate completeness of the filtering step is described in some detail in Paragraphs 11 through 13, Exhibit C, of the <a href="http://codalism.com/~wew/docserver/biomet-ESI-defendant-affidavit.pdf">defendant's affidavit</a> (physical page 86 of the PDF).  The full collection is made up of 19.5 million documents (I'll round figures for simplicity).  The keyword filter extracted 3.9 million of them into the document review system; de-duplication reduced this further to 2.5 million unique documents.  The question then is, what proportion of the responsive documents made it into the filtered set; or, conversely, what proportion were left in the 16.6 million excluded documents?</p>
<p>The defendants performed three separate and independent (and partially redundant) samples to answer this question: one of the full collection, one of the filtered-in set, and one of the filtered-out set.  The defendant's affidavit provides confidence intervals, but to begin with, let's just work with point estimates of the quantities of interest.  The sample results were as follows:</p>
<table cellspacing="5">
<tr>
<th>Segment</th>
<th>Population</th>
<th>Sample</th>
<th>#&nbsp;Responsive</th>
<th>%&nbsp;Responsive</th>
<th>Yield estimate</th>
</tr>
<tr>
<td>Collection</td>
<td>19.5&nbsp;million</td>
<td>4146</td>
<td>80</td>
<td>1.9%</td>
<td>370,000</td>
</tr>
<tr>
<td>Filtered-out</td>
<td><strike>16.6</strike>&nbsp;15.6&nbsp;million</td>
<td>4146</td>
<td>39</td>
<td>0.95%</td>
<td><strike>158,000</strike>&nbsp;148,000</td>
</tr>
<tr>
<td>Filtered-in (dedup'ed)</td>
<td>2.5 million</td>
<td>1689</td>
<td>273</td>
<td>16.2%</td>
<td>405,000</td>
</tr>
</table>
<p>Immediately, you'll see we have a problem.  The estimate of the filtered-in set exceeds that of the entire collection, let alone of the collection minus the filtered-out set.  In fact, the latter (<strike>212,000</strike> 222,000) is well outside the 99% confidence interval on the filtered-in set yield of [348,126; 464,887].  And that's understating matters: the filtered-out and collection estimates are undeduplicated, whereas the filtered-in estimate is deduplicated (with a reduction to around 65% of the undeduplicated size).</p>
<p>And indeed the <a href="http://codalism.com/~wew/docserver/biomet-ESI-defendant-affidavit.pdf">working of the affidavit</a> here is incorrect.  I cite verbatim from Paragraph 11 of Exhibit C:</p>
<blockquote><p>
A random sample with a confidence level of 95% and estimation interval of 2.377% consisting of 1,689 documents was drawn from the 2.5+ million documents published to Axcelerate.  This sample was reviewed to obtain a baseline relevance rate for the document population created by keyword culling.  273 documents were identified as relevant in the sample, indicating with 95% confidence that the percentage of relevant documents in the population is between 184,268 and 229,162 or stated in percentages, between 14.41% and 17.91%.
</p></blockquote>
<p>The interval of "between 184,268 and 229,162" is coherent with the other estimates in the above table, but it is <i>not</i> "between 14.41% and 17.91%" of "2.5+ million documents", as a few seconds with a calculator will confirm (rather, it is between 7.37% and 9.17%).  The 14.41% to 17.91% interval is correct given the stated sample output, but doesn't work with the other estimates.</p>
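<p>For readers who would rather not reach for the calculator, here is a minimal Python sketch of that check.  It assumes the "2.5+ million" population is exactly 2.5 million (the affidavit's exact figure is not quoted here), and it uses the ordinary normal-approximation interval on the sample proportion, which is my guess at how the affidavit's percentages were derived rather than something the affidavit states.</p>

```python
import math

# Figures quoted in Paragraph 11 of Exhibit C.
sample_size = 1689
sample_responsive = 273
interval_docs = (184_268, 229_162)

# Assumption: the "2.5+ million" population is taken as exactly 2.5 million.
population = 2_500_000

# Share of the population that the stated document counts represent.
implied_pct = [100 * d / population for d in interval_docs]

# Normal-approximation 95% interval on the sample proportion (an assumed method).
p = sample_responsive / sample_size
half_width = 1.96 * math.sqrt(p * (1 - p) / sample_size)
sample_pct = [100 * (p - half_width), 100 * (p + half_width)]

print([round(x, 2) for x in implied_pct])
print([round(x, 2) for x in sample_pct])
```

<p>The first line printed is the percentage range implied by the document counts; the second is the range implied by the sample itself, which comes out within rounding of the affidavit's 14.41% to 17.91%.</p>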
<p>I'm not sure what mistake has been made here (perhaps confusing figures from pre-deduplication and post-deduplication?), but let's ignore the sample from the filtered-in set, and look just at the other two samples in the above table.  The point estimates are around 370,000 responsive documents in the full collection, and <strike>158,000</strike> 148,000 responsive documents in the filtered-out set.  This means that roughly <strike>42%</strike> 40% of the responsive documents are excluded from the set sent to document review, or in other words that even a flawless review process can achieve a maximum recall of <strike>58%</strike> 60%.  (And that's further assuming that the collection itself has not excluded responsive material.)</p>
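<p>The arithmetic behind the maximum-recall figure is short enough to spell out.  The sketch below uses only the rounded point estimates from the table; the variable names are mine, and no confidence interval is attempted.</p>

```python
# Point estimates taken from the table above (rounded figures).
collection_yield = 370_000      # responsive documents in the full collection
filtered_out_yield = 148_000    # responsive documents excluded by the keyword filter

# Fraction of responsive documents that never reach review,
# and the best recall a flawless review could then achieve.
excluded_fraction = filtered_out_yield / collection_yield
max_recall = 1 - excluded_fraction

print(f"excluded: {excluded_fraction:.0%}, maximum recall: {max_recall:.0%}")
```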
<p>It's not my intention to comment on whether the defendant's use of keyword pre-filtering is appropriate and proportionate for the particular circumstances of this particular case. But these figures do illustrate the likelihood that keyword pre-filtering will exclude a large volume of responsive data, and often (where defendants are not so thorough as the current ones have been in their sampling and validation protocol) without anyone being aware of it.  The problems artificially imposed by vendors charging by volume aside, I venture to suggest that keyword pre-filtering does not represent best production practice.</p>
<p>And finally, to return to an earlier point, it's really time that practitioners and counsel stopped producing the plethora of different estimates that we find in this case, and started sampling for, estimating, and reporting the prime quantity of interest: namely recall.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.codalism.com/?feed=rss2&#038;p=1808</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Stratified sampling in e-discovery evaluation</title>
		<link>http://blog.codalism.com/?p=1767</link>
		<comments>http://blog.codalism.com/?p=1767#comments</comments>
		<pubDate>Wed, 17 Apr 2013 18:46:12 +0000</pubDate>
		<dc:creator>william</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.codalism.com/?p=1767</guid>
		<description><![CDATA[Point- and lower-bound confidence estimates on the completeness (or recall) of an e-discovery production are calculated by sampling documents, from both the production and the remainder of the collection (the null set). The most straightforward way to draw this sample is as a simple random sample (SRS) across the whole collection, produced and unproduced. However, [...]]]></description>
				<content:encoded><![CDATA[<p>Point- and lower-bound confidence estimates on the completeness (or <i>recall</i>) of an e-discovery production are calculated by sampling documents, from both the production and the remainder of the collection (the <i>null set</i>).  The most straightforward way to draw this sample is as a simple random sample (SRS) across the whole collection, produced and unproduced.  However, the same level of accuracy can be achieved for a fraction of the review cost by using stratified sampling instead.  In this post, I introduce the use of stratified sampling in the evaluation of e-discovery productions.  In a later post, I will provide worked examples, illustrating the saving in review cost that can be achieved.<br />
<span id="more-1767"></span><br />
Sampling aims to estimate characteristics of a population by looking at a sample of that population.  In simple random sampling, we sample from the population indifferently; each element of the population (and each combination of elements) has the same probability of being included.  By contrast, in stratified sampling, we divide the population up into disjoint sets or strata; sample from each stratum separately; and combine the evidence from each stratum into an overall estimate.  </p>
<p>Stratified sampling improves estimate accuracy in two ways.  First, if the characteristic of interest (in e-discovery, the proportion of responsive documents) differs systematically between sub-populations, then accuracy is gained by estimating the characteristic for each sub-population separately, and then aggregating it for the population as a whole.  The increase in accuracy comes because we substitute lower within-sub-population for higher between-sub-population sampling error.</p>
<p><img src="http://farm3.staticflickr.com/2781/4395220778_fb886570a2.jpg"></p>
<p>Say, for example, we're estimating the average weight of a mixed herd of great danes and chihuahuas.  In stratified sampling, we sample from the great danes and from the chihuahuas separately, and only have to worry about whether the sampled great danes are representative of all great danes, and the sampled chihuahuas of all chihuahuas.  In simple random sampling, we sample dogs indifferently, and then have to worry about whether the sampled dogs are representative of dogs in general.  Happening unawares to sample more great danes than chihuahuas will hurt our estimate a lot; happening by chance to sample more heavy chihuahuas than light chihuahuas will hurt our estimate only a little.</p>
<p>Stratified sampling offers an additional advantage if sub-populations have differing homogeneities.  The more heterogeneous a sub-population, the less accurate a sample-based estimate.  We can achieve greater accuracy for the same sample budget by  sampling the more heterogeneous sub-population at a higher rate than the more homogeneous sub-population.  If our chihuahuas all have similar weights, while the great danes are more variable, then we're better off devoting more samples to the great danes, and fewer to the chihuahuas.</p>
<p><img src="http://farm1.staticflickr.com/140/389891490_5aebe298d0.jpg"></p>
<p>How does all of this apply to production evaluation in e-discovery?  Here, we have two clearly defined sub-populations: the (candidate) production and the null set.  These sub-populations differ significantly in the characteristic of interest, namely the proportion of responsive documents: the production is likely to be dense with them, the null set very sparse.  The null set is also far more homogeneous, being overwhelmingly non-responsive, and so we can sample it more lightly and assign the saved annotations to the more heterogeneous production. <b>(*)</b></p>
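<p>To make "sample it more lightly" concrete, here is a rough sketch using Neyman allocation, the textbook rule that splits a sample budget in proportion to stratum size times stratum standard deviation.  This is a standard technique I am substituting for illustration; it is tuned to estimating overall yield, not recall, so the footnote's caveat applies.  The stratum sizes, prevalences, and budget are hypothetical.</p>

```python
import math

def neyman_allocation(strata, budget):
    """Split a sample budget in proportion to N_h * S_h for each stratum.

    strata maps a name to (population size, assumed prevalence); S_h is the
    standard deviation of a 0/1 responsiveness indicator at that prevalence.
    """
    weights = {
        name: size * math.sqrt(prev * (1 - prev))
        for name, (size, prev) in strata.items()
    }
    total = sum(weights.values())
    return {name: budget * w / total for name, w in weights.items()}

# Hypothetical figures, chosen only for illustration.
strata = {
    "production": (100_000, 0.50),
    "null set": (900_000, 0.01),
}
allocation = neyman_allocation(strata, budget=2000)

for name, (size, _) in strata.items():
    n = allocation[name]
    print(f"{name}: about {n:.0f} documents, sampling rate {n / size:.3%}")
```

<p>Under these assumed figures the null set still receives more sampled documents in absolute terms, simply because it is so much larger, but the production is sampled at a higher rate, which is the sense in which the null set is sampled "more lightly".</p>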
<p>We can go further and divide the collection into more strata than just the produced and null sets.  Most contemporary predictive coding systems are able to rank documents by probability of responsiveness, with the production being determined by truncating the ranking at some cutoff.  The documents in the null set just below this cutoff are more likely to be responsive than the documents further down.  Therefore, it makes sense to divide the null set into two (or more) strata based on this ranking, and sample the upper null set more densely than the lower null set---to look, in essence, for responsive documents missed by the production where we are more likely to find them.</p>
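<p>As a rough illustration of how the strata combine into a single recall figure, the sketch below scales each stratum's sample prevalence up to a yield estimate and then takes the production's share of the total.  All population sizes, sample sizes, and counts are invented for the example, and only a point estimate is computed.</p>

```python
# Hypothetical strata: (name, population size, sample size, responsive in sample).
strata = [
    ("production", 100_000, 500, 300),
    ("upper null set", 200_000, 400, 12),
    ("lower null set", 700_000, 300, 1),
]

def estimated_yield(population, sample, responsive):
    """Scale the sample prevalence up to the stratum population."""
    return population * responsive / sample

yields = {name: estimated_yield(N, n, r) for name, N, n, r in strata}

# Recall = responsive documents produced / responsive documents in the collection.
recall = yields["production"] / sum(yields.values())

for name, y in yields.items():
    print(f"{name}: about {y:,.0f} responsive documents")
print(f"estimated recall: {recall:.1%}")
```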
<p>Finally, we require a method of estimation, not just of point estimates, but of confidence intervals.  For simple random sampling, the simple binomial confidence interval is used, on the proportion of sampled responsive documents that happen to fall in the production.  Interval estimation for stratified sampling is more complex, which has perhaps deterred practitioners to date.  However, a method of estimating confidence intervals on recall from stratified sampling is described and validated in my recently published ACM TOIS article, <a href="http://dx.doi.org/10.1145/2414782.2414784">Approximate Recall Confidence Intervals</a> (January 2013, Volume 31, Issue 1, pages 2:1--33) (<a href="http://arxiv.org/abs/1202.2880">free version in arXiv</a>).  In a later posting, I will provide worked examples with the two sampling and interval methods, and illustrate the significant cost savings that stratified sampling offers.</p>
<hr />
<p><b>(*)</b> In practice, predicting the optimal sample allocation between null set and production set is more complicated than just observing which has the higher prevalence, since the production set yield occurs twice in the recall formula (once in the numerator, once in the denominator), and the null set yield is a divisor, not an added term, in the formula.  We will examine this issue in more detail in a later post.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.codalism.com/?feed=rss2&#038;p=1767</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Does automatic text classification work for low-prevalence topics?</title>
		<link>http://blog.codalism.com/?p=1758</link>
		<comments>http://blog.codalism.com/?p=1758#comments</comments>
		<pubDate>Fri, 25 Jan 2013 17:06:07 +0000</pubDate>
		<dc:creator>william</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.codalism.com/?p=1758</guid>
		<description><![CDATA[Readers of Ralph Losey's blog will know that he is an advocate of what he calls "multimodal" search in e-discovery; that is, a search methodology that involves a mix of human-directed keyword search, human-machine blended concept search, and machine-directed text classification (or predictive coding, in the e-discovery jargon). Meanwhile, he deprecates the alternative model of [...]]]></description>
				<content:encoded><![CDATA[<p>Readers of <a href="http://e-discoveryteam.com/">Ralph Losey's blog</a> will know that he is an advocate of what he calls <a href="http://e-discoveryteam.com/2012/07/08/day-two-of-a-predictive-coding-narrative-more-than-a-random-stroll-down-memory-lane/">"multimodal"</a> search in e-discovery; that is, a search methodology that involves a mix of human-directed keyword search, human-machine blended concept search, and machine-directed text classification (or predictive coding, in the e-discovery jargon).  Meanwhile, he deprecates the alternative model of machine-driven review, in which only text classification technology is employed, and the human's sole function is to code machine-selected documents as responsive or non-responsive.  The difference between the two modes can be seen most clearly in the creation of the seed set (that is, the initial training set created to bootstrap the text classification process).  In multimodal review, the seed set might be taken from (a sample of) the results of active search by the human reviewer; in machine-driven review, the seed set is formed by a random sample from the collection.<br />
<span id="more-1758"></span><br />
Late last year, Ralph journalized <a href="http://e-discoveryteam.com/2012/07/01/day-one-of-a-predictive-coding-narrative-searching-for-relevance-in-the-ashes-of-enron/">an experiment in multimodal review</a> on the Enron collection.  Currently, he is journalizing a more fictionalized (but still empirically-based) account of machine-driven review (colourfully titled <a href="http://e-discoveryteam.com/2013/01/13/journey-into-the-borg-hive-part-one-of-a-new-e-discovery-sci-fi-saga/">"Journey into the Borg Hive"</a>).  In a <a href="http://e-discoveryteam.com/2013/01/19/journey-into-the-borg-hive-part-two/">recent post</a> in the latter series, he raises what is perhaps the most important point in the dispute between the multimodal and machine-driven approaches: what happens when your collection has a low prevalence of responsive documents---when only half a percent of your documents are responsive, rather than ten percent?  Prevalence is partly determined by the topic, but (as Ralph notes) is even more strongly driven by collection and culling decisions.  The lighter the culling of the collection (by keyword filters, for instance), the lower the prevalence.   But crude culling (and keyword culling counts as crude) risks excluding relevant material, and so should be avoided, especially now that automated text analysis technologies allow (in principle) arbitrarily large collections to be efficiently searched.  Therefore, low prevalence collections will become increasingly frequent in practice.</p>
<p>What difference does collection prevalence make to the choice between multimodal and computer-driven review?   Well, text classification relies on having both positive and negative examples.  The less balanced the example set is, the more examples you require to achieve a given level of effectiveness in the classifier.  Indeed, a rough rule of thumb is that cost of training scales more with the number of examples in total, at least where examples are selected by pure random sampling.  </p>
<p>The need for positive examples creates a bootstrapping problem for a machine-driven classification approach when collection prevalence is low.  If responsive documents are rare, then you'll need a large initial random sample (and therefore extensive review time) to locate them.  It would likely be much faster to find an adequate number of relevant documents by a human-directed (keyword or concept) search.  Moreover, learning is likely to be slower the fewer relevant documents there are in the initial sample set, even with active learning; again, it might be more efficient to have the human actively help locate relevant documents in the early stages of the search, and delay moving into computer-driven mode until the review process is well under way.</p>
<p>How serious is the effect of positive-example seed-set deprivation on the efficiency of text classification?  The impact on seed-set creation is easy to calculate; it's a simple question of sampling.  Let's say that a minimum of ten responsive documents is required to get the text classifier going; if one in a thousand documents in the collection is responsive, then on average a sample of ten thousand documents is required to make an adequate seed set (more if you want a reasonable degree of confidence that at least ten positive examples will be found).   With a low enough prevalence, a randomly-sampled seed set may simply be a non-starter.</p>
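<p>The "more if you want a reasonable degree of confidence" can be put into numbers.  The sketch below treats the seed sample as independent binomial draws (reasonable when the sample is small relative to the collection) and asks both how likely a sample of ten thousand is to contain ten positives, and how large a sample is needed to be 95% sure of it.  The function names and the 95% level are my own choices for illustration.</p>

```python
import math

def prob_at_least(k, n, p):
    """Exact binomial probability of k or more successes in n draws."""
    below = sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1 - below

def sample_size_for(k, p, confidence):
    """Smallest n giving at least k positives with the stated confidence."""
    lo, hi = k, k
    # Double the upper limit until it is large enough.
    while confidence > prob_at_least(k, hi, p):
        hi *= 2
    # Binary search; the probability is non-decreasing in n.
    while hi > lo:
        mid = (lo + hi) // 2
        if prob_at_least(k, mid, p) >= confidence:
            hi = mid
        else:
            lo = mid + 1
    return lo

prevalence = 0.001   # one responsive document in a thousand
needed = 10          # positives required to start the classifier

print(needed / prevalence)                          # expected sample size
print(prob_at_least(needed, 10_000, prevalence))    # chance that 10,000 suffices
print(sample_size_for(needed, prevalence, 0.95))    # size for 95% confidence
```

<p>Under these assumptions, a sample of ten thousand yields the required ten positives only a little more than half the time, and the sample needed for 95% assurance is noticeably larger.</p>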
<p>Say, though, we have a more intermediate case, where prevalence, while low, is still sufficient to give a handful of responsive documents in a moderately-sized sample; Ralph's own example produces 10 relevant documents in a sample of 2,401, for a sample prevalence of just under half a percent.  The effect of such a low-prevalence seed set on speed of learning is complex, particularly when active learning is being used, as is the case with most classification systems these days.   (In active learning, the computer selects for coding at each training iteration those documents that it is least certain about. This is generally much more efficient than selecting documents at random, but still requires an initial seed set to get started.)  Active learning should in principle compensate for low prevalence, by exploring ambiguous regions of the document space.  How strong this effect will be in practice is an empirical question.</p>
<p>What, then, is the argument in favour of solely computer-driven review, even in the face of low-prevalence tasks (apart from the fact that it is much simpler to build a user interface for batch annotation than for active search)?  The main argument is the assertion that relying on human judgment to select seed documents will bias the results towards the human's initial conception of relevance; that responsive documents similar to those found in the initial keyword search are more likely to appear in the final production than responsive documents different from the seed set.  Certainly, human choice in creating the seed set will have <i>some</i> impact on final classifier output.  On the other hand, in the subsequent training iterations, active learning will select documents that are <i>unlike</i> the initial seed set, and so over time should drive the classifier away from dependence on the human's initial judgment.  Again, the strength of this effect is an empirical question.</p>
<p>So, we have two empirical questions about the comparative impact of random and human-created seed sets for active learning on low-prevalence topics.  Is active learning on a seed set with only a handful of responsive documents able to close the effectiveness gap with a higher-prevalence seed set populated by human search, and if so, how quickly?  And how strongly does a human-selected seed set bias final classification results towards the sub-topic represented in that seed set?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.codalism.com/?feed=rss2&#038;p=1758</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why confidence intervals in e-discovery validation?</title>
		<link>http://blog.codalism.com/?p=1754</link>
		<comments>http://blog.codalism.com/?p=1754#comments</comments>
		<pubDate>Sat, 08 Dec 2012 19:59:49 +0000</pubDate>
		<dc:creator>william</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.codalism.com/?p=1754</guid>
		<description><![CDATA[A question that often comes up when discussing e-discovery validation protocols is, why should they be based on confidence intervals, rather than point estimates? That is, why do we say, for instance, "the production will be accepted if we have 95% confidence that its recall is greater than 60%"? Why not just say "the production [...]]]></description>
				<content:encoded><![CDATA[<p>A question that often comes up when discussing e-discovery validation protocols is, why should they be based on confidence intervals, rather than point estimates?  That is, why do we say, for instance, "the production will be accepted if we have 95% confidence that its recall is greater than 60%"?  Why not just say "the production will be accepted if its estimated recall is 75%"?  Indeed, there have been some recent protocols that take the latter, point-estimate approach.  (The ESI protocol of <a href="http://www.ediscoverylaw.com/ MemoSupportPredictiveCoding.pdf">Global Aerospace, Inc. v. Landow Aviation</a>, for instance, states simply that 75% recall shall be the "acceptable recall criterion", without specifying anything about confidence levels.) Answering this question requires some reflection on what we are trying to achieve with a validation protocol in the first place.<br />
<span id="more-1754"></span><br />
There are two ways that a validation protocol can fail: it can pass a bad production (a false positive error); and it can reject a good one (a false negative error).  We want the probability of each of these to be as low as possible, of course, but there is an unavoidable tradeoff. The stricter we make the threshold, the less likely we are to accept a bad production, but the more likely we are to reject a good one.  Conversely, looser thresholds will pass good productions more reliably, but pass bad ones more readily, too.  The only way to decrease the likelihood of both failure modes is, not to adjust the threshold, but to increase the sample size.  And so ultimately what we want to know is, what sample size is required to get both validation failure probabilities down to acceptable levels?</p>
<p>The confidence interval approach directly answers this question of failure probability.  When we say that "recall is greater than 60% with 95% confidence", we are expressing the probability that a false positive will occur, if we accept the production.  Here, the 60% recall is the lower limit on an acceptable quality of production.  The unstated third term in this equation is the sample size; but it is the sample size that allows us to make these statements about confidence.</p>
<p>Conversely, if the production fails -- if the lower bound is less than our threshold (here, 60%) -- we can look at the upper bound on recall, and check that it is below the threshold of a system that should reliably pass.  If, for instance, we had a confidence interval on recall of [55%, 95%], then we have a serious problem: we can't pass the production (lower-bound recall is only 55%), but we can't be confident that the production is not actually good (upper-bound recall is 95%).</p>
<p>We'd prefer not to realize that we've failed a good system after validation is complete, of course.  Therefore, we plan the sample size of our validation protocol in advance, to reduce the chance of failing a good production to an acceptable level.   (And note that the lower bound of a "good" production must be higher than the upper bound of a "bad" production; otherwise, the probability of rejecting a good system is always one minus the probability of accepting a bad one.)  Planning the protocol to pass good productions requires similar reasoning about sample sizes and failure probabilities to that of calculating the confidence interval after the production is sampled.</p>
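<p>A sketch of that planning step, under simplifying assumptions: the sample is a simple random sample of responsive documents, so the number falling in the production is binomial; a "bad" production has 60% recall and a "good" one 75% (hypothetical planning values); and the pass rule is an exact one-sided binomial test, which plays the role of the lower confidence bound.  For each sample size it finds the pass cutoff that keeps the chance of accepting a bad production at or below 5%, then reports the chance of rejecting a good one.</p>

```python
import math

def binom_tail(k, n, p):
    """Exact probability of k or more successes in n binomial draws."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def plan(n, bad_recall, good_recall, alpha=0.05):
    """Return (cutoff, P(accept bad production), P(reject good production)).

    The production passes if at least `cutoff` of the n sampled responsive
    documents fall inside it.
    """
    cutoff = next(k for k in range(n + 2) if alpha >= binom_tail(k, n, bad_recall))
    accept_bad = binom_tail(cutoff, n, bad_recall)
    reject_good = 1 - binom_tail(cutoff, n, good_recall)
    return cutoff, accept_bad, reject_good

# Hypothetical planning values: 60% recall is "bad", 75% recall is "good".
for n in (50, 100, 400):
    cutoff, accept_bad, reject_good = plan(n, bad_recall=0.60, good_recall=0.75)
    print(f"n={n}: pass at {cutoff}+, "
          f"P(accept bad)={accept_bad:.3f}, P(reject good)={reject_good:.3f}")
```

<p>Holding the false-positive rate fixed, the only thing that brings the false-negative rate down is the larger sample.</p>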
<p>The problem with using a point estimate alone is that it gives us none of this information about confidence and failure probabilities.  It fails to do so because it does not consider the size of the sample we have used.  A point estimate of 75% recall means quite a different thing on a sample of 100 documents than it does on a sample of 10,000.</p>
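<p>To see how different, here is a minimal sketch that puts a 95% interval around the same 75% point estimate at the two sample sizes.  It uses the Wilson score interval, one common binomial interval that I am choosing for illustration, and assumes the sample consists of responsive documents drawn by simple random sampling.</p>

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% when z=1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

for n in (100, 10_000):
    low, high = wilson_interval(round(0.75 * n), n)
    print(f"n={n}: 75% point estimate, 95% interval [{low:.1%}, {high:.1%}]")
```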
<p>One motive for the preference for point estimates is the perception that confidence intervals make the protocol unnecessarily more difficult to satisfy.  This objection is poorly founded in theory, in that one can generally state a confidence interval requirement that is as loose or strict as a point estimate one. The objection perhaps has some psychological basis, as it may seem (to the statistically naive) that a 95% confidence threshold of 60% is looser than a point estimate threshold of 75% recall.  But that psychological judgment is achieved only by ignoring what the different bounds are telling us.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.codalism.com/?feed=rss2&#038;p=1754</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The environmental consequences of SIGIR</title>
		<link>http://blog.codalism.com/?p=1745</link>
		<comments>http://blog.codalism.com/?p=1745#comments</comments>
		<pubDate>Mon, 03 Dec 2012 15:59:32 +0000</pubDate>
		<dc:creator>william</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.codalism.com/?p=1745</guid>
		<description><![CDATA[As it is becoming apparent that, without drastic immediate action, we are going to significantly overshoot greenhouse gas emission targets and warm the planet by an environmentally disastrous 4 to 5 degrees centigrade by the end of the century, I thought I should fulfil my long-standing promise to myself and calculate the carbon emissions generated [...]]]></description>
				<content:encoded><![CDATA[<p>As it is becoming apparent that, without drastic immediate action, <a href="http://www.theage.com.au/environment/climate-change/its-the-end-of-the-world-as-we-know-it-20121202-2ap4l.html">we are going to significantly overshoot greenhouse gas emission targets and warm the planet by an environmentally disastrous 4 to 5 degrees centigrade by the end of the century</a>, I thought I should fulfil my long-standing promise to myself and calculate the carbon emissions generated by the annual SIGIR conference.  I'm only going to consider here the emissions caused by air travel; but this is likely to be the overwhelming majority of total emissions.<br />
<span id="more-1745"></span><br />
SIGIR attendance varies from year to year: <a href="http://www.sigir.org/reports/sigir-notes-2010.pdf">610 for 2009</a>, <a href="http://www.sigir.org/reports/sigir-notes-2011.pdf">553 for 2010, 800 for 2011</a>, and <a href="http://www.sigir.org/sigir2012/">500-odd</a> for 2012.  Let's take 600 as a rough estimate of average attendance.  Attendance was 50% in-continent in 2011, 40% in-continent in 2010, and 60% in-continent in 2009 (for, respectively, Asia, Europe, and North America).  Let's say 50% in-continent and 50% out-of-continent.</p>
<p>Let's take Chicago--Los Angeles non-stop round-trip as a representative in-continent flight, and Frankfurt--Beijing non-stop round-trip as a representative out-of-continent flight. According to <a href="http://www2.icao.int/en/carbonoffset/Pages/default.aspx">the emissions calculator provided by the International Civil Aviation Organization</a>, the former trip generates around 480kg of CO2, the latter around 1,200kg.  (We're being conservative here and assuming everyone flies economy, and everyone can make a non-stop trip.  You can add <a href="http://www.terrapass.com/carbon-footprint-calculator-2/#air">roughly 200 kilos of CO2 per stop</a>.)  Therefore, with 300 attendees on each kind of flight, we have (480 * 300 + 1200 * 300) = 504,000 kg, or 504 metric tons, of CO2.</p>
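<p>The same sum as a few lines of Python, using only the figures given above (600 attendees split evenly, and the two per-passenger emission figures); the variable names are mine.</p>

```python
attendees = 600
in_continent = attendees // 2                 # 50% of attendees
out_of_continent = attendees - in_continent   # the other 50%

kg_in = 480      # Chicago--Los Angeles round trip, kg CO2 per passenger
kg_out = 1200    # Frankfurt--Beijing round trip, kg CO2 per passenger

total_kg = kg_in * in_continent + kg_out * out_of_continent
total_tonnes = total_kg / 1000

print(total_kg, total_tonnes)
```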
<p>So how much is 504 metric tons of CO2?  The US EPA provides <a href="http://www.epa.gov/cleanenergy/energy-resources/calculator.html">a handy calculator of greenhouse gas equivalences</a>; according to this, 504 metric tons of CO2 is equivalent to the annual emissions of around 100 US passenger vehicles; the energy use of 44 US households; 5 acres of deforestation; and so forth.</p>
<p>What could be done to cut these emissions?  The simplest answer would be -- cancel SIGIR (though in practice this would probably lead to an increase in compensatory air travel for meetings that take place at SIGIR).  A less drastic solution would be to move towards a model in which conference attendance is not compulsory for publication; something like the <a href="http://www.vldb.org/pvldb/">journalized conference form of the Proceedings of the VLDB Endowment</a>.  Journalization would reduce conference attendance (and resulting travel) to those who actually want to attend the conference (as well as removing a financial hurdle to publication for those who have limited finance).  Additionally, if a series of three conferences were associated, on equal footing, with the attached journal (perhaps SIGIR plus CIKM plus one other), and if one such conference was held in each region per year, distance flown by attendees would likely drop.</p>
<p>As a not-for-profit organization, SIGIR is not subject to the profit-maximizing imperatives of business entities.  The SIGIR community should be taking the lead in implementing practical solutions to reduce the environmental impact of its activities.  Restructuring the way SIGIR and associated conferences work can help to achieve this.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.codalism.com/?feed=rss2&#038;p=1745</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Statistical power of E-discovery validation</title>
		<link>http://blog.codalism.com/?p=1733</link>
		<comments>http://blog.codalism.com/?p=1733#comments</comments>
		<pubDate>Tue, 04 Sep 2012 16:41:50 +0000</pubDate>
		<dc:creator>william</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.codalism.com/?p=1733</guid>
		<description><![CDATA[My last post introduced the idea of the satisfiability of a post-production quality assurance protocol. We said that such a protocol is not satisfiable for a given size of the sample from the unretrieved (or null) set if the protocol were to fail the production even if the sample found no relevant documents. The reason [...]]]></description>
				<content:encoded><![CDATA[<p>My last post introduced the idea of the satisfiability of a post-production quality assurance protocol.  We said that such a protocol is not satisfiable for a given size of the sample from the unretrieved (or null) set if the protocol were to fail the production even if the sample found no relevant documents.  The reason a protocol could fail in such a circumstance is that the upper bound of the confidence interval on the missed relevant documents could still be above our threshold.<br />
<span id="more-1733"></span></p>
<p>Generally in statistics when we think about such tests, we look for something more than satisfiability.  Rather, we think about how confident we are that what we consider a sufficiently complete production would pass the protocol -- what is known as the <i>power</i> of the test.  Power stands in tension to the strictness of the test, the confidence required for passing the threshold.  The stricter our test, the less likely bad systems are to pass, but the more likely good systems are to fail.</p>
<p>Let's put this in concrete terms.  To make things simpler, we'll work initially in terms of what Herb Roitblat calls <a href="http://www.ediscoveryinstitute.org/publications/information_retrieval_and_ediscovery">elusion</a>: the proportion of non-produced documents that are relevant.  (This can be directly mapped to recall, but converting backwards and forwards is going to confuse us.)  Our QA protocol might then be stated in such terms:</p>
<blockquote><p>
    A production only passes QA if we have 95% confidence that elusion is no more than 1%.
</p></blockquote>
<p>Once we add to this a sample size, the protocol can be stated in terms of a minimum number of relevant documents that will fail the production if they are found in the sample.  So, for instance, if we have a sample of 1000 documents, then the above protocol will fail the production if 5 or more relevant documents are found in the sample; the upper bound of the one-tailed exact 95% confidence interval on elusion with a sample of 5 relevant out of 1000 is 1.05%, whereas for 4 relevant it is 0.91%.  (For those who are savvy with the R statistical language, we calculate the interval using <tt>binom.test(5, 1000, alternative="less")</tt>, and look at the result marked "95% confidence interval".  The threshold value -- here 5 -- is calculated using <tt>qbinom(1 - 0.95, 1000, 0.01)</tt>; that is, the 5th percentile of the Binom(1000, 0.01) sampling distribution.)</p>
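<p>The two R calls just mentioned can be put together in a short sketch.  The helper name <tt>upper_bound</tt> is only illustrative; <tt>binom.test</tt> and <tt>qbinom</tt> are the standard R functions.</p>

```r
n <- 1000          # QA sample size
threshold <- 0.01  # maximum acceptable elusion
conf <- 0.95       # required one-sided confidence

# One-sided exact (Clopper-Pearson) upper bound on elusion for x relevant in n.
upper_bound <- function(x, n, conf = 0.95) {
  binom.test(x, n, alternative = "less", conf.level = conf)$conf.int[2]
}

upper_bound(4, n)  # about 0.0091: below the 1% threshold, so the production passes
upper_bound(5, n)  # about 0.0105: above the threshold, so the production fails

# Smallest count of relevant documents in the sample that fails the production.
fail_at <- qbinom(1 - conf, n, threshold)
fail_at            # 5
```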
<p>So now we need to decide what actual sample size is appropriate for our protocol.  This decision hinges on two questions.  The first question is how confident we want to be that a good production would pass our protocol, what is known as the "power" of the test.  A power of 80% is a common choice.  The second question is what we would consider a "good" production. </p>
<p>It might seem at first blush that the answer to the question "what is a good production?" is "a good production is one with elusion of no more than 1%", which is what our protocol is testing for.  But in fact a production that has the threshold elusion of the protocol only has a 5% chance or less of passing that protocol; this is what gives us the 95% confidence that a passing production has threshold elusion or lower.  (If in fact you want to treat a production sitting right at the threshold as good, then you need to rethink your protocol, and set a less strict rejection threshold.)</p>
<p>More generally, for a given sample size, we can state the probability that a production of a given level of performance will pass our protocol.  For instance, if our sample size is 1000, and the production has actual elusion 0.5% (half the threshold amount), then it still has a 56% chance of failing our protocol.  (In R, <tt>1 - pbinom(5 - 1, 1000, 0.005)</tt>, where 5 is the number of relevant documents in the sample at which the production fails, as calculated earlier.)  In fact, the maximum elusion that would have at least an 80% chance of passing our protocol is about 0.31%, under a third of the threshold amount (in R, <tt>1 - qbeta(0.8, 1000 - 5 + 1, 5)</tt>, exploiting the relationship between the beta and binomial distributions: the chance of sampling at most 4 relevant documents at elusion <tt>p</tt> equals <tt>pbeta(1 - p, 1000 - 4, 4 + 1)</tt>).</p>
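<p>These probabilities can also be checked numerically without going through the beta distribution.  The sketch below is one way to do it; the function name <tt>pass_prob</tt> and the search interval given to <tt>uniroot</tt> are my own choices for illustration.</p>

```r
n <- 1000       # QA sample size
fail_at <- 5    # the production fails if this many relevant documents are sampled

# Probability that a production with the given true elusion passes the protocol,
# that is, that fewer than fail_at relevant documents turn up in the sample.
pass_prob <- function(elusion) pbinom(fail_at - 1, n, elusion)

1 - pass_prob(0.005)   # chance of failing at half the threshold elusion, about 0.56
pass_prob(0.01)        # chance of passing right at the threshold, under 0.05

# Largest elusion with at least an 80% chance of passing, found by root search.
max_elusion <- uniroot(function(e) pass_prob(e) - 0.8,
                       interval = c(1e-6, 0.01), tol = 1e-10)$root
max_elusion
```

<p>Because the pass probability falls steadily as elusion rises, the root is the largest elusion that still passes at least 80% of the time.</p>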
<p>What does this all mean for recall?  Well, given a retrieved and an unretrieved size, and assuming all retrieved documents have been verified relevant (and all known relevant documents are retrieved), then a recall value can be mapped into an elusion value, and used in the above calculations.  Let's therefore restate our QA protocol in terms of recall:</p>
<blockquote><p>
  A production only passes QA if we have 95% confidence that recall is no less than 50%.
</p></blockquote>
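<p>The mapping between recall and elusion mentioned above can be sketched as follows, under the stated assumption that the production consists of all and only verified relevant documents.  The function names, and the example sizes of 2,000 produced and 700,000 unretrieved documents, are hypothetical.</p>

```r
# r:      number of documents produced, all assumed verified relevant
# N_null: number of documents in the unretrieved (null) set

# Elusion implied by a given recall: missed = r * (1 - recall) / recall.
recall_to_elusion <- function(recall, r, N_null) {
  missed <- r * (1 - recall) / recall
  missed / N_null
}

# Recall implied by a given elusion: recall = r / (r + missed).
elusion_to_recall <- function(elusion, r, N_null) {
  r / (r + elusion * N_null)
}

# Hypothetical example sizes, for illustration only.
recall_to_elusion(0.50, r = 2000, N_null = 700000)  # elusion at the 50% recall threshold
recall_to_elusion(0.75, r = 2000, N_null = 700000)  # elusion of a 75% recall production
```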
<p>We also need to specify what constitutes a good production, one that we want to be 80% confident will pass our protocol.  Let's specify that a production with actual recall of 75% is good.  Then we can ask, what sample size is required to satisfy our evaluation criteria for a given retrieval size?  The following graph gives the answer:</p>
<div id="attachment_1737" class="wp-caption aligncenter" style="width: 410px"><img src="http://blog.codalism.com/wp-content/uploads/2012/09/pwrsmplsz.png" alt="Sample size required for production size" title="pwrsmplsz" width="400" height="400" class="size-full wp-image-1737" /><p class="wp-caption-text">Sample size required for production size</p></div>
<p>... but it's a bit hard to read.  Let's show the same thing in a log-log graph (note the axes carefully):</p>
<div id="attachment_1738" class="wp-caption aligncenter" style="width: 410px"><img src="http://blog.codalism.com/wp-content/uploads/2012/09/pwrsmplsz-log.png" alt="Sample size required for production size (log-log)" title="pwrsmplsz-log" width="400" height="400" class="size-full wp-image-1738" /><p class="wp-caption-text">Sample size required for production size (log-log)</p></div>
<p>The line is approximately straight in the log-log graph because there's an inverse relationship between retrieval and sample size (doubling the retrieval size roughly halves the sample size required), at least while retrieval size is a small fraction of collection size.  In fact, <i>for this particular collection size, and threshold and "good" recall</i>, the approximate relationship is captured by the formula:</p>
<p><center><br />
 <span class='MathJax_Preview'><img src='http://blog.codalism.com/wp-content/plugins/latex/cache/tex_4a2a8b08ff31fc65b925985acdd111e5.gif' style='vertical-align: middle; border: none; ' class='tex' alt="\mbox{smpl.sz} \approx 6.4e6 / \mbox{retr.sz}" /></span><script type='math/tex'>\mbox{smpl.sz} \approx 6.4e6 / \mbox{retr.sz}</script> <br />
</center></p>
<p>(Don't pick sample sizes for actual protocols using this approximation, however: our test statistic -- relevant documents sampled -- is discrete, and inference from discrete statistics can be very sensitive to apparently minor differences in sample size.)</p>
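<p>For an exact figure rather than the approximation, the smallest adequate sample size can be found by direct search.  The following is a minimal sketch, not the code behind the graphs: the function name, the search limit <tt>n_max</tt>, and the null set size of 700,000 in the example calls are all assumptions made for illustration.</p>

```r
# Smallest sample size n such that a production with recall_good passes, with
# probability at least `power`, a protocol demanding `conf` confidence that
# recall is at least recall_thresh.  Returns NA if no n up to n_max qualifies.
required_sample_size <- function(r, N_null, recall_thresh = 0.5, recall_good = 0.75,
                                 conf = 0.95, power = 0.8, n_max = 100000) {
  # Elusion implied by a recall level, assuming all r produced documents are relevant.
  elusion_for <- function(recall) r * (1 - recall) / recall / N_null
  ns <- seq_len(n_max)
  # Fewest sampled relevant documents that fail the production at each n.
  fail_at <- qbinom(1 - conf, ns, elusion_for(recall_thresh))
  # Chance that a good production stays below that count.
  pass_prob <- pbinom(fail_at - 1, ns, elusion_for(recall_good))
  ok <- which(pass_prob >= power)
  if (length(ok) == 0) NA_integer_ else ns[ok[1]]
}

required_sample_size(r = 2700, N_null = 700000)
required_sample_size(r = 640,  N_null = 700000)
```

<p>Note that the search returns only the first sample size that qualifies; because the test statistic is discrete, a slightly larger sample does not necessarily qualify as well.</p>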
<p>From the figure we can see that, if we are going to validate our production with a post-production sample of 2,400 from the unretrieved set, then there's no point performing the validation unless our retrieval has found at least 2,700-odd relevant documents, because otherwise even a good production can fail the validation.  And, if we set the upper bound on the number of documents we're prepared to look at during validation to 10,000, then productions of fewer than 640 documents essentially cannot be validated, unless we're prepared to loosen our validation requirements.  If, for instance, we set the threshold recall at 30%, and require only 90% confidence in beating this threshold (while retaining 75% recall as our definition of a "good" retrieval), then required sample size falls from around 10,000 to below 2,000.  But we'd want to be sure that we're willing to accept the reduced confidence and lower threshold.</p>
<p>In summary, the feasibility of using sampling-based methods to validate the recall of an e-discovery production is highly sensitive to production size.  For small productions, strict protocols will require an impractically large sample size.  Protocol parameters should be determined with an eye to expected collection yield and production size, and where expectation or (production-independent) evidence suggests relevant documents are rare, alternative levels or methods of production validation must be considered.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.codalism.com/?feed=rss2&#038;p=1733</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Meaningful QA sample size in e-discovery</title>
		<link>http://blog.codalism.com/?p=1705</link>
		<comments>http://blog.codalism.com/?p=1705#comments</comments>
		<pubDate>Mon, 13 Aug 2012 02:09:45 +0000</pubDate>
		<dc:creator>william</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.codalism.com/?p=1705</guid>
		<description><![CDATA[In my last post, I examined the live-blog e-discovery production being performed by Ralph Losey, and asked what lower limit we could place on the recall of highly relevant documents with 95% confidence based on the final, quality assurance sample. The QA sample drew 1065 documents from the null set (that is, the set of [...]]]></description>
				<content:encoded><![CDATA[<p>In my last post, I examined the live-blog e-discovery production being performed by Ralph Losey, and asked what lower limit we could place on the recall of highly relevant documents with 95% confidence based on the final, quality assurance sample.  The QA sample drew 1065 documents from the null set (that is, the set of documents that were not slated for production).  Although none of these documents were highly relevant, this still only allows us to say with 95% confidence that no more than 0.281% of the null documents are highly relevant.  Since there are 698,423 null documents, this represents an upper bound of 1962 highly relevant documents that have been missed.  As only 18 highly relevant documents were found in the production, Ralph's lower-bound recall is 1%.  To get this lower-bound recall up to 50%, you'd need to sample around 100,000 documents from the null set without finding any highly relevant.<br />
<span id="more-1705"></span></p>
<p>Such figures might cause us to question the usefulness of quality assurance sampling.  How reasonable is it to include assurance requirements that force us to review tens of thousands of documents to verify that we've only found a highly relevant handful because highly relevant documents are rare?  One way to address the question of effort required is to set up a simple protocol, and see how big a sample is necessary for this protocol to be satisfiable.  Let's say the requirement is that we have 95% confidence in a lower-bound of 50% recall.  And, to keep things simple, let's say that we'll select a QA sample large enough to meet this requirement, provided none of the documents in it prove relevant (at whatever threshold of relevance we're testing).  (This is a primitive form of [edit: power] analysis, adequate only for the sake of our current discussion. To do things properly, we need also to consider the likelihood of a given sample outcome occurring for a hypothesized level of "acceptable" system performance.  Don't choose sample sizes for QA of actual productions based only on the discussion here!)</p>
<p>With these assumptions, we quickly realize that there are two factors that determine the required size of the QA sample: the size of the null set, and the size of the production (assuming that the production consists of all and only manually-verified relevant documents).  Let  <span class='MathJax_Preview'><img src='http://blog.codalism.com/wp-content/plugins/latex/cache/tex_4b43b0aee35624cd95b910189b3dc231.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="r" /></span><script type='math/tex'>r</script>  be the size of the production, and  <span class='MathJax_Preview'><img src='http://blog.codalism.com/wp-content/plugins/latex/cache/tex_8d9c307cb7f3c4a32822a51922d1ceaa.gif' style='vertical-align: middle; border: none; padding-bottom:1px;' class='tex' alt="N" /></span><script type='math/tex'>N</script>  the size of the null set.  The sample size  <span class='MathJax_Preview'><img src='http://blog.codalism.com/wp-content/plugins/latex/cache/tex_7b8b965ad4bca0e41ab51de7b31363a1.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="n" /></span><script type='math/tex'>n</script>  required for an all-negative sample to satisfy our protocol is:</p>
<p><center><br />
 <span class='MathJax_Preview'><img src='http://blog.codalism.com/wp-content/plugins/latex/cache/tex_22d35b25602df9c776fdeb342dd06ccf.gif' style='vertical-align: middle; border: none; ' class='tex' alt="n = \log_{1 - r/N}0.05" /></span><script type='math/tex'>n = \log_{1 - r/N}0.05</script> <br />
</center></p>
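<p>In R, the formula above can be evaluated with natural logarithms, since a logarithm to base <tt>1 - r/N</tt> is a ratio of two logs.  This is a small sketch; the function name is only illustrative, and the result is rounded up because a sample size must be a whole number.</p>

```r
# Smallest all-negative QA sample giving `conf` confidence of at least 50% recall.
# r:      size of the production (all documents verified relevant)
# N_null: size of the null set
satisfiable_sample_size <- function(r, N_null, conf = 0.95) {
  ceiling(log(1 - conf) / log(1 - r / N_null))
}

satisfiable_sample_size(20,   700000)  # on the order of 100,000
satisfiable_sample_size(2000, 700000)  # on the order of 1,000
```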
<p>Setting the null set size at 700,000, as for Ralph's production, this equates to the following figure:</p>
<div id="attachment_1710" class="wp-caption aligncenter" style="width: 310px"><img src="http://blog.codalism.com/wp-content/uploads/2012/08/smplsz.png" alt="Sample size for production size" title="smplsz" width="300" height="300" class="size-full wp-image-1710" /><p class="wp-caption-text">Sample size for production size</p></div>
<p>Required sample size drops precipitously with increased production size; so precipitously that it is difficult to read off corresponding values.  Plotting on a log-log graph makes the correspondence clearer; the relationship becomes a straight line:</p>
<div id="attachment_1711" class="wp-caption aligncenter" style="width: 310px"><img src="http://blog.codalism.com/wp-content/uploads/2012/08/smplsz-log.png" alt="Sample size for production size (log-log)" title="smplsz-log" width="300" height="300" class="size-full wp-image-1711" /><p class="wp-caption-text">Sample size for production size (log-log)</p></div>
<p>While a production of 20 documents requires a sample size around 100,000 for our QA protocol to be satisfiable, a sample size of 1000 documents is sufficient once the production exceeds 2000 documents.  The size of the null set also affects minimum protocol-satisfiable sample size; where  <span class='MathJax_Preview'><img src='http://blog.codalism.com/wp-content/plugins/latex/cache/tex_eec41fc9fe178a583911305266e71296.gif' style='vertical-align: middle; border: none; ' class='tex' alt="r \ll N" /></span><script type='math/tex'>r \ll N</script> , doubling the null set size roughly doubles the sample size required.  Nevertheless, if the production is large enough, then our QA protocol is satisfiable with a feasibly-sized sample.</p>
<p>In short, the informativeness of a QA sample depends upon the characteristics of topic and collection.  If relevant documents are rare, then even an effective production will be small; meanwhile, sampling can set only so low an upper bound on the proportion of relevant documents in the null set; and if the collection itself is large, then the plausible number of missed relevant documents will overwhelm the number actually found.  If relevant documents are less rare, though, and if the collection size has been kept under control, then productions will be larger, plausibly missed documents will be fewer, and QA sampling is able to confirm a satisfactory level of recall.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.codalism.com/?feed=rss2&#038;p=1705</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Quality assurance samples and prior beliefs</title>
		<link>http://blog.codalism.com/?p=1699</link>
		<comments>http://blog.codalism.com/?p=1699#comments</comments>
		<pubDate>Wed, 08 Aug 2012 07:56:58 +0000</pubDate>
		<dc:creator>william</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.codalism.com/?p=1699</guid>
		<description><![CDATA[Those who are following Ralph Losey's live-blogged production of material on involuntary termination from the EDRM Enron collection will know that he has reached what was to be the quality assurance step (though he has decided to do at least one more iteration of production for the sake of scientific verification). Quality assurance here involves [...]]]></description>
				<content:encoded><![CDATA[<p>Those who are following Ralph Losey's <a href="http://e-discoveryteam.com/2012/08/05/day-nine-of-a-predictive-coding-narrative-a-scary-search-for-false-negatives-a-comparison-of-my-car-with-the-griswolds-and-a-moral-dilemma/#comments">live-blogged production of material on involuntary termination from the EDRM Enron collection</a> will know that he has reached what was to be the quality assurance step (though he has decided to do at least one more iteration of production for the sake of scientific verification).   Quality assurance here involves taking a final sample from the part of the collection that is not to be produced -- what Ralph terms the "null set" -- and checking to see if any relevant documents have been missed.  The outcome of this QA sample has led to an interesting discussion between Ralph and Gord Cormack on the use and meaning of confidence intervals, and how sure we can really be that (almost) no relevant documents have been missed.   I've commented on the discussion at Ralph's blog; I thought it would not be amiss to expand upon those comments here.<br />
<span id="more-1699"></span></p>
<p>In his production, Ralph is separately coding relevant and highly relevant documents.  Let's look at the case of estimating the number of highly relevant documents that may have been missed by the production; the numbers involved are smaller, and more extreme cases of the sampling and estimation procedure are tested.  </p>
<p>Ralph's production of relevant and highly relevant documents can be divided into three main stages:</p>
<ol>
<li>An initial simple random sample of 1507 documents from the full collection of 699,082 documents, none of which sampled documents were found to be highly relevant.</li>
<li>Several iterations of search, predictive coding, and review, culminating in a production list of 659 relevant documents, of which 18 were highly relevant.  All of these 659 documents have been reviewed by Ralph and confirmed relevant.  This leaves  699,082 - 659 = 698,423 documents in the null set.  Some of the null set have also been reviewed by Ralph and found irrelevant, but that information is ignored for the next step.</li>
<li>A final random sample of 1065 from the null set, of which 0 were found to be highly relevant.  We'll refer to this as the quality assurance (QA) sample.</li>
</ol>
<p>The question now is, how many highly relevant documents may Ralph have missed in the null set, and thus have failed to produce?  Ralph observes that since the QA sample produced no highly relevant documents, our best estimate is that no highly relevant documents have been missed; if we add in the evidence of the initial sample, which also found no highly relevant documents, and of the production process, which only found 18 highly relevant documents, then it seems likely that very few, if any, have been missed.</p>
<p>At this point, though, Gordon observes that if we calculate a confidence interval on the number of highly relevant documents, we get rather a different story.  Gordon calculates a two-tailed 95% confidence interval, but since we're looking for a probabilistic upper bound (and the lower bound is evidently going to be 0), I'll use a one-tailed upper confidence interval.  Using the exact (Clopper-Pearson) interval on a proportion, the one-tailed 95% confidence upper bound on the proportion of highly relevant documents, based upon finding 0 in a sample of 1065, is 0.281%.  Given there are 698,423 documents in the null set, this amounts to 1962 highly relevant documents.  Thus, based on this final sample alone, the most we can say with 95% confidence is that we've found 18 / (18 + 1962) = 1% of the highly relevant documents.  This is not, on the surface, an encouraging finding.</p>
<p>How do we manage to conjure the ghosts of 1962 highly relevant documents potentially lurking in the null set from a sample containing 0 such documents?  We work backwards as follows.  Assume that 0.281% of the null set was in fact highly relevant; that is, that 99.719% of the null set is not highly relevant.  What is the probability of sampling 1065 documents from a set in which 99.719% are not highly relevant and having no highly relevant documents in the sample?  It is 0.99719 ^ 1065 = 5%.  Thus, our upper bound on the 95% interval is the proportion relevant for which our observed sample result (or fewer -- but you can't have fewer than 0) has a 5% chance of occurring.</p>
<p>At this stage, however, as Ralph observes, something seems to be off in our reasoning.  We did an initial random sample of 1507 documents, of which none were highly relevant.  The 95% upper bound on that sample is 0.198%, or 1388 highly relevant documents.  We then did a production and found 18 highly relevant documents.  Now we're estimating an upper bound of 1962 highly relevant documents.  Why have we gone backwards?</p>
<p>The issue is that one cannot in general simply add up the evidence from the two samples, especially when the first sample has been used to help form the production that the second sample is evaluating.  Care needs to be taken in assessment protocols to make sure that sample evidence does not become "polluted" by being entangled in the process being measured; this is expressed in the maxim "separate training and testing data".</p>
<p>In fact, for this particular protocol, we <i>can</i> more or less add the two samples together, provided we assume that the production has done at least as well as random (and so has avoided "pushing" highly relevant documents out of the production into the null set) -- and, given that 2.7% of the production is highly relevant, this is a reasonable enough assumption.  (And the production is too small to have much impact in any case).  So we can add the 1507 to the 1065 sample to derive a sample size of 2572, and a 95% confidence upper bound of 813 highly relevant documents left in the null set.  (We're violating the strict assumptions of the exact binomial interval here, but the answer is a very good approximation.)  But even with this reduced upper bound on the highly relevant documents missed, we've still only got a 95% lower bound of 18 / (18 + 813) = 2% recall.</p>
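<p>The upper bounds discussed above can be recomputed with a few lines of R.  When a sample contains no positives, the exact upper bound has a simple closed form, which is just the "working backwards" argument solved for the proportion.  The function name <tt>zero_hit_bound</tt> is my own; the sample and collection sizes are those given for Ralph's production.</p>

```r
# One-sided upper bound on a proportion when a sample of size n contains no positives.
# With zero positives the exact (Clopper-Pearson) bound reduces to 1 - (1 - conf)^(1/n).
zero_hit_bound <- function(n, conf = 0.95) 1 - (1 - conf)^(1 / n)

collection <- 699082   # full collection
null_set   <- 698423   # documents not slated for production

zero_hit_bound(1507) * collection        # initial sample alone
zero_hit_bound(1065) * null_set          # QA sample alone
zero_hit_bound(1507 + 1065) * null_set   # both samples pooled

# Lower-bound recall implied by the pooled bound, with 18 highly relevant documents found.
18 / (18 + zero_hit_bound(1507 + 1065) * null_set)
```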
<p>What about the fact that an expert searcher (Ralph) and a reputable predictive coding system (Kroll OnTrack's inView) have made several iterations over the document collection, and that they've only been able to find 18 highly relevant documents?  Does this influence our confidence?  Put another way, had the production been performed instead by randomly picking 659 documents, and a QA sample on the null set produced the same result of no highly relevant documents, would we assign the same probabilistic upper-bound estimate?  The answers to the latter two questions are, respectively, "no", and "yes": no, the fact that a careful production has been performed doesn't change our estimate; and yes, we'd give the same upper bound (for the same sample evidence) even if the production had been made at random.</p>
<p>It is not that our beliefs (prior to the sample) about the thoroughness of the production are irrational or have no evidentiary basis.  Subjectively, they have a reasonable foundation (though we'd be wise not to place too much trust in our own judgment).  Nor is it merely that we don't want to allow our estimate to depend upon assumptions about the quality of the process being estimated (though this is certainly part of the issue).  The central problem is that we don't know how objectively to quantify and justify the probabilities associated with our prior beliefs and the evidence they are based on.  We can't model the process by which the production has found the highly relevant documents it has, and so we can't say how likely it is that it has missed others.  Random sampling, however, follows a simple selection process that we can model and reason, probabilistically, about.  Chance is predictable; choice (by machine or human) is not: that is the foundational insight of modern statistical science.</p>
<p>Nevertheless, the self-imposed amnesia involved in the final QA sample emphasizes that such a sample does not, by itself, constitute an adequate assurance of the quality of a production.  (Put another way, the production could be quite inept, and there still be a good chance that the ineptitude would be missed by the QA sample, if relevant documents are rare enough.)  Rather, such QA sampling needs to take place as part of a proven production protocol, one which incorporates various assessment checks, even if their evidence cannot formally be combined into a single confidence interval.  The development of a standard for such protocols, as <a href="http://e-discoveryteam.com/2012/06/17/jason-barons-keynote-speech-boldly-going-where-few-judges-have-gone-before-the-emerging-case-law-on-software-assisted-document-review-and-our-next-5-year-mission/">recently proposed by Jason Baron through an ANSI working group</a>, can guide practitioners towards best practice, and relieve them of the responsibility of having to roll their own processes.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.codalism.com/?feed=rss2&#038;p=1699</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Tutorial on confidence intervals in e-discovery</title>
		<link>http://blog.codalism.com/?p=1695</link>
		<comments>http://blog.codalism.com/?p=1695#comments</comments>
		<pubDate>Thu, 02 Aug 2012 01:10:44 +0000</pubDate>
		<dc:creator>william</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.codalism.com/?p=1695</guid>
		<description><![CDATA[Ever since Judge Grimm opined that random sampling constituted a prudent means of checking the reliability of a production (Victor Stanley v. Creative Pipe, 269 F.R.D. 497), there has been strong interest in the topic of sampling within e-discovery, including from lawyers themselves. Ralph Losey, for instance, has devoted a post in his blog to the topic [...]]]></description>
				<content:encoded><![CDATA[<p>Ever since Judge Grimm opined that random sampling constituted a prudent means of checking the reliability of a production (Victor Stanley v. Creative Pipe, 269 F.R.D. 497), there has been strong interest in the topic of sampling within e-discovery, including from lawyers themselves.  Ralph Losey, for instance, has devoted a post in his blog to the <a href="http://e-discoveryteam.com/2012/05/06/random-sample-calculations-and-my-prediction-that-300000-lawyers-will-be-using-random-sampling-by-2022/">topic of sampling</a>, and his recent blog posts narrating an <a href="http://e-discoveryteam.com/2012/07/08/day-two-of-a-predictive-coding-narrative-more-than-a-random-stroll-down-memory-lane/">example predictive coding exercise</a> have contained much sampling-related material.</p>
<p>I've written some <a href="http://blog.codalism.com/?p=1592">research work</a> on more advanced topics in confidence intervals, but I thought it might be useful to write some more introductory material as well.  I originally intended to write a series of blog posts giving a brief tutorial on sampling and estimation, but the brief tutorial worked out to be around 5,000 words, so I've made it into a separate document: <a href="http://www.umiacs.umd.edu/~wew/papers/sisa.pdf">A tutorial on interval estimation for a proportion, with particular reference to e-discovery</a>.  The tutorial aims to give an understanding of the workings behind confidence intervals, while avoiding as much math as possible.  (If you want a still more high-level discussion of sampling, estimation, and intervals, then I recommend Venkat Rangan's post on <a href="http://www.clearwellsystems.com/e-discovery-blog/2012/07/06/predictive-coding-measurement-challenges-electronic-discovery/">Predictive Coding -- Measurement Challenges</a>.)  The tutorial is marked as Version 0.1; I'd be very grateful for any corrections, comments, or suggestions for improvement, and will work them in to later versions.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.codalism.com/?feed=rss2&#038;p=1695</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Do document reviewers need legal training?</title>
		<link>http://blog.codalism.com/?p=1609</link>
		<comments>http://blog.codalism.com/?p=1609#comments</comments>
		<pubDate>Sun, 15 Jul 2012 03:46:31 +0000</pubDate>
		<dc:creator>william</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.codalism.com/?p=1609</guid>
		<description><![CDATA[In my last post, I discussed an experiment in which we had two assessors re-assess TREC Legal documents with less and more detailed guidelines, and found that the more detailed guidelines did not make the assessors more reliable. Another natural question to ask of these results, though not one the experiment was directly designed to [...]]]></description>
				<content:encoded><![CDATA[<p>In my <a href="http://blog.codalism.com/?p=1663">last post</a>, I discussed an experiment in which we had two assessors re-assess TREC Legal documents with less and more detailed guidelines, and found that the more detailed guidelines did not make the assessors more reliable.  Another natural question to ask of these results, though not one the experiment was directly designed to answer, is how well our assessors compared with the first-pass assessors employed for TREC, who for this particular topic (Topic 204 from the 2009 Interactive task) happened to be a review team from a vendor of professional legal review services.  How well do our non-professional assessors compare to the professionals?<br />
<span id="more-1609"></span><br />
To answer this question, I'll take the official TREC qrels as gold standard, as Maura Grossman and Gord Cormack do in their paper <a href="http://jolt.richmond.edu/v17i3/article11.pdf">comparing technology-assisted with manual review</a>.  These qrels are derived from the first-pass assessments after alleged errors have been appealed by participants and adjudicated by the topic authority (see the <a href="http://trec.nist.gov/pubs/trec18/papers/LEGAL09.OVERVIEW.pdf">TREC 2009 Legal Track overview</a> for more details).  This topic authority is also the author of the detailed relevance criteria used by the TREC assessors and (in the second batch) by our experimental assessors in performing their assessments.  We'll measure reliability using mutual  <span class='MathJax_Preview'><img src='http://blog.codalism.com/wp-content/plugins/latex/cache/tex_bc6b0efd3bed4dfabe15757cf4089d87.gif' style='vertical-align: middle; border: none; ' class='tex' alt="F_1" /></span><script type='math/tex'>F_1</script>  score (also known as positive agreement) between the TREC or experimental assessors on the one hand, and the official assessments on the other.  (Cohen's  <span class='MathJax_Preview'><img src='http://blog.codalism.com/wp-content/plugins/latex/cache/tex_269cb4a8704d5fb203ad10436efe52d1.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="\kappa" /></span><script type='math/tex'>\kappa</script>  is an alternative measure, but I prefer mutual  <span class='MathJax_Preview'><img src='http://blog.codalism.com/wp-content/plugins/latex/cache/tex_bc6b0efd3bed4dfabe15757cf4089d87.gif' style='vertical-align: middle; border: none; ' class='tex' alt="F_1" /></span><script type='math/tex'>F_1</script>  for the current purposes as it is more easily interpretable in terms of retrieval effectiveness.  
A discussion of the two measures can be found in Section 3.3.2 of our draft survey paper on <a href="http://ediscovery.umiacs.umd.edu/pub/ow12fntir.pdf">Information Retrieval for E-Discovery</a>.)</p>
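<p>To make the measure concrete, here is a minimal sketch of mutual F_1 computed from the counts in a two-by-two agreement table.  The function and argument names are my own, chosen for illustration, and the counts in the usage note are made up rather than taken from the experiment.  The second function computes F_1 the usual retrieval way, from precision and recall, treating one assessor as the gold standard; the two agree whichever assessor plays that role, which is why the measure is called mutual.</p>

```python
def mutual_f1(both_relevant, only_first, only_second):
    """Mutual F1 (positive agreement) between two sets of binary assessments.

    both_relevant: documents both assessors marked relevant
    only_first:    documents only the first assessor marked relevant
    only_second:   documents only the second assessor marked relevant
    Documents both assessors marked irrelevant do not enter the measure.
    """
    denominator = 2 * both_relevant + only_first + only_second
    if denominator == 0:
        raise ValueError("neither assessor marked any document relevant")
    return 2 * both_relevant / denominator


def f1_against_gold(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall against a gold standard."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

<p>For instance, <code>mutual_f1(40, 10, 30)</code> gives the same value as <code>f1_against_gold(40, 10, 30)</code> and as <code>f1_against_gold(40, 30, 10)</code>, where the roles of assessor and gold standard are swapped.</p>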
<blockquote><p>
There's a slightly tricky question about sampling rates and gold standard reliability to consider also.  (Readers not interested in statistical niceties can skip this and the next paragraph.) The sample for our experiment was designed to produce an even balance of officially relevant and officially irrelevant documents, since this provided the greatest statistical power for the comparison we were making (in that case using Cohen's  <span class='MathJax_Preview'><img src='http://blog.codalism.com/wp-content/plugins/latex/cache/tex_269cb4a8704d5fb203ad10436efe52d1.gif' style='vertical-align: middle; border: none; padding-bottom:2px;' class='tex' alt="\kappa" /></span><script type='math/tex'>\kappa</script> ).  As it happens, this sampling also increases the concentration of appealed documents.  So directly comparing results on only the experimental sample would be unfair to the TREC assessors.  Instead, we perform a sample-based extrapolation from the experimental assessments to the sample drawn for assessment at TREC (though not to the full corpus, since this would increase estimate variance without fundamentally changing estimate accuracy).</p>
<p>A second nicety is in the handling of what I call the "bottom stratum", of documents retrieved by no TREC participant.  This stratum was very lightly sampled from the TREC to the experimental sample (as it was from the corpus into the TREC sample), so (in extrapolation) each document here has a big impact upon comparisons of effectiveness.  At the same time, no team had an incentive to (and no team did) appeal assessments of irrelevance in this stratum, so any false negatives (actually relevant documents assessed as irrelevant) will have been missed.  Thus, including the bottom stratum potentially biases the comparison in favour of the TREC assessors.  I therefore report results both with and without the bottom stratum considered.
</p></blockquote>
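<p>The effect of a lightly sampled stratum on an extrapolated score can be sketched as follows.  This is an illustration only, assuming simple inverse-probability weighting (each sampled document stands in for one over its sampling rate documents); it is not the exact estimator used for the figures in this post, and the sampling rates and counts are hypothetical.</p>

```python
def extrapolated_f1(docs):
    """Weighted mutual F1 over sampled documents.

    docs: iterable of (sampling_rate, gold_relevant, assessed_relevant),
    where sampling_rate is the probability that the document was sampled.
    Each document is weighted by the inverse of its sampling rate.
    """
    tp = fp = fn = 0.0
    for sampling_rate, gold_relevant, assessed_relevant in docs:
        weight = 1.0 / sampling_rate
        if gold_relevant and assessed_relevant:
            tp += weight
        elif assessed_relevant:
            fp += weight
        elif gold_relevant:
            fn += weight
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else float("nan")


# Hypothetical upper strata, sampled at a rate of one in two.
upper = (
    [(0.5, True, True)] * 40      # agreed relevant
    + [(0.5, False, True)] * 5    # assessor says relevant, gold says not
    + [(0.5, True, False)] * 5    # gold says relevant, assessor says not
)
# Hypothetical bottom stratum, sampled at one in a hundred:
# just two disagreements, each standing in for 100 documents.
bottom = [(0.01, False, True)] * 2

without_bottom = extrapolated_f1(upper)
with_bottom = extrapolated_f1(upper + bottom)
print(without_bottom, with_bottom)
```

<p>Two documents in the bottom stratum carry more weight here than all the disagreements in the upper strata combined, which is why a handful of assessments there can swing the comparison.</p>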
<p>The statistical prolegomena done with, we can turn to the results.  The original experiment involved three treatments: a first batch with the topic statement only; a second batch with the detailed criteria of relevance; and then both batches jointly re-assessed by both assessors in a (mostly successful) attempt to reach agreement on relevance.  Below, we report mutual  <span class='MathJax_Preview'><img src='http://blog.codalism.com/wp-content/plugins/latex/cache/tex_bc6b0efd3bed4dfabe15757cf4089d87.gif' style='vertical-align: middle; border: none; ' class='tex' alt="F_1" /></span><script type='math/tex'>F_1</script>  scores with the official qrels for each batch.  The TREC assessments were not (and cannot be) divided into the same treatments, so only the one  <span class='MathJax_Preview'><img src='http://blog.codalism.com/wp-content/plugins/latex/cache/tex_bc6b0efd3bed4dfabe15757cf4089d87.gif' style='vertical-align: middle; border: none; ' class='tex' alt="F_1" /></span><script type='math/tex'>F_1</script>  score is reported for all three treatments.  (Put another way, each batch provides an estimate on the full evaluation sample under different treatments.)  The TREC assessors worked independently to the topic authority's detailed guidelines, so the second (guidelines) batch is the fairest comparison with our experimental assessors.  I also show the unextrapolated  <span class='MathJax_Preview'><img src='http://blog.codalism.com/wp-content/plugins/latex/cache/tex_bc6b0efd3bed4dfabe15757cf4089d87.gif' style='vertical-align: middle; border: none; ' class='tex' alt="F_1" /></span><script type='math/tex'>F_1</script>  score of the strongest participant team for this topic (the industry team included in Grossman and Cormack's analysis).</p>
<p>Here then, finally, are the results:</p>
<table cellpadding="5px" border="1px">
<tr>
<th rowspan="2">Document set</th>
<th rowspan="2">Batch</th>
<th colspan="4" align="center">Assessor</th>
</tr>
<tr>
<th>Exp-A</th>
<th>Exp-B</th>
<th>TREC</th>
<th>Team-I</th>
</tr>
<tr>
<td>With bottom stratum</td>
<td>Topic</td>
<td>0.83</td>
<td>0.22</td>
<td>&darr;</td>
<td>&darr;</td>
</tr>
<tr>
<td>&nbsp;</td>
<td>Guidelines</td>
<td>0.73</td>
<td>0.26</td>
<td>0.33</td>
<td>0.89</td>
</tr>
<tr>
<td>&nbsp;</td>
<td>Joint</td>
<td>0.68</td>
<td>0.71</td>
<td>&uarr;</td>
<td>&uarr;</td>
</tr>
<tr>
<td>Without bottom stratum</td>
<td>Topic</td>
<td>0.83</td>
<td>0.68</td>
<td>&darr;</td>
<td>&darr;</td>
</tr>
<tr>
<td>&nbsp;</td>
<td>Guidelines</td>
<td>0.75</td>
<td>0.59</td>
<td>0.35</td>
<td>0.89</td>
</tr>
<tr>
<td>&nbsp;</td>
<td>Joint</td>
<td>0.70</td>
<td>0.73</td>
<td>&uarr;</td>
<td>&uarr;</td>
</tr>
</table>
<p>We can see that, if the bottom stratum is excluded, both our assessors (Exp-A and Exp-B) outperform the TREC assessor.  In each of the two batches, Exp-B found 2 officially irrelevant documents in the bottom stratum to be relevant, so if this stratum is included in the comparison, Exp-B's score is depressed to below that of the TREC assessor (note the strong sampling effect in this stratum), though assessor Exp-A's scores and the joint-review scores are barely affected.  It was noted above that including the bottom stratum is potentially biased in favour of the TREC assessors, though it should be said that for the four documents in question, our assessors jointly concluded that they weren't in fact relevant (and, having viewed them myself, I concur).</p>
<p>I've not checked the above results for statistical significance.  Given the sampling complexities involved, that would be a tricky thing to do.  My suspicion is that the differences without the bottom stratum may be significant, but with the bottom stratum likely are not, due to the much greater sampling variability in the latter case.  More to the point, though, this is only a single topic, in an experiment not directly set up to measure the comparison made here (and where even the official assessments might contain some remaining errors or ambiguities); even a finding of statistical significance would not show that our assessor A is "provably more reliable" than the TREC assessors.  There are also differences of conditions involved; our assessors worked three-hour shifts, assessing around 60 documents (assessor A) to 30 documents (assessor B) an hour, which are less fatiguing conditions than perhaps are typical in large-scale manual review (though the conditions of the professional TREC review are not specified in the TREC overview document).  The conclusion that can be reached, though, is that our assessors were able to achieve reliability (with or without detailed assessment guidelines) that is competitive with that of the professional reviewers -- and also competitive with that of a commercial e-discovery vendor.</p>
<p>Time to meet our crack reviewers:</p>
<div id="attachment_1613" class="wp-caption alignleft" style="width: 179px"><img class="size-full wp-image-1613   " title="bryan" src="http://blog.codalism.com/wp-content/uploads/2012/04/bryan.jpg" alt="Intern-reviewer Bryan" width="169" height="210" /><p class="wp-caption-text">Intern-reviewer Bryan</p></div>
<div id="attachment_1614" class="wp-caption alignleft" style="width: 182px"><img class="size-full wp-image-1614  " title="marjorie" src="http://blog.codalism.com/wp-content/uploads/2012/04/marjorie.jpg" alt="Intern-reviewer Marjorie" width="172" height="213" /><p class="wp-caption-text">Intern-reviewer Marjorie</p></div>
<p><br clear="both"></p>
<p>At the time of the experiment, Marjorie and Bryan were high school seniors working in the E-Discovery lab as interns four mornings a week.  (Since then they have become ex-high-school students waiting for their summer holidays to end and their studies at the University of Maryland to commence.) They have no legal training, and no prior e-discovery experience, aside from assessing a few dozen documents for a different TREC topic as part of a trial experiment.  They performed assessments on TIFF files, displayed on rather underpowered laptops that couldn't fit the full TIFF onto the screen and tended to crash every now and then.  They worked independently and without supervision or correction, though one would be correct to describe them as careful and motivated.</p>
<p>All of this raises the question that is posed in the subject of this post: if (some) high school students are as reliable as (some) legally-trained, professional e-discovery reviewers, then is legal training a practical (as opposed to legal) requirement for reliable first-pass review for responsiveness?   Or are care and general reading skills the more important factors?</p>
<p>As it happens, the same question has been addressed in a couple of other recent studies, which reached similar results to our own, and thus add some support to our tentative finding, namely: that legal training does not confer expertise in document review.  In <a href="http://onlinelibrary.wiley.com/doi/10.1002/meet.14504701157/abstract">A User Study of Relevance Judgments for E-Discovery</a> (Proc. ASIST, 2010), Jianqiang Wang and Dagobert Soergel had four law school and four library and information studies (LIS) students re-assess documents from four TREC topics.  In an exit interview, the law students stated that their legal training and experience was helpful in performing their assessments, whereas the LIS students felt such experience would not have helped much.  The LIS students appear to have been correct: there was little or no difference between the two assessor groups in reliability or speed.  In another study, <a href="http://onlinelibrary.wiley.com/doi/10.1002/meet.2008.14504503126/abstract">Legal Discovery: Does Domain Expertise Matter?</a> (Proc. ASIST, 2008), Efthimis N. Efthimiadis and Mary A. Hotchkiss had six groups of MLIS students build a run for the TREC 2007 interactive task.  Two of these groups consisted of law librarianship students with JD degrees and professional experience as lawyers or legal searchers; but they achieved no greater reliability than the remaining four groups, who had no legal experience.</p>
<p>Some TREC 2009 topics were initially assessed by law students, others by professional reviewers.  The difference in expertise level can be considered here, though with two caveats: first, we're comparing assessors on different topics, and variability in the assessment difficulty of topics appears to be high (see Table 3.5 of our <a href="http://ediscovery.umiacs.umd.edu/pub/ow12fntir.pdf">draft survey</a>); and second, we don't know how the professional reviewers conducted their reviews, and there could be differences in process as well as raw expertise.  Of the five topics in the TREC 2009 Legal Interactive task that were appealed with sufficient thoroughness for us to be confident that most errors were found (see my SIRE 2011 paper, <a href="http://www.umiacs.umd.edu/~wew/papers/w11sire.pdf">Re-examining the Effectiveness of Manual Review</a>, for a fuller discussion), three were initially assessed by professional review teams, two by volunteer law students.  The picture of the benefits of expertise is mixed for this dataset.  If we consider only the documents actually assessed (without making a variance-increasing extrapolation to the corpus), then the law students seem roughly as reliable as the professional reviewers (Table 4 of <a href="http://www.umiacs.umd.edu/~wew/papers/w11sire.pdf">Re-examining</a>), though if we do extrapolate to the full population, one of the professional review teams does outperform the two student and other two professional teams in both reliability (ibid., Table 2) and consistency (ibid., Figure 2) -- though perhaps this is explained by good process (and care in recruitment), rather than greater legal training.</p>
<p>Now, even if we take at face value the finding that legal training does little to improve reliability in first-pass document review, this does not mean that there are not systematic differences in the skills of reviewers (some will be more careful and attentive readers than others, for instance), or the accuracy of professional review teams (depending upon the quality of their processes and their care in recruiting said skilled reviewers).  I'd not recommend rounding up high school students on summer break to save review costs.  And of course legal training and expertise are required to frame, monitor, and interpret the production.  Our finding might suggest that first-pass document review is not practice of the law (since it apparently does not employ legal skills), and therefore that using non-lawyers to perform it should not be considered <a href="http://e-discoveryteam.com/2012/06/23/going-all-out-for-predictive-coding-and-vendor-cost-savings/">unauthorized practice of the law</a>, but I'll offer a bottle of wine to the lawyer brave enough to argue that in court on the basis of this blog post.  Perhaps in the construction of e-discovery test collections, we do not have to insist that all reviewers have legal training, though again this depends upon the acceptability of this to the practicing e-discovery community.  The research implications are clearer: we don't know what makes a reliable reviewer; legal training doesn't appear to play a major part, which is a negative outcome; but that in compensation means that findings on the conception and perception of relevance from outside the e-discovery domain can with more confidence be applied within it.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.codalism.com/?feed=rss2&#038;p=1609</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
	</channel>
</rss>
