<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments for IREvalEtAl</title>
	<atom:link href="http://blog.codalism.com/?feed=comments-rss2" rel="self" type="application/rss+xml" />
	<link>http://blog.codalism.com</link>
	<description>William Webber's Research Blog</description>
	<lastBuildDate>Sun, 19 May 2013 22:33:34 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
	<item>
		<title>Comment on Stratified sampling in e-discovery evaluation by Robots From The Not-Too-Distant Future Explain How They Use Random Sampling For Artificial Intelligence Based Evidence Search &#124; e-Discovery Team ®</title>
		<link>http://blog.codalism.com/?p=1767&#038;cpage=1#comment-286296</link>
		<dc:creator>Robots From The Not-Too-Distant Future Explain How They Use Random Sampling For Artificial Intelligence Based Evidence Search &#124; e-Discovery Team ®</dc:creator>
		<pubDate>Sun, 19 May 2013 22:33:34 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1767#comment-286296</guid>
		<description><![CDATA[[...] invaluable, including such esoteric topics as Gaussian and Binomial calculations, Simple Random and Stratified Random sampling (William&#8217;s speciality), quality control sampling for testing, as opposed to training, [...]]]></description>
		<content:encoded><![CDATA[<p>[...] invaluable, including such esoteric topics as Gaussian and Binomial calculations, Simple Random and Stratified Random sampling (William&#8217;s speciality), quality control sampling for testing, as opposed to training, [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on What is the maximum recall in re Biomet? by william</title>
		<link>http://blog.codalism.com/?p=1808&#038;cpage=1#comment-280623</link>
		<dc:creator>william</dc:creator>
		<pubDate>Mon, 06 May 2013 14:16:16 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1808#comment-280623</guid>
		<description><![CDATA[James,

I&#039;d committed a calculation error myself -- the null set was 15.6 million, not 16.6 million, documents (corrected in post).  This would be undeduplicated, if I read the affidavit correctly.  That still means that around 40% of the responsive material is excluded by the keyword filter.]]></description>
		<content:encoded><![CDATA[<p>James,</p>
<p>I'd committed a calculation error myself -- the null set was 15.6 million, not 16.6 million, documents (corrected in post).  This would be undeduplicated, if I read the affidavit correctly.  That still means that around 40% of the responsive material is excluded by the keyword filter.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on What is the maximum recall in re Biomet? by James Keuning</title>
		<link>http://blog.codalism.com/?p=1808&#038;cpage=1#comment-280422</link>
		<dc:creator>James Keuning</dc:creator>
		<pubDate>Mon, 06 May 2013 00:16:23 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1808#comment-280422</guid>
		<description><![CDATA[Thanks for writing this.  I am still chewing on the numbers.  I have asked a few questions on some other blogs and haven&#039;t gotten a response (http://bit.ly/15mqa5f).

One question about your numbers - you say that the 16.6 million were sampled.  My understanding is that only the 15.6 million null set was sampled - the 1.4 million documents which were deduplicated out were not sampled.  Thus we do not know the responsiveness of the duped out docs.  Perhaps they were 100% responsive, perhaps zero!

&quot;The remaining 15,576,529 documents not selected by keywords (referred to as the “null set”)...&quot;

&quot;To obtain the relevance rate of the null set, Biomet reviewed a random sample of 4,146 documents...&quot;]]></description>
		<content:encoded><![CDATA[<p>Thanks for writing this.  I am still chewing on the numbers.  I have asked a few questions on some other blogs and haven't gotten a response (<a href="http://bit.ly/15mqa5f" rel="nofollow">http://bit.ly/15mqa5f</a>).</p>
<p>One question about your numbers - you say that the 16.6 million were sampled.  My understanding is that only the 15.6 million null set was sampled - the 1.4 million documents which were deduplicated out were not sampled.  Thus we do not know the responsiveness of the duped out docs.  Perhaps they were 100% responsive, perhaps zero!</p>
<p>"The remaining 15,576,529 documents not selected by keywords (referred to as the “null set”)..."</p>
<p>"To obtain the relevance rate of the null set, Biomet reviewed a random sample of 4,146 documents..."</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on What is the maximum recall in re Biomet? by Greg A.</title>
		<link>http://blog.codalism.com/?p=1808&#038;cpage=1#comment-279373</link>
		<dc:creator>Greg A.</dc:creator>
		<pubDate>Wed, 01 May 2013 20:58:18 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1808#comment-279373</guid>
		<description><![CDATA[Putting aside the issue of the incorrect math, I think you rightly note that this case hinges on the efficacy of the keyword searches.   I think it is a bit too broad of a brush to say that search terms will always exclude an (unreasonably) large number of relevant documents.

There are widely known methodologies to test and improve search term recall and precision.  Used correctly, I think that search terms are still an efficient and defensible means to cull the review universe prior to the application of predictive coding.  I don&#039;t think you will hear that very much from the vendor side as it doesn&#039;t sell per-gigabyte predictive coding charges.

In this case, it is not clear whether any of these techniques were used, and there is no analysis showing that the search terms&#039; recall was reasonable in proportion to the additional cost necessary to improve it.

I&#039;ll let Losey speak as to the lower bounds of acceptable recall for either search terms or predictive coding, but my two cents is that the test should be on the basis of proportionality.]]></description>
		<content:encoded><![CDATA[<p>Putting aside the issue of the incorrect math, I think you rightly note that this case hinges on the efficacy of the keyword searches.   I think it is a bit too broad of a brush to say that search terms will always exclude an (unreasonably) large number of relevant documents.</p>
<p>There are widely known methodologies to test and improve search term recall and precision.  Used correctly, I think that search terms are still an efficient and defensible means to cull the review universe prior to the application of predictive coding.  I don't think you will hear that very much from the vendor side as it doesn't sell per-gigabyte predictive coding charges.</p>
<p>In this case, it is not clear whether any of these techniques were used, and there is no analysis showing that the search terms' recall was reasonable in proportion to the additional cost necessary to improve it.</p>
<p>I'll let Losey speak as to the lower bounds of acceptable recall for either search terms or predictive coding, but my two cents is that the test should be on the basis of proportionality.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Stratified sampling in e-discovery evaluation by Ralph Losey</title>
		<link>http://blog.codalism.com/?p=1767&#038;cpage=1#comment-275881</link>
		<dc:creator>Ralph Losey</dc:creator>
		<pubDate>Thu, 18 Apr 2013 20:46:23 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1767#comment-275881</guid>
		<description><![CDATA[Well done. I could easily understand it all. And I love the dog examples.]]></description>
		<content:encoded><![CDATA[<p>Well done. I could easily understand it all. And I love the dog examples.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Computer science is not real science by Guest</title>
		<link>http://blog.codalism.com/?p=938&#038;cpage=1#comment-258107</link>
		<dc:creator>Guest</dc:creator>
		<pubDate>Wed, 20 Feb 2013 19:37:10 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=938#comment-258107</guid>
		<description><![CDATA[This is a lame discussion, and statement. Definitions serve a purpose. You come up with one definition of science. Has there not been any discussion on the definition of science before somewhere? Couldn&#039;t you refer us to a source of such a discussion? Scientists who start claiming a definition with &quot;but this I claim is what the core activity of the natural sciences has turned out to be&quot; can&#039;t be taken very seriously. We can end this discussion quite simply by putting computer science on a two dimensional continuum with &quot;real science&quot; on the left end, and &quot;no real science&quot; on the other end. You would obviously come to the conclusion that computer science will be somewhat less to the left than, e.g., mathematics or physics. How big of a difference that is can be a point of discussion, but it does not serve any practical purpose. Computer science is a science in the form of a container model (or can be seen as such). Just like Business Administration and others.]]></description>
		<content:encoded><![CDATA[<p>This is a lame discussion, and statement. Definitions serve a purpose. You come up with one definition of science. Has there not been any discussion on the definition of science before somewhere? Couldn't you refer us to a source of such a discussion? Scientists who start claiming a definition with "but this I claim is what the core activity of the natural sciences has turned out to be" can't be taken very seriously. We can end this discussion quite simply by putting computer science on a two dimensional continuum with "real science" on the left end, and "no real science" on the other end. You would obviously come to the conclusion that computer science will be somewhat less to the left than, e.g., mathematics or physics. How big of a difference that is can be a point of discussion, but it does not serve any practical purpose. Computer science is a science in the form of a container model (or can be seen as such). Just like Business Administration and others.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on PhD Thesis by william</title>
		<link>http://blog.codalism.com/?p=1473&#038;cpage=1#comment-248554</link>
		<dc:creator>william</dc:creator>
		<pubDate>Fri, 25 Jan 2013 17:23:22 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1473#comment-248554</guid>
		<description><![CDATA[Ritesh,

Hi!  Sorry for taking such a long time to reply: holidays and deadlines intervened.

Consider the following two uneven, incomplete lists:

S: &lt;a, b, ?, ??, ..&gt;
L: &lt;a, c, b, d, ...&gt;

X_1 is 1, since $$&#124;{a} \cap {a}&#124; = 1$$.  Therefore A_1 is 1 / 1 = 1.  X_2 is also 1, making A_2 be 1 / 2 = 0.5.  What though is X_3?  We know that &quot;b&quot; is a member of {a, b}, but is &quot;?&quot; a member of {a, c, b}?  We don&#039;t know.  Instead, we give it a putative membership of A_2; that is, of 0.5.  Therefore, X_3 is 1 + (1 + 0.5)/2 = 1.75.  And A_3 is 1.75 / 3 = 0.583.  This then works out to the formula shown in Equation 32 (I hope! -- at least it does every time I recreate the working for this formula).]]></description>
		<content:encoded><![CDATA[<p>Ritesh,</p>
<p>Hi!  Sorry for taking such a long time to reply: holidays and deadlines intervened.</p>
<p>Consider the following two uneven, incomplete lists:</p>
<p>S: &lt;a, b, ?, ??, ..&gt;<br />
L: &lt;a, c, b, d, ...&gt;</p>
<p>X_1 is 1, since  <span class='MathJax_Preview'><img src='http://blog.codalism.com/wp-content/plugins/latex/cache/tex_39ae378d916a3a12782f70379dc188be.gif' style='vertical-align: middle; border: none; ' class='tex' alt="|{a} \cap {a}| = 1" /></span><script type='math/tex'>|{a} \cap {a}| = 1</script> .  Therefore A_1 is 1 / 1 = 1.  X_2 is also 1, making A_2 be 1 / 2 = 0.5.  What though is X_3?  We know that "b" is a member of {a, b}, but is "?" a member of {a, c, b}?  We don't know.  Instead, we give it a putative membership of A_2; that is, of 0.5.  Therefore, X_3 is 1 + (1 + 0.5)/2 = 1.75.  And A_3 is 1.75 / 3 = 0.583.  This then works out to the formula shown in Equation 32 (I hope! -- at least it does every time I recreate the working for this formula).</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on PhD Thesis by Ritesh</title>
		<link>http://blog.codalism.com/?p=1473&#038;cpage=1#comment-228324</link>
		<dc:creator>Ritesh</dc:creator>
		<pubDate>Sat, 08 Dec 2012 00:20:51 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1473#comment-228324</guid>
		<description><![CDATA[Hi William,

Thanks. Actually I was able to derive much of the equation. The only part that&#039;s unclear to me is how you compute A_l. The last part of equation 32 is ((X_l - X_s)/l + X_s/s).  As I understand it, this is extrapolating average intersection beyond the last element of the longer list. However I am not able to understand how you got that. By the way I really enjoyed the paper. It was long but really interesting.

Thanks]]></description>
		<content:encoded><![CDATA[<p>Hi William,</p>
<p>Thanks. Actually I was able to derive much of the equation. The only part that's unclear to me is how you compute A_l. The last part of equation 32 is ((X_l - X_s)/l + X_s/s).  As I understand it, this is extrapolating average intersection beyond the last element of the longer list. However I am not able to understand how you got that. By the way I really enjoyed the paper. It was long but really interesting.</p>
<p>Thanks</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on PhD Thesis by william</title>
		<link>http://blog.codalism.com/?p=1473&#038;cpage=1#comment-228217</link>
		<dc:creator>william</dc:creator>
		<pubDate>Fri, 07 Dec 2012 16:16:32 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1473#comment-228217</guid>
		<description><![CDATA[Ritesh,

Hi!  I wouldn&#039;t try to derive it from Equation 23.  Instead, you should go back to the derivation of Equation 23 from Equation 7 and Equation 9 with X_k / k instead of X_k / d.  The expression $((1 - p) / p) \sum_{d=k+1}^\infty p^d (X_k / k)$ goes simply to $X_k/k \cdot p^k$ using the geometric series.  The rightmost term of Equation 32 is the generalization of $X_k / k$ for uneven lists.  Then the left term is simply summing up the per-ranking weighted overlaps from rank 1 to the depth of the longer list, adjusting for the gap between shorter and longer lists.

Sorry, I don&#039;t have my original working in front of me.  If this is still unclear, I&#039;ll have a go at reworking it over the weekend and put the working up somewhere for you.

William]]></description>
		<content:encoded><![CDATA[<p>Ritesh,</p>
<p>Hi!  I wouldn't try to derive it from Equation 23.  Instead, you should go back to the derivation of Equation 23 from Equation 7 and Equation 9 with X_k / k instead of X_k / d.  The expression $((1 - p) / p) \sum_{d=k+1}^\infty p^d (X_k / k)$ goes simply to $X_k/k \cdot p^k$ using the geometric series.  The rightmost term of Equation 32 is the generalization of $X_k / k$ for uneven lists.  Then the left term is simply summing up the per-ranking weighted overlaps from rank 1 to the depth of the longer list, adjusting for the gap between shorter and longer lists.</p>
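<p>A sketch of that geometric-series step, assuming $p$ lies strictly between 0 and 1 so that the tail sum converges, and treating $X_k / k$ as constant in $d$ so that it factors out of the sum:</p>
<p>$\frac{1 - p}{p} \sum_{d=k+1}^\infty p^d \frac{X_k}{k} = \frac{1 - p}{p} \cdot \frac{X_k}{k} \cdot \frac{p^{k+1}}{1 - p} = \frac{X_k}{k} \cdot p^k$</p>
<p>The middle expression uses only the standard tail sum $\sum_{d=k+1}^\infty p^d = p^{k+1} / (1 - p)$; the factors of $(1 - p)$ cancel and $p^{k+1} / p$ leaves $p^k$.</p>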
<p>Sorry, I don't have my original working in front of me.  If this is still unclear, I'll have a go at reworking it over the weekend and put the working up somewhere for you.</p>
<p>William</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on PhD Thesis by Ritesh</title>
		<link>http://blog.codalism.com/?p=1473&#038;cpage=1#comment-228168</link>
		<dc:creator>Ritesh</dc:creator>
		<pubDate>Fri, 07 Dec 2012 14:41:51 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1473#comment-228168</guid>
		<description><![CDATA[I should have mentioned that I am talking about the RBO paper &quot;A similarity measure for indefinite ranking&quot;]]></description>
		<content:encoded><![CDATA[<p>I should have mentioned that I am talking about the RBO paper "A similarity measure for indefinite ranking"</p>
]]></content:encoded>
	</item>
</channel>
</rss>
