<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Ad-hoc retrieval: measurably going nowhere</title>
	<atom:link href="http://blog.codalism.com/?feed=rss2&#038;p=1029" rel="self" type="application/rss+xml" />
	<link>http://blog.codalism.com/?p=1029</link>
	<description>William Webber's Research Blog</description>
	<lastBuildDate>Fri, 27 Aug 2010 02:57:11 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: SIGIR: Research vs. Reality &#171; MobBlog</title>
		<link>http://blog.codalism.com/?p=1029&#038;cpage=1#comment-12389</link>
		<dc:creator>SIGIR: Research vs. Reality &#171; MobBlog</dc:creator>
		<pubDate>Mon, 26 Jul 2010 10:56:16 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1029#comment-12389</guid>
		<description>[...] by x%&#8221;) rather than look at novel problems (see #1 above). However, while doing so, there is no evidence of long-term, cumulative progress in decades of publications. On the other hand, I continue to miss [...]</description>
		<content:encoded><![CDATA[<p>[...] by x%&#8221;) rather than look at novel problems (see #1 above). However, while doing so, there is no evidence of long-term, cumulative progress in decades of publications. On the other hand, I continue to miss [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: FXPAL Blog &#187; Blog Archive &#187; Maintaining relevance</title>
		<link>http://blog.codalism.com/?p=1029&#038;cpage=1#comment-12142</link>
		<dc:creator>FXPAL Blog &#187; Blog Archive &#187; Maintaining relevance</dc:creator>
		<pubDate>Thu, 22 Jul 2010 07:04:40 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1029#comment-12142</guid>
		<description>[...] do complex work, the quality of that work can be measured, and progress made. (Of course sometimes progress isn&#8217;t cumulative, but that&#8217;s a different [...]</description>
		<content:encoded><![CDATA[<p>[...] do complex work, the quality of that work can be measured, and progress made. (Of course sometimes progress isn&#8217;t cumulative, but that&#8217;s a different [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Le Zhao</title>
		<link>http://blog.codalism.com/?p=1029&#038;cpage=1#comment-7646</link>
		<dc:creator>Le Zhao</dc:creator>
		<pubDate>Tue, 16 Mar 2010 21:53:33 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1029#comment-7646</guid>
		<description>Interesting paper &amp; discussion.

There are actually two questions relevant to the discussion,

1. How should a baseline be chosen,

A baseline is to help the experimenter to understand what worked, and why it worked.  So I believe all baselines should be customized to the particular task under study.  E.g. all ad hoc retrieval models should be compared with BM25, Dirichlet LM; all phrase models should be compared with dependency model; all pseudo relevance feedback models should be compared with Rocchio and relevance model.

Now comes the second question,
2. what if the guideline is not followed, or how to ensure that standard.

Well, it&#039;s really a moral question, with practical means to ensure the quality of papers.

Generally it&#039;s the authors&#039; responsibility to make an effort to create a reasonable baseline, and it&#039;s the reviewers&#039; responsibility to assess whether that result is believable, and whether it will work for the state of the art.

But there will always be cases where such guidelines are not followed; such a paper *can* still be accepted if its contribution is believed to be significant by the reviewer.  However, I believe the proportion of such work in conferences should be kept low.</description>
		<content:encoded><![CDATA[<p>Interesting paper &amp; discussion.</p>
<p>There are actually two questions relevant to the discussion,</p>
<p>1. How should a baseline be chosen,</p>
<p>A baseline is to help the experimenter to understand what worked, and why it worked.  So I believe all baselines should be customized to the particular task under study.  E.g. all ad hoc retrieval models should be compared with BM25, Dirichlet LM; all phrase models should be compared with dependency model; all pseudo relevance feedback models should be compared with Rocchio and relevance model.</p>
<p>Now comes the second question,<br />
2. what if the guideline is not followed, or how to ensure that standard.</p>
<p>Well, it&#8217;s really a moral question, with practical means to ensure the quality of papers.</p>
<p>Generally it&#8217;s the authors&#8217; responsibility to make an effort to create a reasonable baseline, and it&#8217;s the reviewers&#8217; responsibility to assess whether that result is believable, and whether it will work for the state of the art.</p>
<p>But there will always be cases where such guidelines are not followed; such a paper *can* still be accepted if its contribution is believed to be significant by the reviewer.  However, I believe the proportion of such work in conferences should be kept low.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: FXPAL Blog &#187; Blog Archive &#187; Lack of progress as an opportunity for progress</title>
		<link>http://blog.codalism.com/?p=1029&#038;cpage=1#comment-3169</link>
		<dc:creator>FXPAL Blog &#187; Blog Archive &#187; Lack of progress as an opportunity for progress</dc:creator>
		<pubDate>Thu, 01 Oct 2009 15:01:25 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1029#comment-3169</guid>
		<description>[...] demonstrating long-term gains in system performance. This interesting analysis is summarized in a blog post by William [...]</description>
		<content:encoded><![CDATA[<p>[...] demonstrating long-term gains in system performance. This interesting analysis is summarized in a blog post by William [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sérgio Nunes</title>
		<link>http://blog.codalism.com/?p=1029&#038;cpage=1#comment-3154</link>
		<dc:creator>Sérgio Nunes</dc:creator>
		<pubDate>Wed, 30 Sep 2009 23:08:41 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1029#comment-3154</guid>
		<description>Very interesting post and discussion.

I tend to view TREC as an &quot;evaluation platform&quot; for researchers, not as a competition. However, I agree that a global perspective on the progress over the years is needed - maybe NIST could play a role in this...</description>
		<content:encoded><![CDATA[<p>Very interesting post and discussion.</p>
<p>I tend to view TREC as an &#8220;evaluation platform&#8221; for researchers, not as a competition. However, I agree that a global perspective on the progress over the years is needed &#8211; maybe NIST could play a role in this&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon</title>
		<link>http://blog.codalism.com/?p=1029&#038;cpage=1#comment-3125</link>
		<dc:creator>Jon</dc:creator>
		<pubDate>Tue, 29 Sep 2009 15:50:47 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1029#comment-3125</guid>
		<description>s/Ferdinand/Fernando/g</description>
		<content:encoded><![CDATA[<p>s/Ferdinand/Fernando/g</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: william</title>
		<link>http://blog.codalism.com/?p=1029&#038;cpage=1#comment-3094</link>
		<dc:creator>william</dc:creator>
		<pubDate>Tue, 29 Sep 2009 02:55:46 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1029#comment-3094</guid>
		<description>On Jon and &lt;del&gt;Ferdinand&#039;s&lt;/del&gt; Fernando&#039;s point on why the best TREC systems are not reasonable baselines, because of the high degree of manual tuning etc. that goes into them:

This is the really important question that arises out of the paper.  What is a reasonable baseline?  Similarly, how much effort should researchers have to put into improving their baseline systems, before they implement and test their innovation?

The argument we make in the paper is to note that a method which improves some vanilla system might not improve, or indeed might even harm, a state-of-the-art one.  In the paper, we do a proof-of-concept experiment to demonstrate this, based on toggling options for Indri.  Therefore, if the experimenter&#039;s claim is that they are presenting a method that improves retrieval performance over what is currently known, then they have an obligation to test it on state-of-the-art systems.  Or, alternatively (and this I think is the really superior method), they need to test their method on a number of different baselines, to demonstrate how generally beneficial it is; but here too, their mix of systems must include ones that implement state-of-the-art techniques.  These baselines should, for instance, include methods such as query expansion and proximity operators, which have been shown to lead to improved effectiveness, at least on some collections.  It seems that instead what people tend to do is to take something like plain BM25 as their baseline.

It could also be noted that while the best TREC runs may be subject to extensive manual tuning, they do face several disadvantages compared to subsequent experimental runs.  First, TREC systems don&#039;t have access to the topics in advance, whereas subsequent runs do.  (Yes, I know that direct use should not be made of this advantage, but I believe that it does confer an indirect advantage).  Second, TREC systems don&#039;t have the knowledge of which methods have actually worked on a particular collection, which subsequent systems do have.

All of that said, I&#039;m prepared to grant that the best TREC system may not always be an appropriate baseline.  But if so, shouldn&#039;t the best (or at least an improved) previously published method be the correct baseline choice?  The fact that neither baselines nor &quot;improved&quot; scores trend up over time indicates that this is not the experimental approach that has been adopted -- and that whatever approach it is that has been adopted, that approach is not leading to an improvement in effectiveness.</description>
		<content:encoded><![CDATA[<p>On Jon and <del>Ferdinand&#8217;s</del> Fernando&#8217;s point on why the best TREC systems are not reasonable baselines, because of the high degree of manual tuning etc. that goes into them:</p>
<p>This is the really important question that arises out of the paper.  What is a reasonable baseline?  Similarly, how much effort should researchers have to put into improving their baseline systems, before they implement and test their innovation?</p>
<p>The argument we make in the paper is to note that a method which improves some vanilla system might not improve, or indeed might even harm, a state-of-the-art one.  In the paper, we do a proof-of-concept experiment to demonstrate this, based on toggling options for Indri.  Therefore, if the experimenter&#8217;s claim is that they are presenting a method that improves retrieval performance over what is currently known, then they have an obligation to test it on state-of-the-art systems.  Or, alternatively (and this I think is the really superior method), they need to test their method on a number of different baselines, to demonstrate how generally beneficial it is; but here too, their mix of systems must include ones that implement state-of-the-art techniques.  These baselines should, for instance, include methods such as query expansion and proximity operators, which have been shown to lead to improved effectiveness, at least on some collections.  It seems that instead what people tend to do is to take something like plain BM25 as their baseline.</p>
<p>It could also be noted that while the best TREC runs may be subject to extensive manual tuning, they do face several disadvantages compared to subsequent experimental runs.  First, TREC systems don&#8217;t have access to the topics in advance, whereas subsequent runs do.  (Yes, I know that direct use should not be made of this advantage, but I believe that it does confer an indirect advantage).  Second, TREC systems don&#8217;t have the knowledge of which methods have actually worked on a particular collection, which subsequent systems do have.</p>
<p>All of that said, I&#8217;m prepared to grant that the best TREC system may not always be an appropriate baseline.  But if so, shouldn&#8217;t the best (or at least an improved) previously published method be the correct baseline choice?  The fact that neither baselines nor &#8220;improved&#8221; scores trend up over time indicates that this is not the experimental approach that has been adopted &#8212; and that whatever approach it is that has been adopted, that approach is not leading to an improvement in effectiveness.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: william</title>
		<link>http://blog.codalism.com/?p=1029&#038;cpage=1#comment-3092</link>
		<dc:creator>william</dc:creator>
		<pubDate>Tue, 29 Sep 2009 02:35:00 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1029#comment-3092</guid>
		<description>On &lt;del&gt;Ferdinand&#039;s&lt;/del&gt; Fernando&#039;s point 3, the use of EvaluatIR:

We certainly take the results observed here as an argument for using EvaluatIR or a similar service that centrally records and reports retrieval results.  We face the practical issue now of whether and how to incorporate results reported in previous publications, for which we don&#039;t have per-topic scores, let alone the runs themselves.

For the TREC runs, no, it is not possible to download them from EvaluatIR; they can only be obtained from NIST based upon signing an agreement regarding their use.  We have also, at NIST&#039;s request, disabled serving out per-topic scores for TREC systems on EvaluatIR, although calculations based on these scores (such as significance tests) are available, and per-topic scores are still shown on graphs.  The issue is with permissions of use; we&#039;d have to individually ask the groups that submitted to TREC whether we can make their runs available, and we haven&#039;t gotten around to doing that.

That said, we do use EvaluatIR ourselves extensively simply as a convenient way of looking up the (per-collection) scores achieved on different collections by TREC systems.</description>
		<content:encoded><![CDATA[<p>On <del>Ferdinand&#8217;s</del> Fernando&#8217;s point 3, the use of EvaluatIR:</p>
<p>We certainly take the results observed here as an argument for using EvaluatIR or a similar service that centrally records and reports retrieval results.  We face the practical issue now of whether and how to incorporate results reported in previous publications, for which we don&#8217;t have per-topic scores, let alone the runs themselves.</p>
<p>For the TREC runs, no, it is not possible to download them from EvaluatIR; they can only be obtained from NIST based upon signing an agreement regarding their use.  We have also, at NIST&#8217;s request, disabled serving out per-topic scores for TREC systems on EvaluatIR, although calculations based on these scores (such as significance tests) are available, and per-topic scores are still shown on graphs.  The issue is with permissions of use; we&#8217;d have to individually ask the groups that submitted to TREC whether we can make their runs available, and we haven&#8217;t gotten around to doing that.</p>
<p>That said, we do use EvaluatIR ourselves extensively simply as a convenient way of looking up the (per-collection) scores achieved on different collections by TREC systems.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: william</title>
		<link>http://blog.codalism.com/?p=1029&#038;cpage=1#comment-3090</link>
		<dc:creator>william</dc:creator>
		<pubDate>Tue, 29 Sep 2009 02:18:00 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1029#comment-3090</guid>
		<description>On &lt;del&gt;Ferdinand&#039;s&lt;/del&gt; Fernando&#039;s point 1, about variants on the standard TREC test collections:

Yes, we were careful to note when researchers were using variant collections, and we recorded them separately.  One of the amusing second-order findings of the survey was the enormous number of different variant collections that were in fact used.  In the 106 papers that met our survey criteria, 83 different variant TREC collections were used -- almost as many collections as papers!  We discuss this in the second-last paragraph of Section 3; see also Figure 2.  Most of the variants were slicing and dicing the TIPSTER corpus and associated topic sets; in some cases, to increase the number of topics (which required excluding parts of the document corpus that were not shared between topic sets); in other cases, to reduce the document corpus size to make computationally intensive methods practical.</description>
		<content:encoded><![CDATA[<p>On <del>Ferdinand&#8217;s</del> Fernando&#8217;s point 1, about variants on the standard TREC test collections:</p>
<p>Yes, we were careful to note when researchers were using variant collections, and we recorded them separately.  One of the amusing second-order findings of the survey was the enormous number of different variant collections that were in fact used.  In the 106 papers that met our survey criteria, 83 different variant TREC collections were used &#8212; almost as many collections as papers!  We discuss this in the second-last paragraph of Section 3; see also Figure 2.  Most of the variants were slicing and dicing the TIPSTER corpus and associated topic sets; in some cases, to increase the number of topics (which required excluding parts of the document corpus that were not shared between topic sets); in other cases, to reduce the document corpus size to make computationally intensive methods practical.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: william</title>
		<link>http://blog.codalism.com/?p=1029&#038;cpage=1#comment-3089</link>
		<dc:creator>william</dc:creator>
		<pubDate>Tue, 29 Sep 2009 02:10:16 +0000</pubDate>
		<guid isPermaLink="false">http://blog.codalism.com/?p=1029#comment-3089</guid>
		<description>Thanks to all for the comments, which are much appreciated; with the sort of meta-analysis that we are attempting in this paper, understanding how others both interpret the analysis from above, and perceive the individual inputs from below, is very valuable.  I&#039;ll respond in parts below.</description>
		<content:encoded><![CDATA[<p>Thanks to all for the comments, which are much appreciated; with the sort of meta-analysis that we are attempting in this paper, understanding how others both interpret the analysis from above, and perceive the individual inputs from below, is very valuable.  I&#8217;ll respond in parts below.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
