Reverted indexes considered further

Gene and Jeremy have disputed two assertions about index reversion made in my previous post. The first is that (1,10) normalization strips term occurrences of their IDF factor. The second is that the cutoff depth used in creating the reverted corpus sets an upper bound on reverted term DF.

Let me say immediately that my second assertion is wrong, and was due to confusion on my part: cutoff depth constrains reverted document length, not reverted DF; the latter depends (as Gene points out) on how many terms a document is highly retrievable by.

On the first point, I only concede partially. Gene observes (and Jeremy makes a similar point):

In general, basis queries that have a high probability/score for retrieving a particular docid are preferred. But if those basis queries retrieve a lot of other documents as well, the relative contribution of that score is lowered (due to a higher reverted document length).

The number of documents returned by a basis query, though, is limited to the cutoff. The IDF factor of very rare terms is retained, but there is no distinction in IDF between moderately rare (1000 < n < 10000) and frequent terms.

If we put together both the cutoff selection effect I mention in my post and the DF effect Gene and Jeremy describe above, we get a bipartite retention of the original DF factor, one whose behaviour depends on the cutoff depth n. Terms with a DF less than n have their DF factor carried through in the reverted document length, making the term seem less exclusively "about" any given document it retrieves; terms with a DF greater than n have their DF factor retained through a decreasing probability that they will be included at all in the reverted index for the documents they occur in.
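To make the two regimes concrete, here is a minimal Python sketch of reverted index construction. The names, data structures, and the flat TF scorer are my own illustrative assumptions, not the PL2 setup used in the paper:

```python
from collections import defaultdict

def build_reverted_index(inverted_index, score, cutoff):
    """Build a reverted index from an inverted one.

    Each term becomes a 'basis query'; its reverted document is the
    top-`cutoff` docids from running that one-term query.
    inverted_index: term -> {docid: tf}; score: (term, docid, tf) -> float.
    """
    reverted = {}
    for term, postings in inverted_index.items():
        ranked = sorted(postings, key=lambda d: score(term, d, postings[d]),
                        reverse=True)
        # Reverted document length is min(df, cutoff): the DF factor
        # survives below the cutoff and is capped above it.
        reverted[term] = ranked[:cutoff]
    return reverted

def reverted_df(reverted):
    """Reverted DF of a docid: how many basis queries retrieve it."""
    df = defaultdict(int)
    for result_list in reverted.values():
        for docid in result_list:
            df[docid] += 1
    return df

# A term with DF below the cutoff keeps its DF in the reverted doc length;
# a term with DF above the cutoff is truncated to the cutoff.
inv = {
    "rare":   {"d1": 3, "d2": 1},
    "common": {f"d{i}": 1 for i in range(1, 8)},   # DF 7 > cutoff 5
}
rev = build_reverted_index(inv, lambda t, d, tf: tf, cutoff=5)
assert len(rev["rare"]) == 2     # DF carried through
assert len(rev["common"]) == 5   # DF capped at cutoff
```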

Gene also elaborates on what IDF means in the reverted index. Reverted IDF, roughly speaking, penalizes longer documents in the selection of expansion terms; and, as Gene observes, long documents are likely to be less focused. This effect does indeed seem calculated to improve the effectiveness of the expansion, although whether it would lead directly to the use of more selective terms is less clear.

I'm going to start tying myself in knots if I go too much further with this. Clearly, feedback using a reverted index does favour more selective expansion terms, and empirical results show that this leads to more effective retrieval (as measured on the TREC AdHoc collections -- whether result diversity would be harmed is another question). I would, though, like to see a more mathematical analysis of what happens to term occurrence weights as they go through the reversion process. The goal should be to produce an expansion term selection and weighting scheme that could be applied (however inefficiently) without having to perform the manipulations of reversion. Perhaps a starting point would be to apply no cutoff to the retrieval used to create the reverted index -- that is, convert each term's full inverted list into a reverted document -- and consider what reversion does to scores in this case.
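As a first pass at that no-cutoff case, here is a sketch in which each term's full inverted list is the reverted document, so the reverted document length collapses to the original DF. The plain TF-IDF weighting is an illustrative stand-in, not a claim about what the reverted scores actually are under PL2:

```python
import math
from collections import defaultdict

def expansion_weights(inverted_index, feedback_docs):
    """Weight expansion terms by treating each term's FULL inverted list
    as a reverted document, i.e. no cutoff at all.

    The weighting is a simple TF-IDF stand-in: reverted doc length equals
    the original DF, and reverted IDF penalises docids retrieved by many
    terms.
    """
    n_terms = len(inverted_index)
    rdf = defaultdict(int)                 # reverted DF of each docid
    for postings in inverted_index.values():
        for docid in postings:
            rdf[docid] += 1
    weights = {}
    for term, postings in inverted_index.items():
        doclen = len(postings)             # == original DF, since no cutoff
        weights[term] = sum(
            (postings[d] / doclen) * math.log(n_terms / rdf[d])
            for d in feedback_docs if d in postings
        )
    return weights

inv = {
    "a": {"d1": 2, "d2": 1, "d3": 1},      # DF 3
    "b": {"d1": 1},                        # DF 1: more selective
    "c": {"d2": 1, "d3": 1},               # does not retrieve d1
}
w = expansion_weights(inv, {"d1"})
assert w["b"] > w["a"] > w["c"] == 0       # the selective term wins
```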

9 Responses to “Reverted indexes considered further”

  1. jeremy says:

    The IDF factor of very rare terms is retained, but there is no distinction in IDF between moderately rare (1000 < n < 10000) and frequent terms.

    True. But some research suggests that the value (for selecting relevant documents) of IDF is flat in both the rare and frequent tails. See this tech report from Warren Greiff, back in 1999, in particular Figure 2b:

    It's not like you lose DF information altogether.. you still know that particular terms are ">= 1000" terms. If anything, IDF maxes out (technically, "mins" out -- hits a floor when it really should go lower) by having this 1000 cutoff.. which is something that favors more common terms, would it not? It would give them IDF scores that are larger (well, technically, reverted index doclength values that are smaller) than they deserve. And yet those common terms are still not retrieved as much by the reverted queries.
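    To see the floor concretely: with a result-list cutoff, the DF signal carried in the reverted document length is min(df, cutoff), so the implied IDF bottoms out at the cutoff. (The numbers and the bare log formulation here are just illustrative arithmetic, not the paper's weighting.)

```python
import math

def idf(df, n_docs):
    return math.log(n_docs / df)

def reverted_implied_idf(df, n_docs, cutoff):
    """With a result-list cutoff, the reverted document length is
    min(df, cutoff), so the DF signal it carries floors at the cutoff:
    common terms look rarer than they are."""
    return math.log(n_docs / min(df, cutoff))

n = 1_000_000
assert reverted_implied_idf(50, n, 1000) == idf(50, n)      # rare: unchanged
assert reverted_implied_idf(5000, n, 1000) == idf(1000, n)  # common: hits floor
assert reverted_implied_idf(5000, n, 1000) > idf(5000, n)   # larger than deserved
```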

    Whether 1000 is the correct inflection point on Greiff's curve or not, the place where IDF truly starts to flatten out, I don't know. That's an empirical question, and one that is obviously collection-dependent as well. It's certainly worth testing.

  2. jeremy says:

    See also this paper, which won SIGIR best paper in 1998:

    In particular, see Figures 5 and 6, and the accompanying sections of text (around PDF page 5).

  3. jeremy says:

    And I do share your desire for the expression of these concepts in a more mathematically rigorous way.

    But one difficulty in doing so is that many of these concepts that we're batting around (IDF, DL, TF, etc.) are rough approximations of what is actually happening in a retrieval model.

    Remember, there are two main components to the reverted index: (1) The retrieval algorithm used to construct the reverted index (ranking function for the basis queries), and (2) the retrieval algorithm used to retrieve *from* (issue queries on) the reverted index.

    In the paper, for a number of reasons, we keep those two algorithms the same: PL2, which is a divergence from randomness model. But there is no reason why they had to be the same. One could use PL2 to construct the index, and BM25 to retrieve from the index. Or cosine similarity to construct the index, and language modeling to retrieve from the index. Depending on what retrieval models you use, at what stages, you're going to get different terms and weights.
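    As a sketch of that mix-and-match idea, here are the two stages as independent, swappable scorers. The two toy scorers (raw TF and a crude TF-IDF) are stand-ins for PL2, BM25, and so on, which of course carry more statistics than this:

```python
import math

def raw_tf(tf, df, n):
    """Toy stage scorer: raw term frequency."""
    return float(tf)

def toy_tfidf(tf, df, n):
    """Toy stage scorer: a crude smoothed TF-IDF."""
    return tf * math.log((n + 1) / (df + 1))

def build_reverted(inverted, scorer, cutoff):
    """Stage 1: rank each term's postings with `scorer`, truncate at cutoff."""
    n = len({d for p in inverted.values() for d in p})
    return {
        t: sorted(p, key=lambda d: scorer(p[d], len(p), n),
                  reverse=True)[:cutoff]
        for t, p in inverted.items()
    }

def query_reverted(reverted, docids, scorer):
    """Stage 2: score each term (reverted document) against a docid query,
    possibly with a different scorer than stage 1 used."""
    n = len(reverted)
    rdf = {}                                # reverted DF of each docid
    for lst in reverted.values():
        for d in lst:
            rdf[d] = rdf.get(d, 0) + 1
    return {
        term: sum(scorer(1, rdf[d], n) for d in docids if d in lst)
        for term, lst in reverted.items()
    }

inv = {"a": {"d1": 2, "d2": 1}, "b": {"d1": 1}, "c": {"d2": 1}}
rev = build_reverted(inv, raw_tf, cutoff=10)      # one model to build ...
scores = query_reverted(rev, ["d1"], toy_tfidf)   # ... another to query
assert scores["a"] > 0 and scores["b"] > 0 and scores["c"] == 0
```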

    (Aside: In fact, in an earlier, rejected version of the paper we did indeed do some of those experiments, in a relevance feedback context. We had to take them out, though, because the reviewers were almost unanimous in their disinterest in that particular avenue of research, and instead wanted to see pseudo-relevance feedback experiments. Hopefully in future work we'll put them back in.)

    But still, when you realize the possibility of a mixture of approaches.. a vector space model to construct the index, and a Dirichlet-smoothed language model to retrieve from the index, it becomes very difficult to imagine a closed-form mathematical expression for feedback term weights. You can't just insert vector space mathematics into a language model, and then solve for x. Or insert a divergence from randomness model into an Okapi Best Match model, and then solve for x. Can you?

    (Written text is difficult.. my "can you?" is not sarcasm. It's honest head-scratching.)

  4. jeremy says:

    BTW, I thoroughly enjoy this discussion. I wish there were more time for these sorts of things at the conference itself.

  5. william says:

    I'm impressed by your knowledge of the term weighting literature! The Greiff papers are interesting, and the effect of flat IDF impact at lower and higher DF values makes some intuitive sense. In any case, what the reversion is doing clearly works. I think we've teased out some suggestions as to why it works. It would be interesting to test some of them out at some stage, through a combination of thought experiment, mathematical analysis, and empirical results.

    Iadh (or was it Craig?) made a comment at CIKM that suggested you had fed your reversion code back into the Terrier trunk. Is this the case?

    I'm amused and a little disappointed by the reviewers' Pavlovian demand for you to try pseudo-relevance feedback. Someone should put together an IR reviewers' cheat-sheet of reflexive reactions to non-standard ideas.

  6. Craig says:

    Hi William & Jeremy,

    I don't think I did make that comment, but I would be interested in feeding any reverted indexing implementation back into Terrier trunk!

  7. jeremy says:

    Iadh and I discussed feeding it back into the trunk, in the future. But no, we haven't done so yet.

  8. jeremy says:

    I’m amused and a little disappointed by the reviewers’ Pavlovian demand for you to try pseudo-relevance feedback. Someone should put together an IR reviewers’ cheat-sheet of reflexive reactions to non-standard ideas.

    The thing is, ten years ago I don't think the reviewers would have reacted this way. I'm not saying they automatically would or wouldn't have liked the paper on its own merits. But the demand for PRF as a necessary evaluation metric.. that wouldn't have been there ten years ago.

    In a way, I think the reason for this is not unrelated to your other post from a few days ago on eDiscovery, especially the bit about it being, oh how did you say it exactly, "E-discovery, then, is like the mediated, formalized, “serious” information retrieval of the good old days writ large. And besides its attractions to the nostalgic, e-discovery is a very sympathetic field for investigations into retrieval evaluation."

    I think that's exactly right. And relevance feedback, as opposed to pseudo relevance feedback, is more what eDiscovery is all about. Users are actually willing to look past the first 1 or 3 or even 5 results. Users are willing to look at hundreds or thousands of results. And that's when relevance feedback becomes of utmost importance. Not PRF. Relevance feedback is what the formalized, serious, mediated, good old days retrieval was all about.

    I'm about ten years too late. Or right on time. Or something.

  9. [...] to terms given by traditional relevance feedback methods. After the conference, William Weber raised a point on why this is the case, resulting in a thought-provoking [...]
