I enjoy other bloggers' link aggregations, so I've decided to do my own.

Ben Edelman demonstrates that Google biases its search results towards its own services, even when other, lower-ranked links get more click-throughs -- BenEdelman.

Strike one for pessimists! Ben Carterette and Ian Soboroff (SIGIR 2010; yes, six months old, but only just read) find that optimistic assessors do much more harm to evaluation reliability than pessimistic assessors do -- doi and free version.

The University of Melbourne logo appears on umbrellas, beanies, scarves, water bottles, shorts, and teddy bears, but you're strictly not allowed to put it on your PhD thesis, lest you cheapen the brand with scholarly associations (thanks to Alex Stivala for the tip).

A memorable term from Paul Kedrosky: data exhaust, "the unintended information we throw off in our daily activities".

Also from Paul: algorithmic search is failing, and curation is going to be the new search.

Daniel Lemire argues for the re-introduction of Athenian-style demarchy -- rule by randomness. (We still do juries this way, mind you...)

Really nice word association visualizations from Chris Harrison (thanks to Bob Carpenter).

The translation of Gelman and Hill's "Data Analysis Using Regression and Multilevel/Hierarchical Models" was rejected by the Chinese censor as too politically sensitive.

One Response to “Spokes”

  1. Just to clarify the point about Gelman and Hill's book -- it was the publisher who decided to pull the plug on the international edition, not the authorities. You can buy their book in China in English already!

    I love the term "data exhaust" -- thanks for linking it.

    As to the curation being the new search, I really think that for specialized applications, I can't see curation keeping up. But, a little curation goes a long way if you can leverage it with unsupervised data. That's the focus of a lot of machine learning research, so I'm probably just speaking as a sheep here.

    I see the use of more computer power for specialized searches getting more popular: things like semi-supervised bootstrapping (say, a kind of supervised relevance feedback), or even focused ontologies applied to subsets of documents, along with clustering or other approaches to improve result diversity and the ability to focus on relevant veins of information. I'm finding this latter problem my biggest bottleneck now. For instance, I'm looking for matrix software that's often written in C or Fortran, and I only really want the information on C. But many of those packages are ported from Fortran, so I can't just add [-fortran] to my query. My wife often looks for genes, but is only interested in the nematode version, not the mouse or mustard grass homolog. Some of the data is hand-faceted, but that's very error prone and only very partial.
