One of the chief pleasures for me of this year’s SIGIR in Beijing was attending the SIGIR 2011 Information Retrieval for E-Discovery Workshop (SIRE 2011). The smaller and more selective the workshop, it often seems, the more focused and interesting the discussion.
How accurate can manual review be?
December 18th, 2011
Assessor disagreement and court sanctions
September 4th, 2011
I mentioned Cross and Kerksiek’s suggestion of vocabulary discovery in my previous post. Their paper also contains an interesting reference to a case (Felman Products, Inc. v. Industrial Risk Insurers) in which the defendant was penalized for the carelessness of their production. The defendant inadvertently produced privileged documents, and sought to have them ruled inadmissible. Two judges, the original and an appellate, ruled against the defendant, on the grounds that the defendant had not shown sufficient care in their production.
Corpus characterization in e-discovery
September 4th, 2011
In e-discovery (document retrieval for civil litigation), one side has the documents, the other side proposes the query. This creates an information asymmetry: the requesting side cannot view the corpus to decide what keywords to use and what queries to propose, and opportunities for query iteration are limited, expensive, and liable to be contested.
Outlinks
July 21st, 2011
Harvard researcher and open-access advocate, Aaron Swartz, faces 35 years’ jail for downloading 4.8 million articles from JSTOR. Still relaxed and comfortable about publishing in closed-access journals?
Correct spelling and grammar more important than positivity or negativity of product reviews — Panos Ipeirotis.
Fitting an elephant with four parameters.
Placebos as effective as real medicine in improving subjectively-measured asthma symptoms, but ineffective in improving objectively-measured symptoms — Science-based medicine.
The Spread of Evidence-Poor Medicine via Flawed Social-Network Analysis. You don’t need to worry about friends catching your obesity—but you might need to worry (even more) about being subjected to interventions based upon poor statistics and faulty peer reviewing.
Multiple significance tests in IR
July 15th, 2011
At the most recent TIGER reading group, Mark Sanderson presented Bland and Altman’s introduction to multiple significance tests and the Bonferroni method. The basic point is simple: if you keep trying different experiments, and testing each for significance, then eventually you will find significance by chance, even where no real effect exists. Therefore, if you are performing multiple significance tests, you need to adjust your p-values up.
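To make the adjustment concrete, here is a minimal sketch of the Bonferroni method in Python. The p-values are made up for illustration, and the function names are my own rather than anything from Bland and Altman's paper. Each p-value is multiplied by the number of tests performed (capped at 1), and a second helper shows how quickly the chance of at least one false positive grows if no correction is made, assuming the tests are independent.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def familywise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests,
    each run at level alpha, when no real effect exists."""
    return 1.0 - (1.0 - alpha) ** m


# Hypothetical p-values from four separate comparisons (illustration only).
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = bonferroni_adjust(raw)

# Compare the adjusted values against the usual 0.05 threshold.
significant = [p <= 0.05 for p in adjusted]

for r, a, s in zip(raw, adjusted, significant):
    print(f"raw p = {r:.2f}  adjusted p = {a:.2f}  significant: {s}")

# With 20 uncorrected tests at alpha = 0.05, a chance "finding" is more
# likely than not.
print(f"uncorrected familywise error rate, 20 tests: "
      f"{familywise_error_rate(0.05, 20):.2f}")
```

Three of the four raw values fall under 0.05, but only the first survives the correction. Bonferroni is conservative, particularly when the tests are correlated, so it is better read as a safe upper bound than as the only reasonable adjustment.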
PhD Thesis
June 12th, 2011
My PhD thesis was passed (actually a few months ago), and I’ve placed it online. The core of the research material has been published elsewhere, but there are a few updates:
Outlinks
May 16th, 2011
Using Khan academy videos, tests, and monitoring software in schools. Impressive, but is one man the best educationalist on every subject, from maths to history to literature?
Cartoonist illustrates stories after reading only the headlines. Amusing.
Google disbands their search group, forms a knowledge group instead. Should academic information retrieval change focus, too?
Science-Based Medicine is a great blog to follow on medical, statistical, and scientific questions.
Via Panos Ipeirotis, a study finding that paying a small amount produces worse performance than paying nothing at all.
How often should statistical significance occur?
May 8th, 2011
Via Andrew Gelman and Howard Wainer, an interesting meta-analysis from 2005 by Pan, Trikalinos, Kavvoura, Lau, and Ioannidis (the last of “Why most published research findings are false” fame), comparing reported statistical significance and effect sizes in studies of genetic propensity to disease, between studies performed in mainland China and those performed elsewhere. There is a widely-recognised publication bias in all scientific literature towards statistically significant results. In medicine, this frequently plays out as an initial publication that finds statistical significance and a strong effect for a treatment, gene-disease link, or other hypothesis, which is succeeded by follow-up studies that report weaker effects or fail to find significance. The pattern is understandable even if all published studies come from the same meta-population of performed studies: non-significant studies aren’t publishable until a (perhaps by chance) significant study legitimizes the gene-disease link.
Renewing ACM
March 24th, 2011
My ACM membership came due just recently. In the light of objections to their copyright policy, I seriously considered not renewing in protest. I agree with Panos that the ACM should not seek copyright from authors, unless the authors are actually being paid for their work. I’m particularly struck by Bob Carpenter’s experience as someone outside academia being asked to pay $20 per article for access to ACM papers—again, papers that the ACM paid no-one for, neither authors nor reviewers. I have recently been helping (as a return favour) a small local IT firm improve the text analysis features of one of their products, which has primarily involved finding and summarizing robust methods from the academic literature; having to pay for each article would have made the process not only expensive but also much more frustrating. One of the key roles of publicly-funded research is surely to make research findings readily accessible to independent entrepreneurs and innovators; placing fee-walls in front of research publications frustrates this aim.
Authorities
March 9th, 2011
“Semantic search log analysis” by Hollink, Tsikrika, and de Vries. The paper categorizes the semantic classes of query reformulations. For a professional image search service, the most common non-identity reformulation is to find the spouse of the first query’s result.
A call to boycott ACM and IEEE program committees and journal reviewing until they allow free distribution of final paper versions. I sympathize, but it’s selfish to boycott reviewing if you’re not boycotting submitting, and there are no IR venues of repute that allow free paper distribution.
Progress in e-discovery technology making lawyers redundant; and similar progress in linguistic technology to hollow out other professions. This report has gone viral, with Paul Krugman using it as the basis for an opinion piece on the shrinking employment utility of mass university education.
The Australian Academy of Science reviews the evidence for climate change. The “Trends in Australia Rainfall” graph (Figure 3.6) is particularly disheartening.
A laptop with a built-in eye tracker.
HTML5+CSS are Turing complete.
