Comments on: Does automatic text classification work for low-prevalence topics?
http://blog.codalism.com/index.php/does-automatic-text-classification-work-for-low-prevalence-topics/
William Webber's E-Discovery Consulting Blog

By: william, Sun, 09 Jun 2013 21:33:20 +0000

Ralph,

Hi! Yes, I think that the sort of human-directed search facilities that Inview offers are an important part of the solution to the low-prevalence case. The comparative experiment you ran on the EDRM dataset was an interesting exploration of that. What would have happened in that situation if you had been stuck with a pure-Borg approach, in which even the seed set has to be a simple random sample? Prevalence there was, what, one in a thousand? With some Borg approaches you'd need to sample 60,000-odd documents just to get your control set!
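(To make the arithmetic behind that figure explicit, here is a back-of-the-envelope sketch; the target of 60 relevant documents in the control set is my own illustrative assumption, not a figure prescribed by any particular protocol.)

    # Back-of-the-envelope sketch of the control-set arithmetic above.
    # The target of 60 relevant documents is an illustrative assumption.

    def expected_sample_size(prevalence: float, wanted_positives: int) -> int:
        """Expected number of randomly sampled documents needed for the sample
        to contain, on average, `wanted_positives` relevant documents."""
        return round(wanted_positives / prevalence)

    # At one-in-a-thousand prevalence, even a modest target of 60 relevant
    # documents implies sampling on the order of 60,000.
    print(expected_sample_size(prevalence=0.001, wanted_positives=60))  # 60000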

I'm interested too in your comment on the importance of the trainer knowing what documents do and do not make good training examples. In principle, the classifier should be able to manage this automatically -- but apparently not! For instance, if long documents _always_ cause problems, then the system should be pre-configured not to train on them. Or cross-validation experiments should be able to determine that certain classes of training documents are not helpful -- indeed, in principle it seems to me that the classifier should be able to note this directly, at least if given the correct features. Anyway, as always, interesting material for future research!
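As a concrete illustration of that cross-validation idea, here is a sketch only: the 10,000-character threshold, the scikit-learn model, and the binary 0/1 labels are all assumptions of mine, not a description of Inview or of any other product.

    # Compare cross-validated effectiveness when a suspect class of training
    # documents (here, very long ones) is included in versus excluded from
    # the training folds. Labels are assumed to be binary 0/1.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import StratifiedKFold

    def cv_f1(texts, labels, drop_long=False, max_chars=10_000, n_folds=5):
        """Cross-validated F1, optionally excluding very long documents from
        the training folds (never from the test folds)."""
        texts = np.asarray(texts, dtype=object)
        labels = np.asarray(labels)
        scores = []
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
        for train_idx, test_idx in skf.split(texts, labels):
            if drop_long:
                train_idx = [i for i in train_idx if len(texts[i]) <= max_chars]
            vec = TfidfVectorizer().fit(texts[train_idx])
            clf = LogisticRegression(max_iter=1000)
            clf.fit(vec.transform(texts[train_idx]), labels[train_idx])
            preds = clf.predict(vec.transform(texts[test_idx]))
            scores.append(f1_score(labels[test_idx], preds))
        return float(np.mean(scores))

    # If cv_f1(texts, labels, drop_long=True) consistently beats
    # cv_f1(texts, labels, drop_long=False), that is evidence that the long
    # documents are unhelpful training examples.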

By: Ralph Losey, Thu, 06 Jun 2013 17:53:45 +0000

In my experience the commercial software I have been using, Inview, works quite well at overcoming the difficulties you describe, including initial bias (a term I don't like because it runs contrary to the whole idea of the SME) and the low number of positive examples compared to the high number of irrelevant documents. It takes several rounds of training. Sometimes a round can be quite fast and involve only a few corrections. Other times you need to feed in many corrections.

To be successful at this you have to develop a knack for identifying the types of documents that may impede or further the active learning process. That is often case specific, but can also be general, at least for the software you are using. For instance, for the software I use I have learned to remove very large text files from the initial training. I have found they are too confusing to the classifier. I use other methods to search these files and/or add them in at the end of the training process.
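A minimal sketch of that pre-filtering step follows; the 1 MB cut-off and the document representation are illustrative assumptions only, not a description of how Inview itself behaves.

    # Set very large text files aside before the first training round, to be
    # searched separately or added back at the end. Threshold is illustrative.

    MAX_TRAINING_BYTES = 1_000_000

    def split_training_pool(documents):
        """Partition documents into (initial training pool, deferred large files)."""
        training_pool, deferred = [], []
        for doc in documents:  # each doc assumed to be a dict with a 'text' field
            if len(doc["text"].encode("utf-8")) > MAX_TRAINING_BYTES:
                deferred.append(doc)
            else:
                training_pool.append(doc)
        return training_pool, deferred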

There are other types of files that should be removed as well, and still others, usually case specific, that are especially helpful to train on. The whole process becomes more art than science, and has to do with learning the peculiarities of the software, both in general and on particular projects.

I guess what I am saying is that effective active machine learning is a two-way process requiring a searcher to pay close attention to the impact of their actions on the automated classifications.

I would be interested to hear your reaction to this, and that of other readers of your blog. Are others having the same or similar experiences? Has there been any research on this? Literature on the subject? I am a lawyer, not a scientist, so please do not assume I know the basic writings available in the field. Thanks.
