Siddharth writeup of Bunescu and Mooney
This is a review of Bunescu_2007_learning_to_extract_relations_from_the_web_using_minimal_supervision by user:sgopal1.
The paper presents a relation extraction method that needs only a few positive and negative examples of a given relation. The authors motivate the problem as a multiple instance learning (MIL) problem and then reformulate it as a standard optimization problem. They use a subsequence kernel to define the similarity between two sentences (the kernel is defined over words rather than letters). They discuss two types of bias. The first is related to words co-occurring with the elements of the relation. The second is correlated with the relation itself. I wouldn't actually call these biases; maybe they are indicative of a relationship, or perhaps we simply didn't have a good seed list. The first type of bias is eliminated by decreasing the word weights based on linear correlation (what happens when the relation is not linear? See http://en.wikipedia.org/wiki/File:Anscombe.svg, where all the pictures have the same correlation). The second type of bias is eliminated by adding placeholders.
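To make the idea of a subsequence kernel over words concrete, here is a minimal sketch. It is a brute-force, gap-weighted subsequence kernel in the general style of string subsequence kernels, applied to word tokens; it is not the exact kernel from the paper. The function name `subsequence_kernel`, the subsequence length `n`, and the decay factor `lam` are my own illustrative choices.

```python
from itertools import combinations


def subsequence_kernel(s, t, n, lam=0.5):
    """Gap-weighted subsequence kernel over word tokens (brute force).

    s, t : lists of words (tokenized sentences)
    n    : length of the word subsequences to compare
    lam  : decay factor in (0, 1]; spread-out matches count less

    Every pair of length-n index tuples (one in s, one in t) that picks
    out the same word sequence contributes lam ** (span_s + span_t),
    where a span is the distance from the first to the last chosen
    index, inclusive. This enumeration is exponential in n, so it is
    only meant for short sentences and illustration.
    """
    total = 0.0
    for i in combinations(range(len(s)), n):
        words = tuple(s[k] for k in i)
        span_s = i[-1] - i[0] + 1
        for j in combinations(range(len(t)), n):
            if tuple(t[k] for k in j) == words:
                span_t = j[-1] - j[0] + 1
                total += lam ** (span_s + span_t)
    return total


if __name__ == "__main__":
    a = "E1 was born in E2".split()
    b = "E1 , who was born in sunny E2".split()
    c = "E1 visited E2".split()
    # Sentences sharing more (and tighter) word subsequences score higher.
    print(subsequence_kernel(a, b, n=3))
    print(subsequence_kernel(a, c, n=3))
```

Because the kernel works on whole words, two sentences are similar when they share ordered word patterns such as "was born in", even if other words intervene; the decay factor penalizes patterns that are stretched over long gaps.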
The use of Google seems atrocious (they should have changed the introduction to say "and present experimental results demonstrating that our approach can reliably extract relations from 'Google results'"). How can we be sure that the ranking techniques Google uses do or do not favor sentences in which the relation is present? Google could rank highly those pages whose main focus is the query string, in which case you would not get an unbiased sample of sentences. I'm not sure whether using Google is justified.