Sgardine writeup of Bunescu 2007

From Cohen Courses
Revision as of 10:42, 3 September 2010 by WikiAdmin

This is a review of Bunescu_2007_learning_to_extract_relations_from_the_web_using_minimal_supervision by user:Sgardine.

Summary

Given a set of known named entities, their memberships in a relation R, and a large corpus (e.g. the web) of sentences mentioning those entities, we wish to learn a model that decides whether a given sentence involving entities e1 and e2 does or does not imply that R(e1,e2).

Multiple Instance Learning (MIL) addresses exactly this setting: instances in negative bags are assumed to be truly negative, but not every instance in a positive bag is in fact positive. Standard MIL assumes a large number of small bags; here we have the opposite (a few large bags, one per entity pair). Following previous work, we therefore simply train as if all instances in positive bags were positive.

The subsequence kernel from Bunescu_2006_subsequence_kernels_for_relation_extraction is used, with a modification for stopwords and punctuation. The kernel is further modified to downweight words that co-occur with the entities themselves rather than with the relation per se. The resulting model outperforms a bag-of-words baseline; it performs best with the word-weighted kernel and is competitive with a model given full access to (expensive) positive/negative labels at the sentence level.
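To make the kernel idea concrete, below is a minimal sketch of a gap-weighted subsequence kernel over word sequences, using the classic string-kernel dynamic program (Lodhi et al. style) that this family of relation kernels builds on. This is not the paper's exact kernel (which operates on structured before/between/after contexts around the entity pair); the `weight` argument is a hypothetical per-word multiplier standing in for the paper's idea of downweighting words that co-occur with the entities rather than with the relation.

```python
def subseq_kernel(s, t, n=2, lam=0.5, weight=None):
    """Count common (possibly non-contiguous) word subsequences of length n
    shared by word lists s and t, penalizing each gap position by lam.
    `weight` maps a word to a multiplier in [0, 1]; unlisted words get 1.0.
    Downweighting a word shrinks every subsequence match containing it."""
    wt = (weight or {}).get
    ls, lt = len(s), len(t)
    # Kp[a][b] = K'_i(s[:a], t[:b]); K'_0 is identically 1.
    Kp = [[1.0] * (lt + 1) for _ in range(ls + 1)]
    for _ in range(1, n):  # build K'_1 ... K'_{n-1}
        new = [[0.0] * (lt + 1) for _ in range(ls + 1)]
        for a in range(1, ls + 1):
            for b in range(1, lt + 1):
                # K'_i(sx, t) = lam * K'_i(s, t) + matching-suffix terms
                new[a][b] = lam * new[a - 1][b]
                for j in range(1, b + 1):
                    if t[j - 1] == s[a - 1]:
                        m = wt(s[a - 1], 1.0) * wt(t[j - 1], 1.0)
                        new[a][b] += Kp[a - 1][j - 1] * lam ** (b - j + 2) * m
        Kp = new
    # Complete length-n subsequences ending at a matching word pair.
    k = 0.0
    for a in range(1, ls + 1):
        for j in range(1, lt + 1):
            if t[j - 1] == s[a - 1]:
                k += Kp[a - 1][j - 1] * lam ** 2 * wt(s[a - 1], 1.0) ** 2
    return k

# With lam=1.0 this reduces to counting occurrence pairs of common
# length-n subsequences, which makes small cases easy to check by hand:
subseq_kernel(["a", "x", "b"], ["a", "b"], n=2, lam=1.0)  # -> 1.0 ("a b")
subseq_kernel(["a", "b"], ["a", "b"], n=2, lam=1.0, weight={"a": 0.0})  # -> 0.0
```

A precomputed Gram matrix of such kernel values can then be fed to any kernel classifier (e.g. an SVM with a precomputed kernel), training on the MIL-relaxed labels, i.e. every sentence from a positive bag labeled positive.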

Commentary

I like the idea of applying ML to noisy data and characterizing how the noise affects the results.