Improving Knowledge-Based Weakly Supervised Information Extraction

From Cohen Courses
Revision as of 15:59, 22 September 2011 by Wcohen (talk | contribs) (→‎References)

Team Members

Project Idea

Many Wikipedia pages have an "infobox" that summarizes facts about the described subject concisely as attribute-value pairs. These infoboxes contain structured information and can be useful for many applications. Infoboxes are generated with templates, and there are different templates for different types of pages, such as "person", "company", "book", etc., each with a different set of attributes. Unfortunately, not all infoboxes have complete information about the subject being described. For example, a page about a music album may have the "artist" attribute but lack the "published year" attribute.

There is existing work on filling in missing infobox attributes from the unstructured text of Wikipedia articles. For example, iPopulator [1], which is based on conditional random fields, achieves a precision of 91% and a recall of 66%. In that paper, the authors used 3-fold cross-validation, i.e. in each fold two thirds of the data are used for training and the remaining third for evaluation.
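To make the evaluation terms concrete, here is a minimal sketch of how precision, recall, and a 3-fold split could be computed over extracted attribute values. It assumes exact matching of (page, attribute, value) triples, which may differ from the matching criteria used in [1]; the function names and the round-robin fold assignment are illustrative, not taken from that paper.

```python
def precision_recall(predicted, gold):
    """Compare two sets of (page, attribute, value) triples by exact match.

    Exact matching is an assumption of this sketch.
    """
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


def three_fold_splits(items):
    """Yield (train, test) pairs; each of the three folds is the test set once."""
    folds = [items[i::3] for i in range(3)]  # round-robin fold assignment
    for i in range(3):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

With this setup, each test fold holds roughly one third of the pages and the corresponding training set holds the other two thirds, matching the description above.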

In our project, we would like to see whether comparable performance can be achieved with less labeled training data. We will employ semi-supervised training, i.e. training with a small amount of labeled data (pages with infoboxes) and a large amount of unlabeled data (pages without infoboxes). The system will iteratively generate infoboxes for the unlabeled pages and add the pages whose generated infoboxes have high confidence to the labeled training set.
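The iterative procedure described above is commonly called self-training. The following is a minimal sketch of such a loop, not the project's actual system: the toy token-voting model stands in for a real extractor such as a CRF, and the names `train`, `predict`, `self_train`, the `threshold` value, and the `max_rounds` limit are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train(labeled):
    """Toy model: count how often each token co-occurs with each label."""
    counts = defaultdict(Counter)
    for tokens, label in labeled:
        for tok in tokens:
            counts[tok][label] += 1
    return counts


def predict(model, tokens):
    """Return (label, confidence); confidence is the winning label's vote share."""
    votes = Counter()
    for tok in tokens:
        votes.update(model.get(tok, {}))
    if not votes:
        return None, 0.0
    label, top = votes.most_common(1)[0]
    return label, top / sum(votes.values())


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=10):
    """Repeatedly retrain, then move confidently labeled examples into the training set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        confident, rest = [], []
        for tokens in pool:
            label, conf = predict(model, tokens)
            if label is not None and conf >= threshold:
                confident.append((tokens, label))
            else:
                rest.append(tokens)
        if not confident:  # nothing new to add, so stop early
            break
        labeled.extend(confident)
        pool = rest
    return train(labeled), labeled, pool
```

A known risk of this scheme is that an early wrong label, once added, reinforces itself in later rounds, so the confidence threshold trades coverage against error propagation.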

References

[1] D. Lange, C. Böhm, F. Naumann, "Extracting Structured Information from Wikipedia Articles to Populate Infoboxes", CIKM, Oct 2010.

Comments from William

This is a nice problem. Semi-supervised learning won't be covered until later in the class, though, so you guys will have to be proactive about finding the appropriate papers for this. One nice paper that might get you started is: http://dl.acm.org/citation.cfm?id=1870675

You guys should also look into the Wu and Weld papers on Infobox extraction, which are quite nice.

--Wcohen 20:59, 22 September 2011 (UTC)