Improving Knowledge-Based Weakly Supervised Information Extraction

Revision as of 23:01, 12 September 2011

Team Members

* Wangshu Pang
* Yun Wang

Project Idea

Many Wikipedia pages have an "infobox" that summarizes facts about the described subject concisely as attribute-value pairs. These infoboxes contain structured information and can be useful for many applications. Infoboxes are generated from templates, and different types of pages, such as "person", "company", and "book", use different templates, each with its own set of attributes. Unfortunately, not all infoboxes contain complete information about their subject. For example, a page about a music album may have the "artist" attribute but lack the "published year" attribute.
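To make the notion of an incomplete infobox concrete, here is a minimal sketch that treats an infobox as a dictionary of attribute-value pairs and a template as a list of attribute names. The template contents and the helper `missing_attributes` are hypothetical illustrations, not Wikipedia's actual schema.

```python
# Hypothetical template: attribute names are illustrative only,
# not the real Wikipedia album infobox schema.
ALBUM_TEMPLATE = ["name", "artist", "published year", "genre"]


def missing_attributes(infobox, template):
    """Return the template attributes that the infobox leaves empty or omits."""
    return [attr for attr in template if not infobox.get(attr)]


# An album page that names the artist but not the publication year.
album_infobox = {"name": "Example Album", "artist": "Example Artist"}
print(missing_attributes(album_infobox, ALBUM_TEMPLATE))
```

The attributes returned by such a check are the ones an extraction system would try to fill in from the article text.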

Some prior work tries to fill in missing infobox attributes using the unstructured text of the Wikipedia articles. For example, iPopulator [1], which is based on conditional random fields, achieves a precision of 91% and a recall of 66%. The authors of that paper used 3-fold cross-validation, i.e. in each fold two thirds of the data are used for training and the remaining third for evaluation.
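The evaluation setup described above can be sketched roughly as follows. This is only an illustration of 3-fold cross-validation and of precision and recall over extracted facts; the function names and the representation of a fact as a (page, attribute, value) triple are assumptions, not details taken from [1].

```python
def three_fold_splits(items):
    """Yield (train, test) pairs; each fold is held out once while the other two train."""
    folds = [items[i::3] for i in range(3)]
    for k in range(3):
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, folds[k]


def precision_recall(extracted, gold):
    """Compare two sets of (page, attribute, value) triples.

    precision = correct extractions / all extractions
    recall    = correct extractions / all gold facts
    """
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

High precision with lower recall, as reported for iPopulator, means that most extracted values are correct but a substantial share of the true values is not extracted at all.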

In our project, we would like to see whether comparable performance can be achieved with less labeled training data. We will employ semi-supervised training, i.e. training with a small amount of labeled data (pages with infoboxes) and a large amount of unlabeled data (pages without infoboxes). The system will iteratively generate infoboxes for the unlabeled pages and add those pages whose generated infoboxes have high confidence to the labeled training set.
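The iterative procedure described above is commonly called self-training. A minimal sketch of such a loop is shown below. The confidence threshold, the round limit, and the `train` and `predict` callables are all assumptions made for illustration; the toy nearest-centroid model merely stands in for a real extractor (such as a CRF) so that the loop can run.

```python
def self_train(labeled, unlabeled, train, predict, threshold=0.9, max_rounds=10):
    """Self-training loop.

    labeled:   list of (page, infobox) pairs
    unlabeled: list of pages without infoboxes
    train:     function(labeled) -> model
    predict:   function(model, page) -> (infobox, confidence in [0, 1])
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    model = train(labeled)
    for _ in range(max_rounds):
        if not pool:
            break
        predictions = [(page, *predict(model, page)) for page in pool]
        confident = [(p, box) for p, box, conf in predictions if conf >= threshold]
        if not confident:
            break  # no prediction is confident enough to promote
        labeled.extend(confident)
        pool = [p for p, _, conf in predictions if conf < threshold]
        model = train(labeled)  # retrain on the enlarged labeled set
    return model, labeled


# Toy stand-ins (assumptions): a "page" is a number, an "infobox" is a label.
def toy_train(labeled):
    """Compute the mean page value per label."""
    sums = {}
    for x, label in labeled:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}


def toy_predict(model, x):
    """Pick the nearest centroid; confidence grows as the runner-up gets farther away."""
    ranked = sorted((abs(x - mean), label) for label, mean in model.items())
    nearest, second = ranked[0], ranked[1]
    total = nearest[0] + second[0]
    confidence = 1.0 - nearest[0] / total if total else 0.5
    return nearest[1], confidence


seed = [(0.0, "low"), (10.0, "high")]
model, final_labeled = self_train(seed, [1.0, 9.0, 4.0, 6.0],
                                  toy_train, toy_predict, threshold=0.8)
print(final_labeled)
```

In the toy run, only the pages close to a seed example are promoted; ambiguous pages stay unlabeled. The threshold controls the trade-off: a lower value grows the training set faster but risks reinforcing wrong labels.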

References

[1] D. Lange, C. Böhm, F. Naumann, "Extracting Structured Information from Wikipedia Articles to Populate Infoboxes", CIKM, Oct 2010.