Improving Knowledge-Based Weakly Supervised Information Extraction

 
== Team Members ==

* [[User:wpang|Wangshu Pang]]
* [[User:yunwang|Yun Wang]]
* [[User:Mg1|Matt Gardner]]
  
 
== Project Idea ==

In many Wikipedia pages there is an "infobox" that contains facts about the described subject, summarized concisely as attribute-value pairs. These infoboxes contain structured information and can be useful for many applications. Infoboxes are generated from templates, and there are different templates for different types of pages, such as "person", "company", or "book", each with a different set of attributes. Unfortunately, not all infoboxes have complete information about the subject being described. For example, a page about a music album may have the "artist" attribute but lack the "published year" attribute.

There is existing work that tries to fill in the missing attributes of infoboxes using the unstructured text of the Wikipedia articles. For example, iPopulator [1], which is based on conditional random fields, achieves a precision of 91% and a recall of 66%, evaluated with 3-fold cross validation (i.e., training on two thirds of the data and testing on the remaining third).

Our first idea was to better populate Wikipedia infoboxes with semi-supervised techniques: train with a small amount of labeled data (pages with infoboxes) and a large amount of unlabeled data (pages without infoboxes), iteratively generate infoboxes for the unlabeled pages, and add high-confidence predictions to the labeled training set, as sketched below.
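For concreteness, here is a minimal sketch of that self-training loop. Everything in it is a placeholder of ours, not part of any existing system: <code>fit</code> and <code>predict_with_confidence</code> stand in for whatever base extractor is used (e.g., a CRF as in iPopulator), and the 0.9 confidence threshold is an arbitrary assumption.

<syntaxhighlight lang="python">
def self_train(fit, predict_with_confidence, labeled, unlabeled,
               threshold=0.9, max_iterations=5):
    """Generic self-training loop: fit on labeled pages, then absorb
    high-confidence predictions on unlabeled pages into the training set.

    fit: maps a list of (page, infobox) pairs to a trained model
    predict_with_confidence: maps (model, page) to (infobox, score)
    """
    labeled = list(labeled)
    for _ in range(max_iterations):
        model = fit(labeled)
        remaining = []
        for page in unlabeled:
            infobox, score = predict_with_confidence(model, page)
            if score >= threshold:
                labeled.append((page, infobox))  # treat the prediction as gold
            else:
                remaining.append(page)
        if len(remaining) == len(unlabeled):
            break  # nothing new was confident enough; stop early
        unlabeled = remaining
    return fit(labeled)
</syntaxhighlight>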
After looking a little more closely at that problem, however, we discovered that it was somewhat ill-posed, and that producing a baseline would be difficult, as recent systems that address it do not have freely available code.

So instead we switched to a related problem: using a knowledge base (like Freebase or DBpedia) to weakly supervise an information extractor. The system we will start from is that of [[Hoffmann et al., ACL 2011]]. The authors were generous enough to provide both code and data, giving us a dataset and a baseline.
== Task ==

We will address the problem of [[Relation Extraction]], using a set of known relations between entities in a knowledge base as our only supervision (apart from a parser and a named-entity extractor). Specifically: given a corpus of text and a set of relations <math>R</math>, return a set of entities <math>E</math> found in the text, along with instances <math>r(e_1, e_2)</math> of those relations between entities. (Variants of the task take some or all of the entities as input rather than output.)
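To make the input/output contract concrete, here is a sketch of the task in Python; the types, names, and relation labels are ours, purely for illustration.

<syntaxhighlight lang="python">
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationInstance:
    relation: str  # a relation name r drawn from the known set R
    e1: str        # first entity argument
    e2: str        # second entity argument

# Input: a corpus of sentences and a relation inventory R.
corpus = ["Steve Jobs, the founder and CEO of Apple, Inc., ..."]
relations = {"Founder", "CEOof"}

# Desired output: entities mentioned in the text, plus relation
# instances r(e1, e2) holding between them.
entities = {"Steve Jobs", "Apple, Inc."}
instances = {
    RelationInstance("Founder", "Steve Jobs", "Apple, Inc."),
    RelationInstance("CEOof", "Steve Jobs", "Apple, Inc."),
}
</syntaxhighlight>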
== Dataset ==

Our text corpus will be the New York Times Annotated Corpus, as provided by the LDC. We will use Freebase as the knowledge source that provides weak supervision to our system. Details of the dataset are described in [[Hoffmann et al., ACL 2011]].
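The weak supervision is the usual distant-supervision heuristic: a sentence mentioning both arguments of a known Freebase fact is taken as a noisy positive example for that relation. Below is a minimal sketch under that assumption; naive substring matching stands in for the real pipeline's parser and named-entity extractor, and the example fact and sentence are made up.

<syntaxhighlight lang="python">
def distant_supervision_examples(sentences, kb_facts):
    """Match known facts r(e1, e2) against sentences that mention both
    entities, yielding noisy positive training examples."""
    examples = []
    for sentence in sentences:
        for relation, e1, e2 in kb_facts:
            # Naive substring matching; a real system would match
            # named-entity spans instead.
            if e1 in sentence and e2 in sentence:
                examples.append((sentence, relation, e1, e2))
    return examples

# One Freebase-style fact matched against one NYT-style sentence.
facts = [("Founder", "Steve Jobs", "Apple")]
sents = ["Steve Jobs founded Apple in a garage."]
print(distant_supervision_examples(sents, facts))
# [('Steve Jobs founded Apple in a garage.', 'Founder', 'Steve Jobs', 'Apple')]
</syntaxhighlight>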
== Baseline ==

We will use the system of [[Hoffmann et al., ACL 2011]] as our baseline. Their method is described on the linked page, so we will not repeat the description here. We have their code, so we can reproduce the results in their paper exactly.
== Our Big Idea ==

A key motivation for Hoffmann et al. was that their method was both simpler than previous systems and allowed multiple relations between a pair of entities (e.g., both CEOof(Jobs, Apple) and Founder(Jobs, Apple), where previous systems would have to pick one or the other). Our claim is that Hoffmann et al. did not simplify far enough: their model still does not allow multiple relations between a pair of entities within a single sentence, so we push further on both of their points. For instance, the sentence "Steve Jobs, the founder and CEO of Apple, Inc., ..." clearly expresses two relations between Steve Jobs and Apple, but the system of Hoffmann et al. would have to pick one or the other. Such sentences are probably not common enough for a model that only fixed this point to be very interesting on its own. What makes the experiment interesting to us is that the change should also simplify inference, decreasing training and test time: we will essentially turn a multi-class classifier into a set of independent binary classifiers, removing some of the coupling in the inference.
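To illustrate the intended change at inference time, consider hypothetical per-relation scores for the Jobs/Apple sentence above (all numbers and the threshold are made up). A single-label-per-sentence model must pick the argmax, while independent binary classifiers keep every relation that clears a threshold:

<syntaxhighlight lang="python">
# Hypothetical scores for the entity pair (Steve Jobs, Apple) in the
# sentence "Steve Jobs, the founder and CEO of Apple, Inc., ..."
scores = {"Founder": 0.81, "CEOof": 0.77, "Employer": 0.10}

# Single-label-per-sentence inference (the constraint we want to drop):
# exactly one relation survives, so CEOof is lost.
print("multi-class:", max(scores, key=scores.get))

# Independent binary classifiers, one per relation: every relation
# above the (assumed) threshold is extracted.
threshold = 0.5
print("binary:", sorted(r for r, s in scores.items() if s >= threshold))
</syntaxhighlight>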
  
 
== References ==

[1] D. Lange, C. Böhm, F. Naumann, "Extracting Structured Information from Wikipedia Articles to Populate Infoboxes", CIKM, Oct 2010.
  
== Comments from William ==

This is a nice problem. Semi-supervised learning won't be covered until later in the class, though, so you guys will have to be proactive about finding the appropriate papers for this. One nice paper that might get you started is: http://dl.acm.org/citation.cfm?id=1870675
  
 
You guys should also look into the Wu and Weld papers on Infobox extraction, which are quite nice.

--[[User:Wcohen|Wcohen]] 20:59, 22 September 2011 (UTC)
== More Comments from William ==

What baseline method and dataset are you using? --[[User:Wcohen|Wcohen]] 14:37, 11 October 2011 (UTC)
