# Improving Knowledge-Based Weakly Supervised Information Extraction


## Project Idea

Our first idea was to better populate Wikipedia infoboxes using semi-supervised techniques. After looking more closely at that problem, we found that it was somewhat ill-posed and that producing a baseline would be difficult, because recent systems that address it do not have freely available code.

So instead we switched to a related problem, that of using a knowledge base (like Freebase, or DBpedia) to weakly supervise an information extractor. The system we will start from is that of Hoffmann et al., ACL 2011. They were generous enough to provide both code and data, giving us a dataset and a baseline.

## Task

We will address the problem of relation extraction, using a set of known relations between entities in a knowledge base as our only supervision (in addition to a parser and a named-entity extractor). Specifically, given a corpus of text and a set of relations $R$, return a set of entities $E$ in the text along with instances $r(e_1, e_2)$ of those relations between entities. (The entities could also be inputs instead of outputs, or some entities could be given as input with additional entities produced as output.)
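To make the inputs and outputs concrete, here is a minimal sketch of one common way to turn knowledge-base facts into weak labels: every sentence that mentions a pair of entities is grouped under that pair, and the pair inherits whatever relations the knowledge base lists for it. This is an illustrative assumption, not a description of the baseline's actual pipeline. The function name `weak_labels` and the tuple layouts are hypothetical, and the entity lists are assumed to come from a named-entity extractor.

```python
from collections import defaultdict
from itertools import permutations


def weak_labels(kb_facts, sentences):
    """Group sentences by ordered entity pair and attach KB relations as weak labels.

    kb_facts:  iterable of (relation, e1, e2) tuples (hypothetical layout)
    sentences: iterable of (text, entities), where entities is a list of
               entity names assumed to come from a named-entity extractor
    Returns a dict mapping (e1, e2) -> (list of sentence texts, set of relations).
    """
    # Index the knowledge base by ordered entity pair.
    relations_by_pair = defaultdict(set)
    for rel, e1, e2 in kb_facts:
        relations_by_pair[(e1, e2)].add(rel)

    # Collect every sentence that mentions both entities of a pair.
    mentions_by_pair = defaultdict(list)
    for text, entities in sentences:
        unique_entities = list(dict.fromkeys(entities))  # dedupe, keep order
        for e1, e2 in permutations(unique_entities, 2):
            mentions_by_pair[(e1, e2)].append(text)

    # Pairs with no KB fact get an empty relation set (candidate negatives).
    return {
        pair: (texts, relations_by_pair.get(pair, set()))
        for pair, texts in mentions_by_pair.items()
    }


kb = [("CEOof", "Steve Jobs", "Apple"), ("Founder", "Steve Jobs", "Apple")]
corpus = [("Steve Jobs, the founder and CEO of Apple, Inc., ...",
           ["Steve Jobs", "Apple"])]
labels = weak_labels(kb, corpus)
```

Note that the labels attach to the entity pair, not to any individual sentence; deciding which sentence expresses which relation is left to the extractor.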

## Dataset

We will use as our corpus of text the New York Times Annotated Corpus, as provided by LDC. We will use Freebase as a knowledge source to provide weak supervision to our system. Details of the dataset are described in Hoffmann et al., ACL 2011.

## Baseline

We will use the system of Hoffmann et al., ACL 2011 as our baseline. Their method is described in the linked page, so we will not repeat the description here. We have their code, so we can reproduce the results in their paper exactly.

## Our Big Idea

A key motivation for Hoffmann et al. was that their method was simpler than previous systems and that it allowed multiple relations between a pair of entities (e.g., both CEOof(Jobs, Apple) and Founder(Jobs, Apple), where previous systems would have to pick one or the other). Our idea is that Hoffmann et al. did not simplify the model enough, and that they still do not allow multiple relations between a pair of entities within a single sentence. We go further than Hoffmann et al. on both points. For instance, the sentence "Steve Jobs, the founder and CEO of Apple, Inc., ..." clearly contains two relations between Apple and Steve Jobs, but the system of Hoffmann et al. would have to pick one or the other for that sentence.

Such sentences are probably not very common, so a model that improved only on that point would not be especially interesting. The reason we think this is an interesting experiment is that it should also simplify inference and so decrease training and test time: we will essentially be turning a multi-class classifier into a set of independent binary classifiers, which removes some of the coupling in inference.
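The difference between the two decision rules can be sketched on a toy example. The code below is only an illustration under stated assumptions, not the model of Hoffmann et al. or our final system: the relation scores are made up, and the function names `multiclass_predict` and `binary_predict` are hypothetical.

```python
import math


def multiclass_predict(scores):
    """Multi-class rule: pick exactly one relation per sentence (the argmax)."""
    return {max(scores, key=scores.get)}


def binary_predict(scores, threshold=0.5):
    """Independent binary rule: accept every relation whose sigmoid-squashed
    score clears the threshold, with no coupling between relations."""
    return {
        rel for rel, score in scores.items()
        if 1.0 / (1.0 + math.exp(-score)) >= threshold
    }


# Made-up per-relation scores for the sentence
# "Steve Jobs, the founder and CEO of Apple, Inc., ..."
scores = {"CEOof": 2.1, "Founder": 1.7, "BornIn": -3.0}

one_relation = multiclass_predict(scores)
several_relations = binary_predict(scores)
```

With these scores the multi-class rule keeps only `CEOof`, while the binary rule keeps both `CEOof` and `Founder`. Because each binary decision depends only on its own score, the decisions can be made (and trained) separately.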

## References

[1] D. Lange, C. Böhm, F. Naumann, "Extracting Structured Information from Wikipedia Articles to Populate Infoboxes", CIKM, Oct 2010.

## Comments from William

This is a nice problem. Semi-supervised learning won't be covered until later in the class, though, so you guys will have to be proactive about finding the appropriate papers for this. One nice paper that might get you started is: http://dl.acm.org/citation.cfm?id=1870675

You guys should also look into the Wu and Weld papers on Infobox extraction, which are quite nice.

--Wcohen 20:59, 22 September 2011 (UTC)

## More Comments from William

What baseline method and dataset are you using? --Wcohen 14:37, 11 October 2011 (UTC)