Open Information Extraction
Problem
Most existing Information Extraction (IE) systems use supervised learning on relation-specific data. Because the relations are predetermined, this constraint limits most existing IE systems to a specific domain. Open Information Extraction does not predetermine relations. The goal of an Open IE system is to scale to the diversity and size of very large corpora such as the Web, given little or no training data.
- Input: text corpus
- Output: a set of relations in triple form <arg1, relation, arg2>
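To make the output format concrete, here is a minimal sketch of the triple as a data structure, together with a toy decoder that turns a per-token label sequence into triples. The labels ARG, REL and O, and the names Triple and decode_triples, are assumptions made for this illustration. They are not taken from any of the systems discussed on this page.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    """One extraction in the form <arg1, relation, arg2>."""
    arg1: str
    relation: str
    arg2: str


def decode_triples(labeled: List[Tuple[str, str]]) -> List[Triple]:
    """Turn (token, label) pairs into triples.

    Hypothetical label scheme: ARG = argument token, REL = relation token,
    O = token outside any argument or relation.
    """
    # Group consecutive tokens that share a label into chunks.
    chunks: List[Tuple[str, List[str]]] = []
    for token, label in labeled:
        if chunks and chunks[-1][0] == label:
            chunks[-1][1].append(token)
        else:
            chunks.append((label, [token]))

    # Ignore "outside" chunks, then look for ARG REL ARG sequences.
    chunks = [c for c in chunks if c[0] != "O"]
    triples = []
    for i in range(len(chunks) - 2):
        (l1, w1), (l2, w2), (l3, w3) = chunks[i:i + 3]
        if (l1, l2, l3) == ("ARG", "REL", "ARG"):
            triples.append(Triple(" ".join(w1), " ".join(w2), " ".join(w3)))
    return triples


labeled_sentence = [
    ("Edison", "ARG"),
    ("invented", "REL"),
    ("the", "O"),
    ("phonograph", "ARG"),
]
extracted = decode_triples(labeled_sentence)
print(extracted)
```

In this sketch the relation string comes from the sentence itself rather than from a fixed inventory, which is what lets the output cover relations that were not specified in advance.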
History
Open IE has a relatively short history. The first scalable Open IE system, TextRunner (Banko_et_al_IJCAI_2007), was proposed in 2007 by Oren Etzioni's group at the University of Washington. The same group later published several follow-up studies that improved the system. More recently, Tom Mitchell's group at Carnegie Mellon University proposed another Open IE system, NELL (Carlson et al., AAAI 2010).
State of the Art
Two systems were proposed in early 2010: WOE (Wu_and_Weld_ACL2010) and NELL. WOE builds on TextRunner. It extracts training data from Wikipedia pages and trains two extractors with different feature sets: i) a simple classifier using dependency path features, and ii) a CRF with the same features as TextRunner. The first extractor is 30 times slower than the second, and the improvement of the second extractor over the original TextRunner is mainly due to the training data extracted from Wikipedia pages.
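As an illustration of what a dependency path feature can look like, the sketch below computes the path between two argument tokens in a toy dependency parse. The parse encoding (a list of head indices with -1 for the root), the ":up" and ":down" markers, and the function names are assumptions made for this example. WOE's actual feature representation may differ.

```python
from typing import List


def path_to_root(heads: List[int], i: int) -> List[int]:
    """Return token indices from token i up to the root (head == -1)."""
    path = [i]
    while heads[i] != -1:
        i = heads[i]
        path.append(i)
    return path


def dependency_path(heads: List[int], labels: List[str], a: int, b: int) -> str:
    """Describe the dependency path from token a to token b as a string.

    labels[i] is the label of the arc from token i to its head.
    """
    up = path_to_root(heads, a)
    down = path_to_root(heads, b)
    # Lowest common ancestor: first node above a that is also above b.
    common = next(n for n in up if n in down)
    up_part = [labels[n] + ":up" for n in up[:up.index(common)]]
    down_part = [labels[n] + ":down" for n in reversed(down[:down.index(common)])]
    return " ".join(up_part + down_part)


# Toy parse of "Edison invented phonograph" (hand-written, not parser output).
tokens = ["Edison", "invented", "phonograph"]
heads = [1, -1, 1]
labels = ["nsubj", "root", "dobj"]

feature = dependency_path(heads, labels, 0, 2)
print(feature)  # nsubj:up dobj:down
```

A string like this can be fed to a classifier as a single feature. Producing it requires a full dependency parse of each sentence, which is one plausible reason a parser-based extractor would run more slowly than one that uses only shallow features.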
NELL is another state-of-the-art system. It uses an Expectation Maximization-like framework that starts with a handful of training samples for each relation. In the expectation step, beliefs are assigned using the knowledge base, and in the maximization step the learners use the updated knowledge base to find new beliefs. The key idea of NELL is its coupled learning framework, which learns multiple functions at the same time. Coupling the learners in this way makes the data less likely to become noisier.