Difference between revisions of "Open Information Extraction"

Revision as of 11:06, 30 October 2010

Problem

Most existing Information Extraction (IE) systems use supervised learning on relation-specific data. The relations are predetermined, and this constraint makes most existing IE systems domain-specific. Open Information Extraction does not predetermine relations. The goal of an Open IE system is to scale to the size and diversity of very large corpora such as the Web, given little or no training data.

Input: text corpus
Output: a set of relations in triple form <arg1, relation, arg2>
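
The output format can be illustrated with a minimal sketch. The names `Triple`, `format_triple`, and `collect`, and the example sentences behind the triples, are assumptions made for illustration; they are not taken from TextRunner, WOE, or NELL.

```python
from typing import Iterable, NamedTuple, Set


class Triple(NamedTuple):
    """One extraction in the form <arg1, relation, arg2> (hypothetical type)."""
    arg1: str
    relation: str
    arg2: str


def format_triple(t: Triple) -> str:
    """Render a triple in the angle-bracket notation used above."""
    return f"<{t.arg1}, {t.relation}, {t.arg2}>"


def collect(triples: Iterable[Triple]) -> Set[Triple]:
    """The output is a set, so repeated extractions collapse into one."""
    return set(triples)


# Made-up extractions; the relation strings are not drawn from a fixed schema.
extractions = [
    Triple("Edison", "invented", "the phonograph"),
    Triple("Edison", "invented", "the phonograph"),
    Triple("Seattle", "is located in", "Washington"),
]
unique = collect(extractions)
```

Because the relation is a free-text string rather than a label from a predefined list, the same structure can hold any relation found in the corpus.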

History

Open IE has a relatively short history. The first scalable Open IE system, TextRunner (Banko_et_al_IJCAI_2007), was proposed in 2007 by Oren Etzioni's group at the University of Washington. The same group later published several follow-up studies that improve the system. More recently, Tom Mitchell's group at Carnegie Mellon University proposed another Open IE system, NELL (Carlson et al., AAAI 2010).

State of the Art

Two systems were proposed in early 2010: WOE (Wu_and_Weld_ACL2010) and NELL. WOE builds on TextRunner. It extracts training data from Wikipedia pages and trains two extractors with two different sets of features: i) a simple classifier using dependency-path features, and ii) a CRF with the same features as TextRunner. The first extractor is 30 times slower than the second, and the improvement of the second extractor over the original TextRunner is mainly due to the training data extracted from Wikipedia pages.

NELL is another state-of-the-art system. It uses an Expectation-Maximization-like framework that starts with a handful of training samples for each relation. In the expectation step, beliefs are assigned using the knowledge base, and in the maximization step the learners use the updated knowledge base to find new beliefs. The key idea of NELL is its coupled learning framework, which learns multiple functions at the same time. This coupling makes the data less likely to become noisy as learning proceeds.
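
The coupling idea can be sketched with a toy bootstrapping loop. This is a simplified illustration under stated assumptions, not NELL's actual implementation: the corpus, the two learners, the `min_votes` agreement rule, and all function names are hypothetical. Each round, every learner proposes new beliefs from the current knowledge base, and a candidate is promoted only when enough learners agree on it.

```python
from collections import Counter

# Toy corpus of (arg1, pattern, arg2) mentions; entirely made up.
CORPUS = [
    ("Paris", "is the capital of", "France"),
    ("Lyon", "is a city in", "France"),
    ("Paris", "is a city in", "France"),
    ("Berlin", "is a city in", "Germany"),
    ("Pizza", "is popular in", "France"),
    ("Berlin", "is the capital of", "Germany"),
]


def pattern_learner(kb):
    """Propose pairs that occur with a pattern already seen with a known pair."""
    patterns = {p for a, p, b in CORPUS if (a, b) in kb}
    return {(a, b) for a, p, b in CORPUS if p in patterns}


def type_learner(kb):
    """Propose pairs whose second argument already appears in the knowledge base."""
    known_arg2 = {b for _, b in kb}
    return {(a, b) for a, _, b in CORPUS if b in known_arg2}


def bootstrap(seeds, learners, rounds=5, min_votes=2):
    """Grow a knowledge base, promoting only candidates that learners agree on."""
    kb = set(seeds)
    for _ in range(rounds):
        # Learners use the current knowledge base to find new candidate beliefs.
        votes = Counter()
        for learner in learners:
            for candidate in learner(kb) - kb:
                votes[candidate] += 1
        # Coupling (an assumed rule): keep only candidates with enough votes.
        promoted = {c for c, v in votes.items() if v >= min_votes}
        if not promoted:
            break
        kb |= promoted
    return kb


SEEDS = {("Paris", "France")}
coupled = bootstrap(SEEDS, [pattern_learner, type_learner])
uncoupled = bootstrap(SEEDS, [pattern_learner, type_learner], min_votes=1)
```

In this toy setting the coupled run adds only ("Lyon", "France"), whereas the run with `min_votes=1` also accepts the spurious pair ("Pizza", "France"). The coupled run is also more conservative: it leaves out ("Berlin", "Germany") because only one learner proposes it.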

@@ Line 10: / Line 10: @@
 == State of the Art ==
-Two systems were proposed in early 2010: WOE [[Wu_and_Weld_ACL2010]] and [http://rtw.ml.cmu.edu/rtw/ NELL]. WOE was developed based on TextRunner. The system extract training data from Wikipedia pages and train two extractors with two different set of features: i) a simple classifier using dependency paths features, and ii) a CRF with same features as in TextRunner. The first extractor is 30 times slower than the second extractor, and the improvement of second extractor over the original TextRunner was mainly due to the training data extracted Wikipedia pages.
+Two systems were proposed in early 2010: WOE ([[Wu_and_Weld_ACL2010]]) and [http://rtw.ml.cmu.edu/rtw/ NELL]. WOE was developed based on TextRunner. The system extract training data from Wikipedia pages and train two extractors with two different set of features: i) a simple classifier using dependency paths features, and ii) a CRF with same features as in TextRunner. The first extractor is 30 times slower than the second extractor, and the improvement of second extractor over the original TextRunner was mainly due to the training data extracted Wikipedia pages.
 [http://rtw.ml.cmu.edu/rtw/ NELL] is another state of the art system. The system uses a Expectation Maximization like framework which starts with a handful number of training samples for each relations. In the expectation step, beliefs receive some assignment with the knowledge base, and in the maximization step learners use the updated knowledge bases to find new beliefs. The key idea of NELL is the  coupled learning framework which learns multiple functions at the same time. This way, data will be less likely to get noisier.
