Banko 2007 Open Information Extraction from the Web

Citation

Banko, M., Cafarella, M., Soderland, S., Broadhead, M. and Etzioni, O. 2007. Open Information Extraction from the Web. In Proceedings of IJCAI.

Online version

An online version of this paper is available at [1].

Summary

This paper introduces a novel, domain-independent approach to extracting information from the web in an open-domain setting. It also presents TextRunner, a large-scale open information extraction system, together with its current results and statistics.

Key Contributions

The biggest contribution claimed by the authors is the new paradigm of Open Information Extraction, which requires no human input: the system makes a single pass through the data and generates a large set of relational tuples. Another contribution is the analysis of the TextRunner system and its current results.

The TextRunner System

The TextRunner system is designed as a fully automated open information extraction system. It takes a corpus as input and outputs a set of extractions that are efficiently indexed to support exploration via user queries. As described in the paper, the system consists of three parts: (1) a Self-Supervised Learner, which produces a classifier for judging the trustworthiness of candidate extractions; (2) a Single-Pass Extractor, which makes a single pass over the entire corpus to extract tuples for all possible relations, sends each candidate to the classifier, and retains the ones labeled as trustworthy; and (3) a Redundancy-Based Assessor, which assigns a probability to each retained tuple based on a probabilistic model of redundancy in text.
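To make the architecture concrete, the following minimal Python sketch (not from the paper) shows the kind of tuple that flows through the pipeline; the class and field names are assumptions made purely for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractionTuple:
    """Illustrative form of a TextRunner extraction (arg1, relation, arg2).

    Field names are assumptions for this sketch, not the authors' code.
    """
    arg1: str                  # first entity (a noun phrase)
    relation: str              # normalized relation phrase between the entities
    arg2: str                  # second entity (a noun phrase)
    probability: float = 0.0   # assigned later by the Redundancy-Based Assessor

# Example: a tuple the pipeline might produce from "Edison invented the light bulb."
t = ExtractionTuple(arg1="Edison", relation="invented", arg2="the light bulb")
print(t)
```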

[1] Self-Supervised Learner
As mentioned, the Learner operates in two steps. First, it automatically labels its own training data as positive or negative. Second, it uses this labeled data to train a Naive Bayes classifier, which is then used by the Extractor module. The term self-supervision refers to the fact that the Learner generates its own labels for its training data. As the authors point out, the parser is too expensive to run at web scale, so it is used only offline to help train the Extractor.
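Below is a rough sketch of the Learner's two steps, assuming scikit-learn's BernoulliNB as the Naive Bayes implementation; the hand-written labels and the handful of binary features are simplified stand-ins for the parser-based constraints and domain-independent features described in the paper.

```python
# Minimal sketch of the Self-Supervised Learner under simplifying assumptions:
# heuristic labels replace the paper's parser-based constraints, and the
# features are a toy subset of the domain-independent features it describes.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

def features(candidate):
    """candidate = (arg1, relation, arg2); returns a binary feature vector."""
    arg1, rel, arg2 = candidate
    return [
        len(rel.split()) <= 4,                              # relation phrase is short
        arg1[:1].isupper(),                                 # arg1 looks like a proper noun
        arg2[:1].isupper(),                                 # arg2 looks like a proper noun
        not any(w in {"of", "and"} for w in rel.split()),   # no obviously non-relational words
    ]

# Step 1: self-labeled training data (heuristic positives and negatives).
train = [
    (("Edison", "invented", "the light bulb"), 1),
    (("Paris", "is the capital of", "France"), 1),
    (("the", "and then some", "of"), 0),
    (("it", "which was however not really about", "them"), 0),
]
X = np.array([features(c) for c, _ in train], dtype=int)
y = np.array([label for _, label in train])

# Step 2: train the Naive Bayes classifier that the Extractor will later consult.
clf = BernoulliNB()
clf.fit(X, y)
print(clf.predict([features(("Einstein", "developed", "relativity"))]))  # expected: [1] (trustworthy)
```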

[2] Single-Pass Extractor
The Extractor makes a single pass over its corpus, performing POS tagging to automatically label each word in each sentence with its most probable part of speech. Using these tags, entities are found by identifying noun phrases with a lightweight chunker. Relations are found by examining the text between the noun phrases and heuristically eliminating non-essential phrases or individual tokens. Each candidate tuple is presented to the classifier, and those labeled trustworthy are extracted and stored by TextRunner.
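The sketch below illustrates this pass, assuming NLTK's POS tagger and a regular-expression chunker as stand-ins for TextRunner's lightweight tagger and noun-phrase chunker; the grammar and candidate-generation logic are deliberately simplified.

```python
# Rough sketch of the single extraction pass (requires the NLTK data packages
# 'punkt' and 'averaged_perceptron_tagger', or their newer equivalents).
import nltk

chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")   # toy noun-phrase grammar

def extract_candidates(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tagged)

    # Collect noun phrases; the tokens between consecutive noun phrases
    # become the candidate relation phrase.
    noun_phrases, between, candidates = [], [], []
    for node in tree:
        if isinstance(node, nltk.Tree) and node.label() == "NP":
            np_text = " ".join(tok for tok, _ in node.leaves())
            if noun_phrases and between:
                candidates.append((noun_phrases[-1], " ".join(between), np_text))
            noun_phrases.append(np_text)
            between = []
        else:
            between.append(node[0])
    return candidates   # each candidate would then be scored by the learned classifier

print(extract_candidates("Edison invented the light bulb in New Jersey."))
```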

[3] Redundancy-Based Assessor
TextRunner creates a normalized form of each relation that omits modifiers of verbs and nouns. After extraction has been performed over the entire corpus, TextRunner automatically merges tuples in which both the entities and the normalized relation are identical, and counts the number of distinct sentences from which each extraction was drawn. The Assessor then uses these counts to assign a probability to each tuple, using the probabilistic model previously applied to unsupervised IE in the KnowItAll system.
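A simplified sketch of this merge-and-count step is shown below; the noisy-or mapping from sentence counts to probabilities is only an illustrative stand-in for the KnowItAll redundancy model that the paper actually uses.

```python
# Toy Redundancy-Based Assessor: merge tuples whose entities and normalized
# relation match, count distinct supporting sentences, and map the count to a
# probability. The noisy-or formula is an illustrative simplification, not the
# probabilistic model from KnowItAll.
from collections import defaultdict

def assess(extractions, p_single=0.5):
    """extractions: iterable of ((arg1, normalized_relation, arg2), sentence_id)."""
    support = defaultdict(set)
    for tup, sentence_id in extractions:
        support[tup].add(sentence_id)        # distinct sentences per merged tuple

    # More independent supporting sentences -> higher probability of correctness.
    return {tup: 1 - (1 - p_single) ** len(sents) for tup, sents in support.items()}

extractions = [
    (("Edison", "invented", "light bulb"), "s1"),
    (("Edison", "invented", "light bulb"), "s2"),   # same fact, different sentence
    (("Paris", "capital of", "France"), "s3"),
]
print(assess(extractions))   # the twice-seen tuple gets the higher probability
```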

The TextRunner system also has a separate query-processing module for user interaction. It is essentially implemented as a relation-centric index, described in Cafarella et al., 2006.
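As a toy illustration of what "relation-centric" means here, the sketch below keys extracted tuples by their relation string so that queries reduce to fast lookups; the actual index described in Cafarella et al., 2006 is of course far more sophisticated.

```python
# Toy relation-centric index: tuples are grouped under their relation string,
# so a query such as "what did X invent?" becomes a dictionary lookup.
from collections import defaultdict

class RelationIndex:
    def __init__(self):
        self.by_relation = defaultdict(list)

    def add(self, arg1, relation, arg2):
        self.by_relation[relation].append((arg1, arg2))

    def query(self, relation):
        return self.by_relation.get(relation, [])

idx = RelationIndex()
idx.add("Edison", "invented", "light bulb")
idx.add("Bell", "invented", "telephone")
print(idx.query("invented"))   # -> [('Edison', 'light bulb'), ('Bell', 'telephone')]
```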

Experiments and Evaluation

The authors first compare the recall and error rate of TextRunner with those of a traditional IE system on a closed set of relations. They then turn to the results on open-domain IE and present global statistics on the learned facts.

For the comparison with traditional IE, the state-of-the-art KnowItAll system is chosen, and the experiments are run on the UW Web Corpus of 9 million web pages with 10 pre-selected relations. TextRunner achieves a 33% reduction in error rate while producing approximately the same number of extractions. Moreover, TextRunner is far more efficient than KnowItAll.

For open-domain IE, the authors focus on the two most important measures: the correctness of the facts and the number of distinct facts extracted. Their estimate shows that among 7.8 million well-formed tuples with probability ≥ 0.8, TextRunner finds 1 million concrete tuples whose arguments are grounded in particular real-world entities, 88.1% of which are correct, and 6.8 million tuples reflecting abstract assertions, 79.2% of which are correct.
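As a quick back-of-the-envelope check, combining the two reported subsets gives the implied overall number and fraction of correct tuples among the 7.8 million well-formed extractions; the snippet below only restates the paper's numbers.

```python
# Combine the paper's reported counts and precisions for the two tuple subsets.
concrete, concrete_precision = 1.0e6, 0.881   # tuples grounded in real-world entities
abstract, abstract_precision = 6.8e6, 0.792   # tuples reflecting abstract assertions

correct = concrete * concrete_precision + abstract * abstract_precision
total = concrete + abstract
print(f"{correct / 1e6:.2f}M correct out of {total / 1e6:.1f}M ({correct / total:.1%} overall)")
```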