Brin 1998 Extracting Patterns and Relations from the World Wide Web


Citation

Brin, S. 1998. Extracting Patterns and Relations from the World Wide Web. In Proceedings of the WebDB Workshop at EDBT '98.

Online version

An online version of this paper is available at [1].

Summary

This paper introduces a novel approach that iteratively extracts relations starting from a small seed set, exploiting the duality between patterns and relations. It inspired several later works on bootstrapping for large-scale information extraction.

Key Contributions

The paper presents a study of the duality between patterns and relations, with a pointer at the end to latent semantic indexing as mathematical background for further study. The author uses the example of extracting (author, book-title) pairs to demonstrate the applicability and scalability of the method.

Algorithms

  • Patterns, Relations, and Their Duality

The author first presents the problem of finding patterns for a target relation, then introduces the test problem of extracting (author, book-title) pairs and explains the patterns in detail. The author then observes a duality between the two: given a good set of patterns, we can build a good set of tuples for the target relation, and, conversely, given a good set of tuples, we can build a good set of patterns.
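To make the duality concrete, here is a minimal Python sketch (a reader's illustration, not code from the paper). The five pattern fields follow the paper's description of its (author, title) patterns, but representing them as a dataclass, the example page snippet, and the example URL are all assumptions of this summary.

from dataclasses import dataclass

# The two kinds of objects the duality relates: relation tuples and patterns.
@dataclass(frozen=True)
class Pattern:
    order: bool      # does the author appear before the title?
    urlprefix: str   # the pattern only fires on URLs starting with this prefix
    prefix: str      # text required immediately before the first field
    middle: str      # text required between author and title
    suffix: str      # text required immediately after the second field

# A relation tuple and a (hypothetical) pattern that would extract it from a
# page containing "<li><b>Isaac Asimov</b>, <i>Foundation</i></li>":
book = ("Isaac Asimov", "Foundation")
pattern = Pattern(order=True,
                  urlprefix="http://example.com/scifi/",
                  prefix="<li><b>", middle="</b>, <i>", suffix="</i></li>")

# The duality: scanning the corpus with good patterns yields more correct tuples,
# and finding the occurrences of good tuples (then generalizing their shared
# context) yields more correct patterns. DIPRE alternates these two directions.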

  • The DIPRE (Dual Iterative Pattern Relation Extraction) Algorithm

The algorithm extracts relations by exploiting the pattern-relation duality. The steps are as follows (a minimal Python sketch of the complete loop is given after step 5):

1. R' <- Sample

Start with a small sample, R', of the target relation. This sample is given by the user and can be very small. In the author/title tests, the author uses a list of five books together with their authors.

2. O <- FindOccurrences(R', D)

Then find all occurrences of the tuples of R' in D. Along with each tuple found, keep the context of every occurrence (the URL and the surrounding text).

3. P <- GenPatterns(O)

Generate patterns based on the set of occurrences. As the author notes, this routine must generate patterns for sets of occurrences with similar context. The patterns need to have a low error rate, so it is important that they are not overly general. The higher the coverage of the patterns, the better; however, low coverage can be compensated for with a larger database.

4. R' <- M_D(P)

Search the database for tuples matching any of the patterns.

5. If R' is large enough, return. Else go to step 2.
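For concreteness, the following is a self-contained Python sketch of the loop above, run on a tiny in-memory corpus. It is a toy approximation under simplifying assumptions, not the paper's implementation: only author-before-title order is handled, URL-prefix generalization is omitted, pattern generation merely groups occurrences by their middle string and generalizes the shared prefix/suffix, and the paper's specificity test is reduced to requiring non-empty context plus support from at least two distinct tuples. All page text, URLs, and parameter values are made up.

import os
import re
from collections import namedtuple

Occurrence = namedtuple("Occurrence", "author title url prefix middle suffix")
Pattern = namedtuple("Pattern", "urlprefix prefix middle suffix")

def find_occurrences(tuples, docs, context=10):
    """Step 2: find occurrences of known (author, title) pairs, keeping the URL
    and the surrounding text of each occurrence."""
    occurrences = []
    for url, text in docs.items():
        for author, title in tuples:
            pair_re = re.escape(author) + r"(.{1,40}?)" + re.escape(title)
            for m in re.finditer(pair_re, text):
                occurrences.append(Occurrence(
                    author, title, url,
                    prefix=text[max(0, m.start() - context):m.start()],
                    middle=m.group(1),
                    suffix=text[m.end():m.end() + context]))
    return occurrences

def gen_patterns(occurrences, min_support=2):
    """Step 3: group occurrences by their middle string, generalize the context
    shared by each group, and keep only patterns supported by at least
    `min_support` distinct tuples with non-empty context."""
    by_middle = {}
    for o in occurrences:
        by_middle.setdefault(o.middle, []).append(o)
    patterns = []
    for middle, group in by_middle.items():
        if len({(o.author, o.title) for o in group}) < min_support:
            continue
        # Longest common suffix of the prefixes, longest common prefix of the suffixes.
        prefix = os.path.commonprefix([o.prefix[::-1] for o in group])[::-1]
        suffix = os.path.commonprefix([o.suffix for o in group])
        if prefix and suffix:
            patterns.append(Pattern("", prefix, middle, suffix))
    return patterns

def match_patterns(patterns, docs):
    """Step 4 (R' <- M_D(P)): scan the corpus for new tuples matching any pattern."""
    found = set()
    for url, text in docs.items():
        for p in patterns:
            pat_re = (re.escape(p.prefix) + r"(.+?)" +
                      re.escape(p.middle) + r"(.+?)" + re.escape(p.suffix))
            for m in re.finditer(pat_re, text):
                found.add((m.group(1), m.group(2)))
    return found

def dipre(seed, docs, target_size=10, max_iterations=5):
    """Steps 1-5: grow the target relation from a small seed until it is large enough."""
    relation = set(seed)
    for _ in range(max_iterations):
        occurrences = find_occurrences(relation, docs)   # step 2
        patterns = gen_patterns(occurrences)             # step 3
        relation |= match_patterns(patterns, docs)       # step 4
        if len(relation) >= target_size:                 # step 5
            break
    return relation

if __name__ == "__main__":
    # A made-up two-page "corpus"; the real experiments run over millions of pages.
    docs = {
        "http://example.com/a": "Books: Isaac Asimov wrote Foundation. Frank Herbert wrote Dune.",
        "http://example.com/b": "Books: Arthur C. Clarke wrote Rendezvous with Rama.",
    }
    seed = {("Isaac Asimov", "Foundation"), ("Frank Herbert", "Dune")}
    print(dipre(seed, docs, target_size=3))

Running the demo, the two seed pairs yield a shared "wrote" pattern, which in turn recovers the unseen (Arthur C. Clarke, Rendezvous with Rama) pair, illustrating one round of the pattern-relation bootstrap.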

Experiments and Evaluation

The author uses a repository of 24 million web pages, which is part of the Stanford WebBase and was used by the Google search engine as of 1998. The author also notes that Amazon pages were excluded because they were difficult to crawl. The experiments start with a seed of only five books and a simple pattern; the relation then grows quickly, although some bogus entries appear and, as the author mentions, the extraction drifts toward science-fiction books. After the final iteration, there are over 15,000 unique book titles.

The author chose a small sample of the output for manual verification. It turns out that 19 of the 20 checked entries are correct, the only exception being an entry that refers to an article rather than a book. Many of the extracted books also do not appear in the Amazon catalog or other lists, which demonstrates the power of information extraction over the web.