Difference between revisions of "Brin 1998 Extracting Patterns and Relations from the World Wide Web"

From Cohen Courses
Revision as of 22:00, 31 October 2010

Citation

Brin, S. 1998. Extracting patterns and relations from the world wide web. In Proceedings of WebDB Workshop of EDBT.

Online version

An online version of this paper is available at [1].

Summary

This paper introduces a novel approach that iteratively extracts relations from a small seed set of tuples, exploiting the duality between patterns and relations. It has inspired several subsequent works on bootstrapping for large-scale information extraction.

Key Contributions

The paper presents a study of the duality between patterns and relations, with a reference at the end to latent semantic indexing for its mathematical background and further study. The author uses the example of extracting (author, book-title) pairs to demonstrate the applicability and scalability of the method.

Algorithms and Experiments

  • Patterns, Relations, and the Duality

The author first presents the problem of pattern extraction for target relations. He then introduces a test problem of extracting (author, book-title) pairs and explains the patterns in detail. The author observes a duality between the two: given a good set of patterns, we can build a good set of tuples for the target relation, and conversely, given a good set of tuples, we can build a good set of patterns.

  • The DIPRE (Dual Iterative Pattern Relation Extraction) Algorithm

The algorithm extracts relations by exploiting the pattern-relation duality. It proceeds as follows:

1. R' <- Sample(seed). Start with a small sample R' of the target relation. This sample is given by the user and can be very small; in the author/title tests, it is a list of five books with authors.

2. O <- FindOccurrences(R', D). Find all occurrences of tuples of R' in the database D. Along with each tuple found, keep the context of every occurrence (URL and surrounding text).

3. P <- GenPatterns(O). Generate patterns based on sets of occurrences with similar context. The patterns need to have a low error rate, so it is important that they are not overly general. The higher the coverage of the patterns the better, but low coverage can be compensated for by a larger database.

4. R' <- MD(P). Search the database for tuples matching any of the patterns.

5. If R' is large enough, return. Otherwise, go to step 2.
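The loop above can be sketched in Python. This is a minimal illustration under simplifying assumptions, not Brin's implementation: the corpus is modeled as a list of (url, text) pages, occurrences keep a fixed 10-character context window, and GenPatterns is reduced to grouping occurrences by their middle context and taking the longest shared left/right contexts (Brin's full patterns also include a URL prefix and a specificity test, omitted here).

```python
import re

def find_occurrences(tuples, corpus, ctx=10):
    """Step 2: find each known (author, title) pair in the corpus, keeping
    the context of every occurrence (text before, between, and after)."""
    occs = []
    for url, text in corpus:
        for author, title in tuples:
            for m in re.finditer(re.escape(author) + "(.*?)" + re.escape(title), text):
                occs.append((text[max(0, m.start() - ctx):m.start()],  # left context
                             m.group(1),                               # middle
                             text[m.end():m.end() + ctx]))             # right context
    return occs

def _common_prefix(strings):
    """Longest common prefix of a list of strings."""
    shortest, longest = min(strings), max(strings)
    i = 0
    while i < len(shortest) and shortest[i] == longest[i]:
        i += 1
    return shortest[:i]

def gen_patterns(occurrences):
    """Step 3: group occurrences sharing the same middle context; each group
    yields one (prefix, middle, suffix) pattern. Patterns with an empty
    suffix are discarded as too general (a crude stand-in for Brin's
    specificity threshold)."""
    groups = {}
    for left, middle, right in occurrences:
        groups.setdefault(middle, []).append((left, right))
    patterns = []
    for middle, ctxs in groups.items():
        prefix = _common_prefix([l[::-1] for l, _ in ctxs])[::-1]  # longest common *suffix* of left contexts
        suffix = _common_prefix([r for _, r in ctxs])              # longest common prefix of right contexts
        if suffix:
            patterns.append((prefix, middle, suffix))
    return patterns

def match_patterns(patterns, corpus):
    """Step 4 (MD(P)): search the corpus for tuples matching any pattern."""
    found = set()
    for url, text in corpus:
        for prefix, middle, suffix in patterns:
            regex = re.escape(prefix) + "(.+?)" + re.escape(middle) + "(.+?)" + re.escape(suffix)
            for m in re.finditer(regex, text):
                found.add((m.group(1), m.group(2)))
    return found

def dipre(seed, corpus, target_size, max_iters=5):
    relation = set(seed)                                  # step 1: R' <- Sample(seed)
    for _ in range(max_iters):
        occurrences = find_occurrences(relation, corpus)  # step 2
        patterns = gen_patterns(occurrences)              # step 3
        relation |= match_patterns(patterns, corpus)      # step 4
        if len(relation) >= target_size:                  # step 5
            break
    return relation
```

On a toy one-page corpus listing books in a uniform format, seeding with two known (author, title) pairs yields a pattern whose middle context is ", author of ", which then recovers a third pair the seed did not contain.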

  • Example: Finding Authors and Titles
  • Experiments

The author mentions that attributes such as Occupation are transitive in nature: people whose names appear close to the target name are likely to share the target's occupation. Thus a transitivity-based model is proposed, mostly for attributes such as Occupation and Religion. The author uses a figure to illustrate this approach.

Related papers

The Ravichandran and Hovy, 2002 paper builds on similar pattern-based ideas, discussing a basic template-based approach for probabilistic modeling of attribute extraction. Mann and Yarowsky, 2005 later extends such methods for cross-document fusion of attribute values.