Brin 1998 Extracting Patterns and Relations from the World Wide Web

Latest revision as of 16:01, 7 December 2010

Citation

Brin, S. 1998. Extracting patterns and relations from the world wide web. In Proceedings of the WebDB Workshop at EDBT '98.

Online version

An online version of this paper is available at [1].

Summary

This paper introduces a novel approach that iteratively extracts relations from a small seed set by exploiting the duality between patterns and relations. It inspired much subsequent work on bootstrapping for large-scale information extraction.

Key Contributions

The paper presents a study of the duality between patterns and relations, with a closing pointer to latent semantic indexing as mathematical background for further study. The author uses the task of extracting (author, book-title) pairs to show the applicability and scalability of the method.

Algorithms

  • Patterns, Relations, and the Duality

The author first presents the problem of pattern extraction for a target relation, then introduces the test problem of extracting (author, book-title) pairs and explains the patterns in detail. The key observation is the duality between the two: given a good set of patterns, we can build a good set of tuples for the target relation; conversely, given a good set of tuples, we can build a good set of patterns.

  • The DIPRE (Dual Iterative Pattern Relation Extraction) Algorithm

The algorithm extracts relations by exploiting the pattern-relation duality. The steps are as follows:

1. R' <- Sample

Start with a small sample, R', of the target relation. This sample is provided by the user and can be very small; in the author/title tests, it is a list of five books with their authors.

2. O <- FindOccurrences(R';D)

Then, find all occurrences of tuples of R' in D. Along with each tuple found, keep the context of every occurrence (URL and surrounding text).

3. P <- GenPatterns(O)

Generate patterns based on the set of occurrences. As the author notes, this routine must generate patterns for sets of occurrences with similar context. The patterns need to have a low error rate, so it is important that they are not overly general. The higher the coverage of the patterns, the better; however, low coverage can be compensated for with a larger database.

4. R' <- MD(P)

Search the database for tuples matching any of the patterns.

5. If R' is large enough, return. Else go to step 2.
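The five steps above can be sketched in Python. This is a minimal, hypothetical illustration rather than the paper's implementation: it assumes an in-memory list of document strings, a fixed author-before-title order, and a simplified pattern of the form (prefix, middle, suffix), whereas the paper's patterns additionally carry an order flag and a urlprefix.

```python
import os
import re
from collections import defaultdict

def find_occurrences(relation, docs):
    """Step 2: find each known (author, title) pair in the documents and
    record its surrounding context as a (prefix, middle, suffix) triple."""
    occurrences = []
    for doc in docs:
        for author, title in relation:
            # Simplification: assume the 'author ... title' order only.
            regex = re.escape(author) + r"(.{1,10}?)" + re.escape(title)
            for m in re.finditer(regex, doc):
                prefix = doc[max(0, m.start() - 10):m.start()]
                suffix = doc[m.end():m.end() + 10]
                occurrences.append((prefix, m.group(1), suffix))
    return occurrences

def gen_patterns(occurrences):
    """Step 3: group occurrences sharing the same middle text and keep only
    the context common to the whole group, so patterns stay specific."""
    groups = defaultdict(list)
    for prefix, middle, suffix in occurrences:
        groups[middle].append((prefix, suffix))
    patterns = []
    for middle, contexts in groups.items():
        if len(contexts) < 2:  # a single occurrence is too little evidence
            continue
        # Longest common suffix of the prefixes (text just before the tuple)
        # and longest common prefix of the suffixes (text just after it).
        pre = os.path.commonprefix([p[::-1] for p, _ in contexts])[::-1]
        suf = os.path.commonprefix([s for _, s in contexts])
        patterns.append((pre, middle, suf))
    return patterns

def match_patterns(patterns, docs):
    """Step 4: search the documents for new tuples matching any pattern."""
    found = set()
    for doc in docs:
        for prefix, middle, suffix in patterns:
            regex = (re.escape(prefix) + r"(\w[\w .]*?)" + re.escape(middle)
                     + r"(\w[\w .]*?)" + re.escape(suffix))
            for m in re.finditer(regex, doc):
                found.add((m.group(1), m.group(2)))
    return found

def dipre(seed, docs, max_iters=3, enough=100):
    """Steps 1 and 5: start from the user-supplied seed and iterate until
    the relation R' is large enough or the iteration budget runs out."""
    relation = set(seed)
    for _ in range(max_iters):
        occurrences = find_occurrences(relation, docs)
        patterns = gen_patterns(occurrences)
        relation |= match_patterns(patterns, docs)
        if len(relation) >= enough:
            break
    return relation
```

On a toy corpus of HTML list items seeded with two (author, title) pairs, one pass generalizes their shared context into a pattern and uses it to extract a third, unseen pair.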

Experiments and Evaluation

The author uses a repository of 24 million web pages, part of the Stanford WebBase, which powered the Google search engine as of 1998. Amazon pages were excluded due to crawling difficulties. The experiment starts with only five books as the seed and a single simple pattern; the relation grows at a considerable pace, although some bogus tuples appear and, as the author notes, the results drift toward science-fiction books. After the final iteration, the relation contains over 15,000 unique book titles.

The author chose a small sample of the output for manual verification: 19 of 20 entries were correct, the single exception referring to an article rather than a book. Many of the extracted books appear neither in Amazon's catalog nor in other book lists, which demonstrates the power of information extraction over the web.