Liuy writeup of Brin

From Cohen Courses

This is a review of Brin_1999_extracting_patterns_and_relations_from_the_world_wide_web by user:Liuy.

The paper addresses extracting structured data from the World Wide Web, where the relevant information is scattered across a vast number of sources. It automatically extracts a relation of a given type from all of the related distributed sources. Technically, the authors exploit a duality between sets of patterns and relations: starting from a small sample, they grow it towards the target relation. In particular, DIPRE starts with a sample set of just a few items (these could be books, music, or movies) and expands it into a very large list, for example of books, almost automatically. The authors build an inverted index over a repository of 24 million web pages, a subset of the Stanford WebBase, and run the pattern-relation expansion on that set. However, I have some doubts about the quality of the results. With such a large and extremely distributed data set, they may have done some preprocessing of the data before experimenting, so the results may not be attributable purely to the algorithm proposed in the paper.
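
To make the pattern/relation duality summarized above more concrete, here is a minimal toy sketch of a bootstrapping loop in that spirit. It is not the paper's implementation: the function names (`find_occurrences`, `induce_patterns`, `apply_patterns`, `dipre_sketch`), the simplified `(prefix, middle, suffix)` pattern form, the context and support thresholds, and the tiny in-memory corpus are all assumptions made for illustration. There is no inverted index or URL handling here.

```python
import re
from collections import defaultdict
from os.path import commonprefix  # character-wise common prefix of strings


def find_occurrences(corpus, pairs, ctx=10):
    """Locate each known (a, b) pair in the text and record its context."""
    occs = []
    for doc in corpus:
        for a, b in pairs:
            # Assumption: a appears before b with a short gap in between.
            rx = re.escape(a) + r"(.{1,20}?)" + re.escape(b)
            for m in re.finditer(rx, doc):
                prefix = doc[max(0, m.start() - ctx):m.start()]
                suffix = doc[m.end():m.end() + ctx]
                occs.append((prefix, m.group(1), suffix))
    return occs


def induce_patterns(occs, min_support=2):
    """Group occurrences by their middle string and generalize the context."""
    by_middle = defaultdict(list)
    for prefix, middle, suffix in occs:
        by_middle[middle].append((prefix, suffix))
    patterns = set()
    for middle, ctxs in by_middle.items():
        if len(ctxs) < min_support:
            continue  # too few examples to trust this pattern
        # Longest common suffix of the prefixes (reverse, take prefix, reverse).
        pre = commonprefix([p[::-1] for p, _ in ctxs])[::-1]
        # Longest common prefix of the suffixes.
        suf = commonprefix([s for _, s in ctxs])
        if pre and suf:  # require context on both sides
            patterns.add((pre, middle, suf))
    return patterns


def apply_patterns(corpus, patterns):
    """Scan the corpus with each pattern and collect the pairs it matches."""
    found = set()
    for pre, middle, suf in patterns:
        rx = re.compile(
            re.escape(pre) + r"(.+?)" + re.escape(middle) + r"(.+?)" + re.escape(suf)
        )
        for doc in corpus:
            for m in rx.finditer(doc):
                found.add((m.group(1), m.group(2)))
    return found


def dipre_sketch(corpus, seeds, rounds=2):
    """Alternate between inducing patterns and extracting new pairs."""
    pairs = set(seeds)
    for _ in range(rounds):
        occs = find_occurrences(corpus, pairs)
        patterns = induce_patterns(occs)
        pairs |= apply_patterns(corpus, patterns)
    return pairs


# Hypothetical toy corpus: three list items sharing one markup convention.
corpus = [
    "<li><b>Isaac Asimov</b> wrote <i>Foundation</i></li>",
    "<li><b>Frank Herbert</b> wrote <i>Dune</i></li>",
    "<li><b>Ursula Le Guin</b> wrote <i>The Dispossessed</i></li>",
]
seeds = {("Isaac Asimov", "Foundation"), ("Frank Herbert", "Dune")}
result = dipre_sketch(corpus, seeds)
```

In this toy run the two seed pairs share the same surrounding markup, so a single pattern is induced and it picks up the third pair from the corpus. On real web data the hard part is exactly the one questioned above: how specific a pattern must be before it can be trusted, which this sketch reduces to a simple `min_support` threshold and non-empty context.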