Crescenzi et al, 2001

From Cohen Courses
Revision as of 05:51, 1 November 2010 by PastStudents

Citation

V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards automatic data extraction from large web sites. In Proc. of the 2001 Intl. Conf. on Very Large Data Bases, pages 109–118, 2001.


Online version

[[1]]

Summary

This paper introduces a technique for automatic wrapper generation that compares HTML pages and builds a wrapper from the similarities and differences between them. The technique targets data-intensive websites, i.e. sites that contain large amounts of data, and assumes that the pages of a given site share a fairly similar structure. Its main advantages are:

- It requires no user interaction during wrapper generation, so wrappers can be learned for an input website without any human supervision.

- It requires no prior knowledge of the structure of the input web pages.

Given two web pages as input, the technique compares their content and generates a wrapper from their similarities and dissimilarities. The authors develop a matching technique for this purpose. The matching algorithm works on two objects in parallel: (1) a list of tokens, taken from the page being processed, and (2) a wrapper. It initially takes one of the input pages as the wrapper and then iteratively refines that wrapper as it processes new pages. When the structure of a new page does not match the current wrapper, the algorithm tries to generalize the wrapper to resolve the mismatch.
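The refinement loop described above can be illustrated with a minimal Python sketch. It is a deliberate simplification, not the authors' algorithm: it handles only the case where two pages differ in their text content, replacing the differing strings with a `#PCDATA` field, and it stops with an error on any structural (tag) mismatch, which the paper resolves by generalizing the wrapper with optionals and iterators. The function names (`tokenize`, `generalize`, `learn_wrapper`, `extract`) and the sample pages are hypothetical and chosen only for illustration.

```python
import re

FIELD = "#PCDATA"  # placeholder marking a discovered data field
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")  # a tag, or a run of text


def tokenize(html):
    """Split HTML into tag tokens and whitespace-stripped text tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(html):
        token = match.group().strip()
        if token:  # drop whitespace-only text between tags
            tokens.append(token)
    return tokens


def is_tag(token):
    return token.startswith("<") and token.endswith(">")


def generalize(wrapper, sample):
    """Walk wrapper and sample in parallel and resolve string mismatches.

    Assumption of this sketch: both pages have the same tag structure.
    Tag mismatches are not handled here.
    """
    if len(wrapper) != len(sample):
        raise NotImplementedError("tag mismatch: not handled in this sketch")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)      # identical token: part of the template
        elif not is_tag(w) and not is_tag(s):
            result.append(FIELD)  # differing text: generalize to a field
        else:
            raise NotImplementedError("tag mismatch: not handled in this sketch")
    return result


def learn_wrapper(pages):
    """Start from the first page as the wrapper, then refine on the rest."""
    wrapper = tokenize(pages[0])
    for page in pages[1:]:
        wrapper = generalize(wrapper, tokenize(page))
    return wrapper


def extract(wrapper, page):
    """Return the text found at each field position of the wrapper."""
    return [s for w, s in zip(wrapper, tokenize(page)) if w == FIELD]


pages = [
    "<html><b>Title:</b> Databases <i>John Smith</i></html>",
    "<html><b>Title:</b> Web Mining <i>Paul Jones</i></html>",
]
wrapper = learn_wrapper(pages)
print(wrapper)
print(extract(wrapper, "<html><b>Title:</b> XML <i>Ann Lee</i></html>"))
```

In this sketch the constant label "Title:" stays in the wrapper because it is identical on both pages, while the book title and author name differ and so become fields.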

The technique was tested on several well-known data-intensive websites. For each site, 10-20 pages with similar structure were downloaded and given to the program to generate a wrapper. The results show that the technique was able to extract the dataset for 8 of the 10 tested websites.


Related papers