Crescenzi et al, 2001

== Citation ==

V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards automatic data extraction from large web sites. In Proc. of the 2001 Intl. Conf. on Very Large Data Bases, pages 109–118, 2001.


== Online version ==

[[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.8672&rep=rep1&type=pdf|RoadRunner]]

== Summary ==

This paper introduces a novel technique for automatic wrapper generation that compares HTML pages and builds a wrapper from the similarities and differences between them. The technique targets data-intensive websites, i.e., sites that publish large amounts of data, and assumes that the pages of a given website share a fairly similar structure. Its main advantages are:

- It does not require any interaction with the user during wrapper generation, so wrappers can be learned for an input website fully automatically, without human supervision.

- It requires no prior knowledge about the structure of the input web pages.

Given two web pages as input, the technique compares their content and generates a wrapper from the similarities and dissimilarities between them. The authors develop a matching algorithm that works on two objects at a time: (1) a list of tokens (a sample page) and (2) a wrapper. It initially takes one of the input pages as the wrapper and then iteratively refines the wrapper by processing further pages. While processing a new page it may find a mismatch between the structure of that page and the current wrapper; in that case it tries to generalize the wrapper to resolve the mismatch.
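The following is a minimal, illustrative sketch of this matching loop, not the authors' actual RoadRunner implementation. The flat token model, the tokenize() helper, and the #PCDATA placeholder are simplifying assumptions made here for illustration; only string mismatches are generalized in this sketch, while tag mismatches are discussed further below.

<pre>
import re

# Position whose text content varies from page to page (a data field).
PCDATA = "#PCDATA"

def tokenize(xhtml):
    """Split a page into a flat list of tag and text tokens (toy model)."""
    return [t.strip() for t in re.split(r"(<[^>]+>)", xhtml) if t.strip()]

def match(wrapper, page):
    """Refine the wrapper against one page, handling string mismatches only."""
    refined = []
    for w, p in zip(wrapper, page):
        if w == p or w == PCDATA:
            refined.append(w)              # structures agree, keep the token
        elif not w.startswith("<") and not p.startswith("<"):
            refined.append(PCDATA)         # string mismatch -> data field
        else:
            raise NotImplementedError("tag mismatch (optional or iterator)")
    return refined

def induce_wrapper(pages):
    """Take the first page as the initial wrapper and refine it with the rest."""
    wrapper = tokenize(pages[0])
    for page in pages[1:]:
        wrapper = match(wrapper, tokenize(page))
    return wrapper

if __name__ == "__main__":
    pages = ["<html><b>Database Systems</b><i>$39.95</i></html>",
             "<html><b>Programming Perl</b><i>$44.50</i></html>"]
    print(induce_wrapper(pages))
    # ['<html>', '<b>', '#PCDATA', '</b>', '<i>', '#PCDATA', '</i>', '</html>']
</pre>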

Given two web pages, the algorithm first converts the HTML files to XHTML. It then selects one of the pages, say page 1, and builds the initial wrapper from it. The wrapper is then refined against the second page by resolving <i>mismatches</i> between the wrapper and that page's content. There are two kinds of mismatches:

- String mismatches: mismatches between different strings occurring at the same position in the input web pages.

- Tag mismatches: mismatches between different tags in the wrapper and the new web page. Both kinds are distinguished in the small sketch below.
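Continuing the toy token model above, here is a small sketch of how the two mismatch kinds can be told apart while scanning the wrapper and a new page in parallel. Treating every token that starts with '<' as a tag is an assumption of this sketch, not the paper's exact formalization.

<pre>
def is_tag(token):
    """Toy convention: any token starting with '<' is a tag."""
    return isinstance(token, str) and token.startswith("<")

def first_mismatch(wrapper, page):
    """Scan wrapper and page in parallel and return (kind, position) of the
    first mismatch, where kind is 'string' or 'tag', or None if they align."""
    for i, (w, p) in enumerate(zip(wrapper, page)):
        if w == p or w == "#PCDATA":      # equal tokens, or an already
            continue                      # generalized data field
        if not is_tag(w) and not is_tag(p):
            return ("string", i)          # two different text strings
        return ("tag", i)                 # at least one side is a tag
    return None
</pre>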

In most cases, mismatches can be resolved in the following ways:

- String mismatches usually correspond to different values of the same database field. In this case the differing values are simply added to the extracted data.

- Tag mismatches (discovering optionals): a mismatch between two tags means that there is either an optional field or an iterator. In the optional case, either the new web page or the wrapper contains a piece of HTML that is not present on the other side. The algorithm first determines the boundary of the optional piece and then generalizes the wrapper to mark it as optional (a simplified code sketch of this case follows the list).

- Tag mismatches (iterators): tag mismatches can also arise from repeated patterns with different cardinalities, e.g., a different number of book titles on each page. To resolve such mismatches, the algorithm has to identify the repeated pattern; the paper describes this case in detail.
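Below is a grossly simplified sketch, continuing the toy token model from the earlier sketches, of resolving a tag mismatch as an optional: find the point where the two token streams resynchronize and wrap the skipped piece in an optional marker. The ('?', [...]) marker is an invention of this sketch, not the paper's grammar notation, and real iterator discovery (finding and collapsing repeated patterns) is considerably more involved and is not attempted here.

<pre>
def resolve_optional(wrapper, page, i, j):
    """On a tag mismatch at wrapper[i] / page[j], decide which side holds the
    extra piece of HTML, find its boundary (the point where the two token
    streams meet again), and mark that piece as optional in the wrapper.
    Returns (new_wrapper, wrapper_pos, page_pos) at which to resume matching,
    or None if the mismatch looks like an iterator instead."""
    # Case 1: the page contains extra tokens that the wrapper lacks.
    if wrapper[i] in page[j:]:
        end = j + page[j:].index(wrapper[i])        # resynchronization point
        optional_piece = ("?", page[j:end])         # piece seen only on the page
        new_wrapper = wrapper[:i] + [optional_piece] + wrapper[i:]
        return new_wrapper, i + 1, end
    # Case 2: the wrapper contains extra tokens that the page lacks.
    if page[j] in wrapper[i:]:
        end = i + wrapper[i:].index(page[j])
        optional_piece = ("?", wrapper[i:end])      # piece seen only in wrapper
        new_wrapper = wrapper[:i] + [optional_piece] + wrapper[end:]
        return new_wrapper, i + 1, j
    # Neither side resynchronizes: likely a repeated pattern (iterator case).
    return None

if __name__ == "__main__":
    w = ["<html>", "<b>", "#PCDATA", "</b>", "</html>"]
    p = ["<html>", "<b>", "Perl", "</b>", "<hr/>", "</html>"]
    # Tag mismatch at w[4]='</html>' vs p[4]='<hr/>': the page has an extra,
    # optional <hr/> that the wrapper must be generalized to allow.
    print(resolve_optional(w, p, 4, 4))
    # (['<html>', '<b>', '#PCDATA', '</b>', ('?', ['<hr/>']), '</html>'], 5, 5)
</pre>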

The above gives simple examples of how mismatches are resolved; the paper discusses more complex cases in detail. In each case the wrapper is generalized according to the type of mismatch encountered.

The technique is evaluated on several well-known data-intensive web sites. For each site the authors downloaded 10-20 pages with a similar structure and gave them to the program to generate a wrapper. The results show that the technique was able to extract the datasets of 8 of the 10 tested websites.