Web Data Extraction Based on Partial Tree Alignment

Latest revision as of 14:56, 26 October 2010

Citation

Zhai, Y., Liu, B.: Web data extraction based on partial tree alignment. In: WWW. (2005) 76–85.

Online version

[|Zhai-WWW05]

Summary

This paper studies the problem of extracting data from semi-structured web pages, which has been widely studied in the information extraction community. Most existing techniques for extracting structured information from the Web suffer from one of two limitations: they either require human labeling of many web pages, or they make assumptions that do not hold for many web sites.

This paper presents a novel technique that avoids both limitations. The method has two phases: 1) identifying the data records in the input web page, and 2) extracting data from the identified records. To identify the data records, the authors use the MDR algorithm, which is presented in detail in [1]. Once the records have been identified, they use a partial tree alignment technique to extract the data from each record.

The main differences between this paper and the MDR algorithm are the following:

- The authors improve the MDR algorithm by using visual information, which lets them identify individual data records more accurately.

- They also propose a new technique, based on tree matching, for extracting the content of each data record.
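
The tree matching mentioned above can be illustrated with Simple Tree Matching, a common dynamic-programming formulation that counts the largest number of node pairs that can be matched between two ordered trees. This is only a minimal sketch: the summary does not say which matching algorithm is used, so the choice of Simple Tree Matching, the `Node` class, and the function name are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tag-tree node: an HTML tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the size of the maximum matching between trees a and b.

    Roots with different tags do not match at all. Otherwise the children
    are aligned in order with a dynamic program similar to longest common
    subsequence, where matching two children is worth the size of the
    matching between their subtrees.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j]: best score using the first i children of a and first j of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            best[i][j] = max(
                best[i - 1][j],             # skip the i-th child of a
                best[i][j - 1],             # skip the j-th child of b
                best[i - 1][j - 1] + pair,  # match the two children
            )
    return best[m][n] + 1  # +1 for the matched roots


# Two made-up records: a row with two cells and a row with three cells.
record_a = Node("tr", [Node("td"), Node("td")])
record_b = Node("tr", [Node("td"), Node("td"), Node("td")])
print(simple_tree_matching(record_a, record_b))  # prints 3
```

A score like this indicates how well two record trees line up; an alignment procedure can then use the matched node pairs to decide which items in different records belong to the same column.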

The authors tested their system on a number of websites. They report a recall of 98.18% and a precision of 99.68%.
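
For reference, recall and precision are conventionally defined as below. The symbols are assumptions introduced here, not notation from the summary: $TP$ is the number of correctly extracted items, $FP$ the number of extracted items that are wrong, and $FN$ the number of true items that were missed.

```latex
\[
\text{recall} = \frac{TP}{TP + FN},
\qquad
\text{precision} = \frac{TP}{TP + FP}
\]
```

Under these definitions, high recall means few true items are missed, and high precision means few of the extracted items are wrong.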


References

[1] Liu, B., Grossman, R. and Zhai, Y. “Mining data records from Web pages.” KDD-03, 2003.