Web Data Extraction Based on Partial Tree Alignment

From Cohen Courses

Revision as of 14:38, 26 October 2010

Citation

Zhai, Y., Liu, B.: Web data extraction based on partial tree alignment. In: WWW. (2005) 76–85.

Online version

[|Zhai-WWW05]

Summary

This paper studies the problem of extracting data from semi-structured web pages, which has received considerable attention in recent research. Most existing techniques for extracting structured information from the Web suffer from one of two limitations: they require human labeling of many web pages, or they rely on assumptions that do not hold for many web sites.

The paper presents a novel technique that avoids both limitations. The method has two phases: (1) identifying the data fields in the input web page, and (2) extracting data from the identified data fields. To identify the data fields, the authors use the MDR algorithm, which is presented in detail in [1]. Once the appropriate fields have been identified, they apply a partial tree alignment technique to extract the data from each field.
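
The summary above does not say how two tag trees are compared during alignment. As a rough illustration only, the sketch below implements simple tree matching, a standard dynamic-programming measure that counts the largest number of node pairs that can be matched between two ordered, labeled trees. It is a building block one might use before aligning trees, not the authors' full partial tree alignment procedure. The `Node` class, the function name `simple_tree_matching`, and the two sample records are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of an ordered, labeled tag tree (hypothetical helper type)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the size of a maximum matching between trees a and b.

    Roots must carry the same tag to match at all. Children are matched
    in order, so the child lists are aligned with a dynamic program
    similar to longest common subsequence.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j] = best score using the first i children of a and first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 for the matched roots


# Two made-up records: the second one lacks the last cell of the first.
record_a = Node("tr", [
    Node("td", [Node("img")]),
    Node("td", [Node("a")]),
    Node("td", [Node("span")]),
])
record_b = Node("tr", [
    Node("td", [Node("img")]),
    Node("td", [Node("a")]),
])

print(simple_tree_matching(record_a, record_b))  # -> 5
```

In this example the roots and the first two cells match node for node, giving a score of 5, while the third cell of `record_a` has no counterpart in `record_b`. An alignment step would then need to decide where such unmatched nodes belong, which is the part this sketch leaves out.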

The authors tested their system on a number of different web sites and report a recall of about 98.18% and a precision of 99.68%.


References

[1] Liu, B., Grossman, R., Zhai, Y.: Mining data records from Web pages. In: KDD. (2003)