Difference between revisions of "MDR algorithm"


Revision as of 09:25, 9 October 2010

Citation

[1] Liu, B., Grossman, R. and Zhai, Y. “Mining data records from Web pages.” KDD-03, 2003.

Online version

KDD-03

Summary

MDR is an algorithm developed by Liu et al. [1] to extract data from a given web page. In its first phase, the algorithm finds the regions of the HTML file that contain descriptions of similar items (the data records to be extracted). These regions are called data regions. In its second phase, the algorithm identifies the data fields within each extracted region.

To find the regions of the HTML file that contain data records, the algorithm first builds a DOM tree from the input HTML file. It then looks for similar adjacent nodes in the DOM tree, where the similarity of two nodes is measured with the edit distance function. All nodes that are classified as similar and are adjacent in the DOM tree (i.e. have the same parent) are considered part of the same region.
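The region-finding step can be illustrated with a minimal Python sketch. It is not the implementation from [1]: the Node class, the function names, the flattening of each subtree into a tag sequence, the normalization by the longer sequence, and the 0.3 threshold are all assumptions made for illustration. It also compares single sibling nodes only, which is a simplification of the node comparison described in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical DOM node: a tag name plus child nodes."""
    tag: str
    children: list = field(default_factory=list)


def tag_sequence(node):
    """Flatten a subtree into a preorder list of tag names."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer sequence (assumed)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def find_regions(parent, threshold=0.3):
    """Group runs of adjacent, similar children of `parent` into regions.

    The threshold value is an assumption, not taken from the source.
    """
    regions, run = [], []
    for child in parent.children:
        if run and normalized_distance(tag_sequence(run[-1]),
                                       tag_sequence(child)) <= threshold:
            run.append(child)
        else:
            if len(run) >= 2:      # a region needs at least two similar nodes
                regions.append(run)
            run = [child]
    if len(run) >= 2:
        regions.append(run)
    return regions


def item(extra=()):
    """Build a hypothetical product entry: <div><img/><p/>...</div>."""
    return Node("div", [Node("img"), Node("p")] + [Node(t) for t in extra])


page = Node("body", [Node("h1"), item(), item(), item(["a"]), Node("footer")])
regions = find_regions(page)
```

On this toy page the three adjacent div nodes fall into one region, while the h1 and footer nodes are left out because their tag sequences differ too much from their neighbours.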

The next step of the algorithm is to find the data fields in each extracted region. A region does not necessarily contain a single data field; it may consist of several. To extract the relevant fields in each region, the authors use the partial tree alignment technique.