MDR algorithm
Latest revision as of 13:40, 26 October 2010
Citation
[1] Liu, B., Grossman, R. and Zhai, Y. “Mining data records from Web pages.” KDD-03, 2003.
Online version
www.cs.uic.edu/~liub/publications/kdd2003-dataRecord.pdf
Summary
MDR is a method developed by Liu et al. to extract data from a given web page. The algorithm first finds the regions of the HTML file that contain descriptions of similar items (the data records to be extracted). These regions are called data regions. The second phase of the algorithm identifies the data fields in each extracted region.
To find the regions of the HTML file that contain data records, the authors first build a DOM tree from the input HTML file. They then look for similar adjacent nodes in the DOM tree, where the similarity of two nodes is measured with the edit distance function. All nodes that are classified as similar and are adjacent in the DOM tree (i.e. have the same parent) are considered part of the same region.
The next step of the algorithm is to find the data fields in each extracted region. A region does not necessarily contain a single data field; it may consist of several. To extract the relevant fields in each region, the authors use a partial tree alignment technique.
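The region-finding step can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each node is compared through the sequence of tag names in its subtree, that the edit distance is normalized by the longer sequence, and that a threshold of 0.3 separates similar from dissimilar nodes. The helper names (`tag_string`, `normalized_distance`, `find_regions`) and the sample page are hypothetical. It also compares single adjacent siblings only, which is a simplification.

```python
import xml.etree.ElementTree as ET


def tag_string(node):
    """Flatten a subtree into its sequence of tag names (document order)."""
    return [el.tag for el in node.iter()]


def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer sequence (an assumption)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def find_regions(parent, threshold=0.3):
    """Group runs of adjacent, similar children of `parent` into regions.

    The threshold value is an illustrative assumption, not a value
    taken from the source.
    """
    children = list(parent)
    regions, run = [], children[:1]
    for prev, curr in zip(children, children[1:]):
        if normalized_distance(tag_string(prev), tag_string(curr)) <= threshold:
            run.append(curr)
        else:
            if len(run) > 1:
                regions.append(run)
            run = [curr]
    if len(run) > 1:
        regions.append(run)
    return regions


# Hypothetical, well-formed sample page: three product-like <div> siblings
# between a heading and a paragraph.
page = ET.fromstring(
    "<body><h1/>"
    "<div><img/><b/><span/></div>"
    "<div><img/><b/><span/></div>"
    "<div><img/><b/></div>"
    "<p/></body>"
)
regions = find_regions(page)
print([[el.tag for el in region] for region in regions])
```

In this sample the three `div` siblings form one region, while the heading and the trailing paragraph are left out because their tag sequences differ too much from their neighbours. The sketch uses `xml.etree.ElementTree` for brevity, so it needs well-formed markup; real web pages would need a tolerant HTML parser.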