MDR algorithm

Citation

[1] Liu, B., Grossman, R. and Zhai, Y. “Mining data records in Web pages.” KDD-03, 2003.

Online version

www.cs.uic.edu/~liub/publications/kdd2003-dataRecord.pdf

Summary

MDR is a method developed by Liu et al. [1] to extract data records from a given web page. The algorithm first finds the regions of the HTML file that contain descriptions of similar items (the data records to be extracted). These regions are called data regions. The second phase of the algorithm identifies the data fields within each extracted region.

To find the regions of the HTML file that contain data records, the algorithm first builds a DOM tree from the input HTML file. It then searches the DOM tree for similar adjacent nodes, measuring the similarity of two nodes with the edit distance function. All nodes that are classified as similar and are adjacent in the DOM tree (i.e., have the same parent) are treated as one region.
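
The region-finding step can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the pre-order tag flattening, the normalization by the longer sequence, and the 0.3 threshold are all assumptions made for illustration, and the sketch compares single sibling nodes only rather than combinations of several adjacent nodes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag name plus children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def tag_sequence(node: Node) -> List[str]:
    """Flatten a subtree into its tag names in pre-order."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two tag sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similar(a: Node, b: Node, threshold: float = 0.3) -> bool:
    """Assumed criterion: normalized edit distance at or below the threshold."""
    sa, sb = tag_sequence(a), tag_sequence(b)
    return edit_distance(sa, sb) / max(len(sa), len(sb)) <= threshold


def find_regions(parent: Node, threshold: float = 0.3) -> List[List[Node]]:
    """Group runs of adjacent, similar children of one parent into regions."""
    regions: List[List[Node]] = []
    run: List[Node] = []
    for child in parent.children:
        if run and similar(run[-1], child, threshold):
            run.append(child)
        else:
            if len(run) >= 2:  # a region needs at least two similar nodes
                regions.append(run)
            run = [child]
    if len(run) >= 2:
        regions.append(run)
    return regions


def item(with_image: bool) -> Node:
    """Build a toy table row, optionally with a leading image cell."""
    cells = [Node("td", [Node("a")]), Node("td", [Node("span")])]
    if with_image:
        cells.insert(0, Node("td", [Node("img")]))
    return Node("tr", cells)


table = Node("table", [
    Node("caption"),
    item(True), item(True), item(False),
    Node("tfoot", [Node("tr", [Node("td")])]),
])
regions = find_regions(table)
print([[n.tag for n in region] for region in regions])
```

In this toy table the three `tr` rows form one region even though the last row lacks an image cell, while the `caption` and `tfoot` siblings are left out. Comparing each node only with its immediate predecessor keeps the sketch short; a tolerance for small structural differences is what the edit distance provides.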

The next step of the algorithm is to find the data fields in each extracted region. A region does not necessarily contain a single data field; it may consist of several. To extract the relevant fields in each region, the authors use a partial tree alignment technique.
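
The goal of the field-extraction step, placing corresponding fields of different records in the same column, can be shown with a deliberately simpler stand-in. The sketch below is not partial tree alignment: it flattens each record into a list of hypothetical `(tag, text)` pairs, takes the longest record as the seed, and aligns the other records to it with Python's `difflib.SequenceMatcher`. Fields with no counterpart in the seed are simply dropped, which the tree-based technique is designed to avoid.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Field = Tuple[str, str]  # assumed record format: (tag, text)


def align_to_seed(seed_tags: List[str], record: List[Field]) -> List[Optional[str]]:
    """Place each field of `record` in the column of the seed tag it matches."""
    row: List[Optional[str]] = [None] * len(seed_tags)
    record_tags = [tag for tag, _ in record]
    matcher = SequenceMatcher(a=seed_tags, b=record_tags, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            row[block.a + k] = record[block.b + k][1]
    return row


def align_records(records: List[List[Field]]) -> List[List[Optional[str]]]:
    """Align every record against the longest one, used here as the seed."""
    seed = max(records, key=len)
    seed_tags = [tag for tag, _ in seed]
    return [align_to_seed(seed_tags, record) for record in records]


records = [
    [("img", "w.png"), ("a", "Widget"), ("span", "$5")],
    [("a", "Gadget"), ("span", "$7")],
]
for row in align_records(records):
    print(row)
# ['w.png', 'Widget', '$5']
# [None, 'Gadget', '$7']
```

The second record has no image, so its first column is left empty while its name and price line up with those of the first record.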