Liuy writeup of Borkar et al 2001

From Cohen Courses

This is a review of Borkar 2001, Automatic Segmentation of Text Into Structured Records, by user:Liuy.

This paper addresses the problem of segmenting free-form text records into structured fields. The data is entered by humans at different points in time, so it contains many irregularities, which makes the extraction task harder. The authors present DATAMOLD, a probabilistic tool built on top of HMMs that incorporates multiple sources of information. The Viterbi algorithm (QPQP), the standard algorithm for finding the most probable state path, is used to combine these information sources in a single optimization problem. To perform the text segmentation, the structure of the HMM is learned as a two-level nested model. Smoothing is applied to handle symbols that were not seen in training. Feature selection is performed over a concept hierarchy.
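As a rough illustration of the decoding step described above, here is a minimal sketch of Viterbi decoding over a flat HMM with smoothed emission probabilities. It is not the DATAMOLD implementation: the model is flat rather than nested, add-one (Laplace) smoothing stands in for whatever smoothing the paper uses, and the state names, toy training pairs, and probabilities are all made up for the example.

```python
import math
from collections import Counter


def train_emissions(tagged_tokens, vocab_size, alpha=1.0):
    """Build emit(state, symbol) with add-alpha smoothing (an assumed stand-in)."""
    counts = {}
    for state, symbol in tagged_tokens:
        counts.setdefault(state, Counter())[symbol] += 1

    def emit(state, symbol):
        c = counts.get(state, Counter())
        total = sum(c.values())
        # Unseen symbols get a small non-zero probability instead of zero.
        return (c[symbol] + alpha) / (total + alpha * vocab_size)

    return emit


def viterbi(observations, states, start_p, trans_p, emit):
    """Return the most probable state sequence for the observations."""
    if not observations:
        return []
    # Each column maps state -> (best log-probability, backpointer).
    first = observations[0]
    columns = [{s: (math.log(start_p[s]) + math.log(emit(s, first)), None)
                for s in states}]
    for obs in observations[1:]:
        prev = columns[-1]
        col = {}
        for s in states:
            score, back = max(
                (prev[r][0] + math.log(trans_p[r][s]), r) for r in states
            )
            col[s] = (score + math.log(emit(s, obs)), back)
        columns.append(col)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: columns[-1][s][0])]
    for col in reversed(columns[1:]):
        path.append(col[path[-1]][1])
    path.reverse()
    return path


# Hypothetical toy data: address tokens labeled with made-up field names.
STATES = ["HouseNo", "Street", "City"]
TAGGED = [
    ("HouseNo", "12"), ("Street", "main"), ("Street", "street"), ("City", "boston"),
    ("HouseNo", "7"), ("Street", "oak"), ("Street", "avenue"), ("City", "austin"),
]
START_P = {"HouseNo": 0.8, "Street": 0.1, "City": 0.1}
TRANS_P = {
    "HouseNo": {"HouseNo": 0.1, "Street": 0.8, "City": 0.1},
    "Street": {"HouseNo": 0.05, "Street": 0.5, "City": 0.45},
    "City": {"HouseNo": 0.1, "Street": 0.1, "City": 0.8},
}
# Vocabulary size: the 8 distinct training symbols plus one slot for unknowns.
emit = train_emissions(TAGGED, vocab_size=9)

# "elm" never appears in training; smoothing keeps its path probability non-zero.
path = viterbi(["12", "elm", "street", "boston"], STATES, START_P, TRANS_P, emit)
print(path)
```

Working in log space avoids numerical underflow on longer records. Without the smoothing term, the unseen token "elm" would give every path a probability of zero and the decoder could not rank them.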


I like this work, in particular because it is technically strong. Compared with existing work, which consists mostly of rule-based IE systems, this work exploits multiple external sources of information through a nested structure, rather than only the subset of information a plain HMM uses. It also lets the algorithm itself decide which rules to eliminate, instead of relying on heuristics. However, the experiments show that this procedure needs a considerable amount of training data. I think it is necessary to develop strategies for active selection of examples.