Wka writeup of Borkar et al 2001

This is a review of Borkar et al. 2001, "Automatic Segmentation of Text Into Structured Records", by user:wka.

The authors introduce DataMold, a tool for segmenting text into structured records (structure extraction). It uses two-level nested HMMs and achieves strong results with little training data. DataMold implements a learning model that can exploit cues from several sources.

The outer HMM handles the ordering relationships amongst segments, treating each segment as a single state, and an inner HMM nested in that state captures the segment's constituent parts. A parallel-path structure for inner HMMs is used to capture segments of the same type that vary in length, and each inner HMM is learnt in conjunction with its neighbors.
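As a rough illustration of the parallel-path idea, the sketch below builds only the topology of one inner HMM: one chain of states per segment length, all sharing a start and an end state. The function name, the state naming scheme, and the choice of non-emitting `start`/`end` states are assumptions of this sketch, not details taken from the paper.

```python
def parallel_path_topology(max_len):
    """Return (states, edges) for a hypothetical parallel-path inner HMM.

    Path k is a chain of k states, so a segment of k tokens is matched
    by exactly one path. 'start' and 'end' are assumed non-emitting.
    """
    states = ["start", "end"]
    edges = []
    for k in range(1, max_len + 1):
        # States of the path that matches segments of exactly k tokens.
        path = [f"p{k}_s{i}" for i in range(1, k + 1)]
        states.extend(path)
        edges.append(("start", path[0]))
        edges.extend(zip(path, path[1:]))  # chain within the path
        edges.append((path[-1], "end"))
    return states, edges


states, edges = parallel_path_topology(3)
```

Transition and emission probabilities would then be estimated per state; that part is left out here.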

In learning transition and emission probabilities, they use absolute discounting for smoothing instead of Laplace smoothing to better handle the probabilities of unseen symbols.
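To make the smoothing step concrete, here is a minimal sketch of absolute discounting in its generic count-based form: a fixed amount `d` is subtracted from the count of every seen symbol, and the freed probability mass is shared equally among unseen symbols. The function name, the value of `d`, and the assumption of a known vocabulary size are illustrative; the paper's exact parameterization may differ.

```python
from collections import Counter


def absolute_discount(observed, vocab_size, d=0.5):
    """Return (seen_probs, unseen_prob) for one state's emissions.

    observed   : symbols emitted by the state in training (non-empty)
    vocab_size : total number of distinct symbols (assumed known)
    d          : discount taken from each seen count, with 0 < d < 1
    """
    counts = Counter(observed)
    total = sum(counts.values())
    n_seen = len(counts)
    n_unseen = vocab_size - n_seen
    if total == 0 or n_unseen < 0:
        raise ValueError("need observations and vocab_size >= distinct symbols")
    if n_unseen == 0:
        # Nothing is unseen, so no mass needs to be reserved.
        return {w: c / total for w, c in counts.items()}, 0.0
    seen_probs = {w: (c - d) / total for w, c in counts.items()}
    # Mass removed from seen symbols, split equally among unseen ones.
    unseen_prob = (d * n_seen / total) / n_unseen
    return seen_probs, unseen_prob


seen_probs, unseen_prob = absolute_discount(["a", "a", "b"], vocab_size=5)
```

Unlike Laplace smoothing, the mass given to unseen symbols here depends on how many distinct symbols were seen, not on adding one to every count in the vocabulary.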

DataMold performs hierarchical feature selection by pruning a taxonomy tree. It also integrates a database of semantic relationships amongst symbols of different elements, which constrains the allowed combinations of values. The authors modify the standard Viterbi algorithm used with HMMs so that it does not explore paths that violate these constraints, while maintaining the optimality of the obtained path.
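The general idea of pruning constraint-violating paths during decoding can be sketched as follows. This is a deliberately simplified version, not the paper's algorithm: it only supports constraints on pairs of adjacent states, expressed through a hypothetical `allowed(prev, cur)` predicate, whereas the constraints described above relate values of different elements. Under that simplification, skipping forbidden transitions still returns the best path among those that satisfy the constraints.

```python
import math


def constrained_viterbi(obs, states, log_start, log_trans, log_emit, allowed):
    """Best state path using only transitions accepted by `allowed`.

    log_start[s], log_trans[p][s], log_emit[s][o] are log-probabilities.
    Returns None if obs is empty or no path satisfies the constraints.
    """
    if not obs:
        return None
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for cur in states:
            score, arg = -math.inf, None
            for prev in states:
                if not allowed(prev, cur):
                    continue  # prune: this transition violates a constraint
                cand = best[t - 1][prev] + log_trans[prev][cur]
                if cand > score:
                    score, arg = cand, prev
            best[t][cur] = score + log_emit[cur][obs[t]]
            back[t][cur] = arg
    last = max(states, key=lambda s: best[-1][s])
    if best[-1][last] == -math.inf:
        return None
    # Follow back-pointers from the best final state.
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

Passing `allowed=lambda prev, cur: True` recovers ordinary Viterbi decoding.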

Results of applying the tool are shown for two domains: addresses and bibliography records. The nested HMMs reach 91% accuracy with only 10 training addresses; overall, training on 10% of the available data comes within 1% of the peak accuracy, hence the conclusion that HMMs are "fast learners". DataMold is shown to beat other models, especially on irregular data such as international addresses.


Some comments:

  • It's an interesting paper for introducing the use of nested HMMs for text segmentation, and for being an informative read on the background material as well.
  • They could have improved it by exploring why feature selection using a hierarchical taxonomy tree had only a small effect on some datasets.