Borkar, SIGMOD 2001
Citation
Borkar, V., Deshmukh, K., and Sarawagi, S. 2001. Automatic segmentation of text into structured records. SIGMOD Rec. 30, 2 (Jun. 2001), 175-186.
Online version
Summary
In this paper, the authors used a hidden Markov model (HMM) approach to extract structured information from text by segmenting it.
- Instead of using the naive HMM, which represents every entity with a single state, they proposed a nested HMM in which each entity has its own inner HMM. In this two-level model, the outer HMM captures the ordering among the entities, and the inner HMMs handle entities of different lengths.
- The authors used a hierarchical feature selection method: they started with the original symbols and pruned the feature taxonomy until a validation set indicated the right level.
- They used standard training methods to estimate the probabilities, and absolute discounting as the smoothing method to distinguish between known and unknown symbols.
- The authors modified the Viterbi algorithm so that its search does not explore paths that are invalid under given constraints.
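The idea behind the constrained Viterbi search in the last point can be sketched as follows. This is a minimal illustration and not the authors' procedure: the function name `constrained_viterbi`, the `allowed(prev_state, state, t)` constraint hook, and the toy states, tokens, and probabilities are all assumptions made up for the example. The sketch assumes a flat (non-nested) HMM with strictly positive probabilities and simply skips any transition that the constraint rejects.

```python
import math

NEG_INF = float("-inf")


def constrained_viterbi(obs, states, start_p, trans_p, emit_p, allowed):
    """Return the most likely state path for obs, or None if no valid path exists.

    allowed(prev_state, state, t) is a hypothetical hook: it returns False
    for transitions that violate a constraint, and those are never explored.
    All probabilities are assumed to be strictly positive.
    """
    if not obs:
        return []
    log = math.log
    # best[t][s]: best log-probability of a valid path ending in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            top_score, top_prev = NEG_INF, None
            for p in states:
                if not allowed(p, s, t):
                    continue  # prune the invalid transition
                score = best[t - 1][p] + log(trans_p[p][s])
                if score > top_score:
                    top_score, top_prev = score, p
            best[t][s] = top_score + log(emit_p[s][obs[t]])
            back[t][s] = top_prev
    last = max(states, key=lambda s: best[-1][s])
    if best[-1][last] == NEG_INF:
        return None  # every path was ruled out by the constraint
    # Follow the back-pointers from the best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy address example; every number below is invented for illustration.
states = ["Road", "City"]
tokens = ["park", "salem"]
start_p = {"Road": 0.6, "City": 0.4}
trans_p = {"Road": {"Road": 0.5, "City": 0.5},
           "City": {"Road": 0.5, "City": 0.5}}
emit_p = {"Road": {"park": 0.9, "salem": 0.1},
          "City": {"park": 0.1, "salem": 0.9}}


def no_road_to_city(prev_state, state, t):
    # Hypothetical constraint: forbid moving from Road to City
    return not (prev_state == "Road" and state == "City")


print(constrained_viterbi(tokens, states, start_p, trans_p, emit_p,
                          lambda p, s, t: True))
print(constrained_viterbi(tokens, states, start_p, trans_p, emit_p,
                          no_road_to_city))
```

With no constraint the toy model labels the tokens Road, City; with the hypothetical constraint the best remaining valid path is Road, Road. Pruning inside the inner loop keeps the cost the same as ordinary Viterbi, but it only handles constraints that can be checked on a single transition; constraints over a whole segmentation would need a different search.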
The authors experimented on three real-life data sets and found that their nested model performs better than both the naive HMM and the independent HMM approach, which uses one HMM for each entity. The results show that, overall, HMMs perform better than rule-learning systems, which have much lower recall than HMMs. The paper also shows that HMMs are fast learners: they reach their maximum accuracy with a small number of documents.