Borkar et al, SIGMOD 2001
Citation
Borkar, V., Deshmukh, K., and Sarawagi, S. 2001. Automatic segmentation of text into structured records. SIGMOD Rec. 30, 2 (Jun. 2001), 175-186.
Online version
Summary
In this paper, the authors used HMM approach to extract structured information from text by doing text segmentation.
- Instead of using the naive HMM which represents every entity with a state, they proposed a nested HMM model where they use an inner HMM for each entity. In this two level model the outer HMM helps with the order among the entities and inner HMMs deal with entities with different length.
- The authors used a hierarchical feature selection method where they started with the original symbols and pruned till they found the right level of taxonomy by using validation set.
- They used standard training methods to estimate the probabilities and absolute smoothing to differentiate known and unknown symbols.
- The authors modified the Viterbi algorithm in order to restrict Viterbi search to explore paths that are invalid due to some constraints.
The authors experimented on three real-life data sets (US addresses, Student addresses, Company addresses) and found that their nested model performs better than naive HMM or Independent HMM which uses one HMM for each entity. The results show that at overall HMMs perform better than rule learning systems which has a very low recall when compared to HMMs. The paper also proves that HMM are fast learners which makes them reach their maximum accuracy with small number of documents.
Related Papers
An earlier paper Seymore et al, AAAI 1999, also used HMM and represented different type of entities in the same model. They showed that an HMM with multiple states per entity outperforms an HMM with one state per entity.