Sgardine writesup Borkar 2001

This is a review of Borkar_2001_Automatic_Segmentation_of_Text_Into_Structured_Records by user:sgardine.

Synopsis

The paper presents DATAMOLD, a system for labeling fields in short text snippets (paradigmatically postal addresses) using HMMs. Tokens are generalized within a specified hierarchy to find the most accurate level of generalization. An outer HMM is trained on the ordering of elements within an instance; the ordering of tokens within each element is modeled by an inner HMM, one per element type. A partial database of geographical entities is also used, since such entities interdepend (e.g. there is a Pittsburg(h) in PA and in CA, but not in WY). The authors present a modification to Viterbi that forces zero probability for sequences that are invalid with respect to these geographic facts. Their trained model outperformed plain HMM baselines and rule-based systems.
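
For concreteness, here is a minimal sketch of the token-generalization idea, using an invented two-level hierarchy rather than the paper's actual taxonomy; the function name and symbol classes are my own, not DATAMOLD's.

 def generalize(token, level):
     """Map a token to its symbol at a given level of an invented hierarchy.
     Level 0 is the raw token; higher levels are progressively coarser."""
     if level == 0:
         return token                          # exact surface form
     if level == 1:
         if token.isdigit():
             return "<digits:%d>" % len(token)  # e.g. all 5-digit numbers collapse together
         return "<word>" if token.isalpha() else "<mixed>"
     return "<any>"                             # most general class
 
 # Emission counts can then be collected at whichever level generalizes best
 # for a state, instead of always using raw tokens.
 print(generalize("15213", 1))   # <digits:5>
 print(generalize("Forbes", 1))  # <word>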

Commentary

The authors repeatedly cast the Asian address dataset as more complex than the US address dataset, but they do not explain the nature of the difference. It could be that the bank's customers reside in various countries, which would suggest the approach has trouble with heterogeneous data. Alternatively, US addresses could simply be unusually homogeneous, and thus a relatively easy problem. Specifically, I wonder whether a model trained on both datasets together would still do better on the US addresses simply because they are easier; I suspect it would (especially since the rule-based Rapier system also did quite well on the US data).

The modification to Viterbi seems a little ad hoc and complicated. They state that it is optimal, and I am inclined to believe them, but I'm not convinced that it preserves the time complexity of the elegant standard Viterbi. I'd worry about all the backtracking, though perhaps it rarely comes up in practice.
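
To make the concern concrete, here is a simplified sketch of the "hard zero" idea: an ordinary Viterbi pass in which transitions flagged invalid get probability zero and so can never appear on the decoded path. The function and the `allowed` mask are my own invention and only encode pairwise constraints; the paper's database constraints span whole segments, which is presumably where the backtracking (and my complexity worry) comes in.

 import numpy as np
 
 def constrained_viterbi(obs, start, trans, emit, allowed):
     """Viterbi decoding with hard constraints: transitions where
     allowed[i, j] == 0 can never appear on the decoded path."""
     T, n = len(obs), trans.shape[0]
     delta = np.zeros((T, n))            # best score of a path ending in each state
     back = np.zeros((T, n), dtype=int)  # backpointers for path recovery
     delta[0] = start * emit[:, obs[0]]
     for t in range(1, T):
         for j in range(n):
             scores = delta[t - 1] * trans[:, j] * allowed[:, j]
             back[t, j] = int(np.argmax(scores))
             delta[t, j] = scores[back[t, j]] * emit[j, obs[t]]
     path = [int(np.argmax(delta[-1]))]
     for t in range(T - 1, 0, -1):
         path.append(int(back[t, path[-1]]))
     return list(reversed(path))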

Their use of outer and inner HMMs seems equivalent to a single larger HMM with more states, but I like the separate training steps; it seems much clearer both for presentation and for analyzing errors after the fact.
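
As a toy illustration of that equivalence (with invented element and state names, not the paper's), each (element, inner-state) pair can be read as one state of a single flat HMM:

 # Invented element types and inner states, purely for illustration.
 outer_states = ["HouseNo", "Street", "City"]
 inner_states = {"HouseNo": ["h0"], "Street": ["s0", "s1"], "City": ["c0"]}
 
 # Each (element, inner-state) pair becomes one state of a flat HMM.
 flat_states = [(e, q) for e in outer_states for q in inner_states[e]]
 print(flat_states)
 # Within-element transitions come from that element's inner HMM; transitions
 # from one element's end state to another element's start state come from the
 # outer HMM. Training the two levels separately just factors the same
 # parameters, which is indeed easier to read and to debug.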