Borkar 2001 Automatic Segmentation of Text Into Structured Records write up

The task is to parse postal addresses into meaningful pieces such as house number, building, street, zip code, etc.
The paper seems pretty thorough. They provide a comprehensive and convincing evaluation and show that the method they propose would work well.
I do not agree with their explanation of "inner HMMs". Based on my reading, they have introduced parallel paths to accommodate the varying lengths of road names, building names, etc. But I do not see the "hidden" part of the Markov model here. Each inner HMM knows for sure which class the observations it emits belong to. For example, an inner HMM for building names knows for sure that it is going to capture words that are building names.
- It's true, the states are not hidden at training time - only at test time. - Wcohen 14:14, 24 September 2009 (UTC)
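
To make the point about training time concrete: when every word in the training data carries its class label, the state sequence is fully observed, so transition probabilities can be estimated by simple counting rather than by EM. The following is a minimal sketch under that assumption. The function name and the toy label sequences are hypothetical, and it applies no smoothing, which a real system would likely need.

```python
from collections import Counter, defaultdict

def estimate_transitions(labeled_sequences):
    """Maximum-likelihood transition estimates from fully labeled sequences.

    labeled_sequences: list of lists of class labels, one label per word.
    Returns a dict: previous label -> {next label: probability}.
    """
    counts = defaultdict(Counter)
    for seq in labeled_sequences:
        # Count each adjacent (previous, next) label pair.
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        probs[prev] = {nxt: n / total for nxt, n in nxt_counts.items()}
    return probs

# Hypothetical toy data: per-word labels for two addresses.
toy = [
    ["House", "Road", "City"],
    ["House", "Road", "Road", "City"],
]
probs = estimate_transitions(toy)
```

At test time the labels are absent, so the most likely state sequence has to be inferred (for example with Viterbi decoding), which is where the states become hidden.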

I appreciate the hierarchical feature selection they performed. Using a decision-tree-like learner to learn the kind of features the text should be mapped to is a principled way to approach the problem.

Correction: I think they could have explained their model by stating that they are creating a new set of states, where each state is represented by a pair (class-name, ith word captured). This would have made things much cleaner.
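
The pair formulation suggested above can be sketched as follows. This is my own illustration of the flattened state space, not the paper's exact inner-HMM topology. The class names, the per-class maximum lengths, and the function names are all hypothetical.

```python
def build_flat_states(max_len):
    """Enumerate states as (class name, position) pairs.

    max_len: dict mapping class name -> assumed maximum number of words.
    """
    return [(cls, i) for cls, n in max_len.items() for i in range(1, n + 1)]

def allowed_transitions(max_len):
    """Edges of the flat model: advance within a class, or start another class."""
    edges = set()
    for cls, i in build_flat_states(max_len):
        if i < max_len[cls]:
            edges.add(((cls, i), (cls, i + 1)))    # next word of the same element
        for other in max_len:
            if other != cls:
                edges.add(((cls, i), (other, 1)))  # first word of a new element
    return edges

def labels_to_states(labels):
    """Convert per-word class labels into (class, position) states."""
    states, prev, pos = [], None, 0
    for lab in labels:
        pos = pos + 1 if lab == prev else 1
        states.append((lab, pos))
        prev = lab
    return states

# Hypothetical example: a one-word house number and a road name of up to two words.
max_len = {"House": 1, "Road": 2}
flat = build_flat_states(max_len)
edges = allowed_transitions(max_len)
path = labels_to_states(["House", "Road", "Road", "City"])
```

In this view the varying lengths of an element are handled by the position index, and the whole model is an ordinary flat HMM over the pair states.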
