Selen writeup of Borkar et al.

This is a review of Borkar_2001_Automatic_Segmentation_of_Text_Into_Structured_Records by user:Selen.

In this paper, the authors adapt HMMs to segment unformatted text, such as postal addresses and citation records, into structured elements.

They propose a nested HMM: an outer HMM whose states correspond to the elements and, for each element, an inner HMM that captures the characteristics of that element.

They also introduce a hierarchy for grouping symbols: for example, single characters and multi-letter tokens are grouped into words, digits are grouped into numbers, and so on.
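To make the grouping idea concrete, here is a minimal sketch of how a token could be mapped to increasingly general symbols. The level names and the `generalize` function are my own assumptions for illustration; they are not the exact taxonomy or code from the paper.

```python
from typing import List


def generalize(token: str) -> List[str]:
    """Return the chain of symbols for a token, most specific first.

    The level names below are hypothetical; they only illustrate the idea
    of grouping raw tokens into broader classes.
    """
    chain = [token]
    if token.isdigit():
        # Digits are grouped by length, then into the general "number" class.
        chain.append(f"{len(token)}-digit number")
        chain.append("number")
    elif token.isalpha():
        # Single characters and multi-letter tokens are both kinds of words.
        chain.append("single letter" if len(token) == 1 else "multi-letter word")
        chain.append("word")
    else:
        chain.append("delimiter/other")
    chain.append("any")
    return chain


if __name__ == "__main__":
    for tok in ["560034", "A", "Street", ","]:
        print(tok, "->", generalize(tok))
```

Choosing which level of such a chain each state should emit is the step the paper calls feature selection.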

What I like about this paper is that, in theory, a nested HMM is a good idea for tasks like this, and imposing a hierarchy on tokens is another good idea.

What I don't like about this paper is that the inner HMM is not a real HMM and the hierarchical feature selection is not real feature selection. The inner HMM looks more like a deterministic finite automaton, and it lacks the transitions to the observed sequence. The feature selection is more like imposing a group structure than actually selecting the features that have the biggest impact on the overall outcome. I think that is why the results change only slightly when they apply feature selection.

The transitions to the observed sequence are the emission table, which is part of the model (but not the picture). The state graph without emissions is a fairly standard way of displaying HMMs. - Wcohen 14:23, 24 September 2009 (UTC)
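As a small illustration of the point about emissions, the sketch below defines a toy HMM in which the state graph (`start`, `trans`) and the emission table (`emit`) are separate parts of the model. All state names, symbol names, and probabilities are made up for this example and are not taken from the paper.

```python
# Toy HMM: states are address elements, observations are token classes.
# Every number here is a hypothetical value chosen for illustration.
start = {"HouseNo": 0.8, "Street": 0.2}

# State graph: what a typical HMM picture shows.
trans = {
    "HouseNo": {"HouseNo": 0.1, "Street": 0.9},
    "Street": {"HouseNo": 0.0, "Street": 1.0},
}

# Emission table: part of the model, but usually left out of the picture.
emit = {
    "HouseNo": {"number": 0.9, "word": 0.1},
    "Street": {"number": 0.2, "word": 0.8},
}


def joint_probability(states, observations):
    """P(states, observations) under the toy HMM above."""
    p = start[states[0]] * emit[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        p *= trans[prev][cur] * emit[cur][obs]
    return p


if __name__ == "__main__":
    path = ["HouseNo", "Street", "Street"]
    tokens = ["number", "word", "word"]
    print(joint_probability(path, tokens))
```

Without `emit`, the same graph would only say which element can follow which; the emission table is what ties each state to the observed tokens.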

Another issue is that when they compare their results to the baseline method, they don't report the F-measure, which would be useful since one method is high in precision but low in recall, while the other is high in recall but low in precision.
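For reference, the F-measure is the weighted harmonic mean of precision and recall, and it gives a single number for exactly this kind of trade-off. The sketch below uses made-up precision and recall values, not the numbers reported in the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical systems: one favors precision, the other favors recall.
    print("high precision, low recall:", round(f_measure(0.9, 0.5), 3))
    print("high recall, low precision:", round(f_measure(0.6, 0.8), 3))
```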
