Selen writeup of Borkar et al.

This is a review of Borkar_2001_Automatic_Segmentation_of_Text_Into_Structured_Records by user:Selen.

In this paper, the authors modify HMMs to segment unformatted text, such as addresses and citation records, into structured elements.

They propose a nested HMM: an outer HMM over the elements and, for each element, an inner HMM that captures the characteristics of that element.

They also introduce a hierarchy for grouping symbols: single characters and multi-letter tokens are grouped into words, digit strings are grouped into numbers, and so on.
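A minimal sketch of what such a symbol hierarchy could look like, to make the grouping concrete. The function name `generalize` and the group labels are my own illustration, not the paper's exact taxonomy; the only idea taken from the paper is that each token can be replaced by a more general group.

```python
def generalize(token):
    """Return the token followed by increasingly general groups.

    Group names here are hypothetical labels chosen for illustration.
    """
    if token.isdigit():
        # e.g. "560" -> 3-digit number -> number -> any
        return [token, f"{len(token)}-digit number", "number", "any"]
    if token.isalpha():
        kind = "single char" if len(token) == 1 else "multi-letter"
        return [token, kind, "word", "any"]
    # everything else (punctuation, mixed tokens) is treated as a delimiter
    return [token, "delimiter", "any"]


for tok in ["560", "B", "Street", ","]:
    print(tok, "->", generalize(tok))
```

Choosing how far up this list to go for each token is the step the paper calls feature selection.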

What I like about this paper is that, in theory, a nested HMM is a good idea for tasks like this, and bringing a hierarchy to the tokens is another good idea.

What I don't like about this paper is that the inner HMM is not a real HMM and the hierarchical feature selection is not real feature selection. The inner HMM looks more like a deterministic finite automaton, and it lacks the transitions to the observed sequence. The feature selection is more like providing a group structure than actually selecting the features that have the biggest impact on the overall outcome. I think that's why they see only a slight change in results when they apply feature selection.

  • The transitions to the observed sequence are the emission table, which is part of the model (but not the picture). The state graph without emissions is a fairly standard way of displaying HMMs. - Wcohen 14:23, 24 September 2009 (UTC)

Another issue is that when they compare their results to the baseline method, they don't report the F-measure, which would be useful since one method has higher precision but lower recall, while the other has higher recall but lower precision.
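For reference, a short sketch of the balanced F-measure (F1), the harmonic mean of precision and recall. The two precision/recall pairs below are hypothetical numbers chosen only to show a precision-heavy and a recall-heavy system side by side; they are not results from the paper.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical systems, not figures from the paper.
system_a = f_measure(0.9, 0.6)  # high precision, lower recall
system_b = f_measure(0.7, 0.8)  # lower precision, higher recall
print(round(system_a, 3), round(system_b, 3))
```

A single number like this makes it possible to rank two systems whose precision and recall point in opposite directions.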