Mnduong writeup of Borkar et al.

This is a review of Borkar_2001_Automatic_Segmentation_of_Text_Into_Structured_Records by user:mnduong.

  • This paper presented a method that used HMMs to segment text into structured elements. The HMM was not a "naive" one with a single state per target element, but a nested one: the outer HMM's states represented the elements, and an inner HMM for each state modeled the structure inside that element. The inner HMMs were trained together with the neighboring elements rather than in isolation.
  • The system assumed a small amount of training data, which was used to train the model's parameters.
  • The system also organized the tokens into a hierarchy, from the most general class (All) to the most specific (the token's identity). A validation set was used to choose how far down each branch of the hierarchy to descend.
  • The model used a modified version of the Viterbi algorithm that could incorporate additional constraints derived from a partial database.
  • The method was evaluated on two tasks: segmenting address data and segmenting bibliography data. It compared favorably against the baselines, which included a naive HMM with one state per element, a rule-based method (Rapier), and an "independent" HMM whose inner HMMs were trained independently of the other elements.
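
To make the constrained-decoding point concrete, here is a minimal sketch of Viterbi decoding with a constraint hook. It is not the paper's algorithm: the `allowed(t, state)` predicate, the three toy states, the two token classes ("digits", "word") and all probabilities are assumptions of mine, chosen only to show how outside knowledge (for example, a database lookup that recognizes a token as a city) can rule out states at a given position.

```python
import math

NEG_INF = float("-inf")


def constrained_viterbi(obs, states, log_start, log_trans, log_emit, allowed=None):
    """Most likely state path for obs, restricted to states where allowed(t, s) is True.

    allowed is a hypothetical hook standing in for external constraints;
    with allowed=None this is ordinary log-space Viterbi.
    """
    if allowed is None:
        allowed = lambda t, s: True

    # score[t][s]: best log-probability of any permitted path ending in s at time t
    score = [{s: (log_start[s] + log_emit[s].get(obs[0], NEG_INF))
                 if allowed(0, s) else NEG_INF
              for s in states}]
    back = [{s: None for s in states}]

    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            if not allowed(t, s):
                # Constraint violated: this state cannot be used at position t
                score[t][s] = NEG_INF
                back[t][s] = None
                continue
            prev = max(states, key=lambda p: score[t - 1][p] + log_trans[p][s])
            score[t][s] = (score[t - 1][prev] + log_trans[prev][s]
                           + log_emit[s].get(obs[t], NEG_INF))
            back[t][s] = prev

    last = max(states, key=lambda s: score[-1][s])
    if score[-1][last] == NEG_INF:
        raise ValueError("no state path satisfies the constraints")

    # Follow the back-pointers from the best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def logs(probs):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in probs.items()}


# Toy address model (all numbers are made up for illustration)
states = ["number", "street", "city"]
log_start = logs({"number": 0.8, "street": 0.1, "city": 0.1})
log_trans = {
    "number": logs({"number": 0.1, "street": 0.8, "city": 0.1}),
    "street": logs({"number": 0.1, "street": 0.6, "city": 0.3}),
    "city": logs({"number": 0.1, "street": 0.1, "city": 0.8}),
}
log_emit = {
    "number": logs({"digits": 0.9, "word": 0.1}),
    "street": logs({"digits": 0.1, "word": 0.9}),
    "city": logs({"digits": 0.1, "word": 0.9}),
}
obs = ["digits", "word", "word"]

free_path = constrained_viterbi(obs, states, log_start, log_trans, log_emit)

# Pretend a partial database recognized the last token as a city name
db_path = constrained_viterbi(
    obs, states, log_start, log_trans, log_emit,
    allowed=lambda t, s: t != 2 or s == "city",
)
print(free_path, db_path)
```

With these toy numbers the unconstrained decoder labels both words as street, while the constrained run is forced to end in city. Masking states out with a score of negative infinity is only one simple way to apply a constraint; the paper's modification may work differently.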

What I Did Not Like:

  • The way the inner HMMs were trained wasn't discussed clearly enough.
  • The paper also did not specify the heuristics that were used to prune the feature selection tree.

What I Liked:

  • I liked the fact that feature selection was done in a very principled way.
  • Except for the two points mentioned above, the paper gave a thorough and detailed discussion of every aspect of the method.