Apappu writeup of Borkar et al.

From Cohen Courses

This is a review of the paper Borkar_2001_Automatic_Segmentation_of_Text_Into_Structured_Records by user:apappu.

  • The goal is to automatically segment text into structured elements such as house numbers and street names.
  • They start with a baseline, a naive HMM, which fails to capture the sequential relationships among tokens.
  • Building on the drawbacks of the naive model, the authors propose a two-level HMM for this task: an inner HMM (represented as a single state in the outer HMM) captures the internal structure of an element, while the outer HMM captures the ordering relationships between elements.
  • The inner HMM uses parallel paths to handle a variable number of tokens in an element. I am curious to know what exactly makes an inner HMM a learning structure, given that it already knows what class of data it is going to observe?
  • The way the authors addressed the problem of feature selection is interesting, but the results (Figure 13) do not show a substantial improvement in accuracy even though certain features were collapsed/merged. [Side note: a somewhat more recent parsing paper by Dan Klein describes a split-merge approach that iteratively splits grammar symbols based on the "gain" at each level and reports a reduction in parsing time.]
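
To make the parallel-path idea concrete, here is a minimal sketch of one plausible topology for an inner HMM, assuming one chain of states per possible token count from 1 to `max_len`. The function name, the state labels, and the exact wiring are illustrative assumptions of mine and are not taken from the paper.

```python
def parallel_path_topology(max_len):
    """Return (states, edges) for a hypothetical parallel-path inner HMM.

    Assumption: there is one chain per token count 1..max_len, and every
    chain runs from a shared "start" state to a shared "end" state.
    """
    states = ["start", "end"]
    edges = []
    for length in range(1, max_len + 1):
        # Chain of `length` states, one state per token position.
        path = [f"p{length}_{i}" for i in range(1, length + 1)]
        states.extend(path)
        edges.append(("start", path[0]))   # enter the chain
        edges.extend(zip(path, path[1:]))  # move along the chain
        edges.append((path[-1], "end"))    # leave the chain
    return states, edges


states, edges = parallel_path_topology(3)
print(len(states), len(edges))
```

Under these assumptions each chain of length n contributes n states and n + 1 edges, so the structure stays sparse even as more paths are added.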

[1]

  • I am curious to know whether adding more parallel-path inner HMMs (corresponding to more elements) is computationally expensive?
    • Not really - it adds a few more states but not many edges to the HMM, so inference is still efficient. In principle more data could be needed since the data is fragmented further -- Wcohen 14:12, 24 September 2009 (UTC)
  • I liked the way they coupled higher-level semantic relationship information with Viterbi decoding.
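
For reference, here is a minimal sketch of plain Viterbi decoding over a toy outer HMM. It does not reproduce the paper's variant that brings in the higher-level relationship information. The state names, the two token classes (`digits`, `word`), and all probabilities are made-up values for illustration only.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for obs (log-space Viterbi)."""
    # best[t][s] = (log-probability of the best path ending in s at time t,
    #               previous state on that path)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), None)
             for s in states}]
    for t in range(1, len(obs)):
        row = {}
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            row[s] = (score + math.log(emit_p[s][obs[t]]), prev)
        best.append(row)
    # Follow back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]


# Toy model: all numbers below are hypothetical.
states = ["HouseNo", "Street", "City"]
start_p = {"HouseNo": 0.8, "Street": 0.1, "City": 0.1}
trans_p = {
    "HouseNo": {"HouseNo": 0.1, "Street": 0.8, "City": 0.1},
    "Street": {"HouseNo": 0.05, "Street": 0.5, "City": 0.45},
    "City": {"HouseNo": 0.05, "Street": 0.05, "City": 0.9},
}
emit_p = {
    "HouseNo": {"digits": 0.9, "word": 0.1},
    "Street": {"digits": 0.1, "word": 0.9},
    "City": {"digits": 0.05, "word": 0.95},
}

path = viterbi(["digits", "word", "word"], states, start_p, trans_p, emit_p)
print(path)
```

The inner maximization visits every (previous state, current state) pair, so this dense version costs time proportional to the number of tokens times the square of the number of states. An implementation that iterates only over existing edges would scale with the edge count instead, which is in line with the point above that a few extra states with few edges keep inference efficient.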