Yandongl writeup of Borkar 2001

From Cohen Courses

This is a review of Borkar 2001 Automatic Segmentation of Text Into Structured Records by user:Yandongl.

This is an interesting paper because the idea of nested HMMs was really new at the time. The goal is automatic segmentation of unstructured text into structured records (or rather semi-structured text, since the rough structure is known beforehand). Two types of datasets are used: postal addresses and bibliography records. The rest is standard HMM training and testing, nothing special.
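To make the "standard HMM" part concrete, here is a minimal sketch of Viterbi decoding used for segmentation. The states, probabilities, and the `floor` smoothing constant are toy values I made up for illustration; they are not taken from the paper.

```python
import math

def viterbi(tokens, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most likely state (element label) for each token."""
    def lp(p):
        # Log-probability, with a small floor standing in for smoothing.
        return math.log(p if p > 0 else floor)

    # best[i][s] = (log-prob of the best path ending in s at token i, previous state)
    best = [{s: (lp(start_p.get(s, 0)) + lp(emit_p[s].get(tokens[0], 0)), None)
             for s in states}]
    for tok in tokens[1:]:
        row = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + lp(trans_p[p].get(s, 0)), p) for p in states
            )
            row[s] = (score + lp(emit_p[s].get(tok, 0)), prev)
        best.append(row)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]

# Toy address model (hypothetical numbers).
states = ["house_no", "street", "city"]
start_p = {"house_no": 0.8, "street": 0.2}
trans_p = {
    "house_no": {"street": 1.0},
    "street": {"street": 0.5, "city": 0.5},
    "city": {"city": 1.0},
}
emit_p = {
    "house_no": {"221": 0.5, "10": 0.5},
    "street": {"baker": 0.4, "street": 0.4, "main": 0.2},
    "city": {"london": 0.5, "new": 0.25, "york": 0.25},
}

labels = viterbi(["221", "baker", "street", "london"],
                 states, start_p, trans_p, emit_p)
```

Each token gets one element label, and consecutive tokens with the same label form one field of the record.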

What's interesting is the nested model, which handles elements that span a varying number of tokens, such as "New York": the outer HMM captures the whole element as a single state, and an inner HMM models the tokens inside it. But since the authors split the data again at this finer level, the training set for each path in the inner HMM becomes even smaller. To solve this, some paths are merged (those with the same end/beginning element).
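A rough sketch of why the inner paths end up with little training data, assuming the inner HMM has one parallel path per distinct element length. The function name `build_inner_hmm` and the example segments are my own; path merging is not shown, since I am not reproducing the paper's exact procedure.

```python
from collections import Counter, defaultdict

def build_inner_hmm(segments):
    """Estimate a parallel-path inner HMM for one element.

    segments: list of token lists, each a training instance of the element.
    Returns (path_prob, emit): path_prob[n] is the probability of entering
    the n-token path, and emit[(n, i)] is the emission distribution of
    position i on that path.
    """
    length_counts = Counter(len(seg) for seg in segments)
    total = sum(length_counts.values())
    path_prob = {n: c / total for n, c in length_counts.items()}

    # Each position on each path only sees segments of that exact length.
    counts = defaultdict(Counter)
    for seg in segments:
        for i, tok in enumerate(seg):
            counts[(len(seg), i)][tok] += 1

    emit = {
        key: {tok: c / sum(ctr.values()) for tok, c in ctr.items()}
        for key, ctr in counts.items()
    }
    return path_prob, emit

# Hypothetical training instances of the "city" element.
city_segments = [["london"], ["paris"], ["new", "york"], ["san", "jose"]]
path_prob, emit = build_inner_hmm(city_segments)
```

With four training cities, each inner state is estimated from only two tokens, which is the sparsity that merging paths is meant to relieve.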

Feature selection: a hierarchical taxonomy of features is used, so that tokens can be generalized to a coarser level, because finer categorization sometimes hurts performance.
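As an illustration of the taxonomy idea, the sketch below maps a token to progressively coarser categories. The category names and levels here are hypothetical and chosen by me; the paper's actual taxonomy may differ.

```python
def taxonomy_path(token):
    """Return the token's categories from most specific to most general."""
    if token.isdigit():
        return [token, f"{len(token)}-digit", "number", "all"]
    if token.isalpha():
        kind = "single-letter" if len(token) == 1 else "multi-letter"
        return [token, kind, "word", "all"]
    # Punctuation and mixed tokens share one catch-all category here.
    return [token, "other", "all"]

def generalize(token, level):
    """Replace a token by its ancestor at the given level (0 = the token itself)."""
    path = taxonomy_path(token)
    return path[min(level, len(path) - 1)]

# The same tokens viewed at level 1 of the taxonomy.
coarse = [generalize(t, 1) for t in ["221", "baker", "street", ","]]
```

Emitting the coarser symbol instead of the raw token pools the counts of many rare tokens, which is why a less fine-grained level can work better on small training sets.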

Experiments show that DATAMOLD outperforms all the others, including the naive HMM and the independent HMMs, but in quite a few cases the margin over the naive HMM is not significant. I also believe the nested HMM is much more time-consuming than the naive HMM, so whether it brings much benefit remains doubtful.