Philgoo Han writeup of Borkar, Deshmukh and Sarawagi
This is a review of Borkar_2001_Automatic_Segmentation_of_Text_Into_Structured_Records by user:Ironfoot.
This paper is about using Hidden Markov Models for text segmentation, particularly of address and bibliography records. The best part of the proposed system, DATAMOLD, is that it works with very small training sets. To reach that goal the authors apply a trick to the naive HMM: nesting the HMM structure. This seems to be the main reason it overcomes the high variance of small data sets (together with smoothing, of course).
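To make the nesting idea concrete, here is a minimal sketch of how an outer HMM over record elements can be expanded into a flat state space when every element holds its own inner chain of states. The function name `flatten_nested_hmm`, the element names, the chain lengths, and all probabilities are illustrative assumptions, and the single left-to-right inner chain is a simplification rather than the inner structure the paper actually uses.

```python
from collections import defaultdict


def flatten_nested_hmm(outer_trans, inner_len):
    """Expand an outer HMM whose states each hold a left-to-right inner chain.

    outer_trans: {element: {next_element: prob}} (toy numbers, assumed)
    inner_len:   {element: number of inner states in its chain} (assumed)
    Returns flat transitions keyed by (element, position).
    """
    flat = defaultdict(dict)
    for elem, n in inner_len.items():
        for pos in range(n):
            if pos < n - 1:
                # Inside an element: always advance to the next inner state.
                flat[(elem, pos)][(elem, pos + 1)] = 1.0
            else:
                # Leaving an element: follow the outer transition into the
                # first inner state of the next element.
                for nxt, p in outer_trans.get(elem, {}).items():
                    flat[(elem, pos)][(nxt, 0)] = p
    return dict(flat)


# Hypothetical address model: a street name may span two tokens.
outer = {
    "house_no": {"street": 1.0},
    "street": {"city": 0.8, "street": 0.2},
    "city": {},
}
lengths = {"house_no": 1, "street": 2, "city": 1}
flat = flatten_nested_hmm(outer, lengths)
```

The outer transitions only have to be estimated between elements, while each inner chain models the tokens within one element, which is one plausible reading of why the nested structure needs less training data than a single large flat HMM.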
One more distinctive feature is that they use a hierarchical taxonomy model to tune accuracy. This did improve the results, but it was very ad hoc.
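As a rough illustration of what a hierarchical taxonomy over tokens could look like, the sketch below maps a token to a symbol at a chosen level of generality. The function `generalize`, the four levels, and the class names are assumptions made for this example, not the taxonomy defined in the paper.

```python
def generalize(token, level):
    """Map a token to a symbol at a given taxonomy level (all names assumed).

    level 0: the raw token, lowercased
    level 1: a fine-grained class such as "5-digit" or "cap-word"
    level 2: a coarse class: "number", "word" or "delimiter"
    level 3 and above: a single root symbol "all"
    """
    if level == 0:
        return token.lower()
    if token.isdigit():
        fine, coarse = f"{len(token)}-digit", "number"
    elif token.isalpha():
        fine = "cap-word" if token[0].isupper() else "word"
        coarse = "word"
    else:
        fine, coarse = "delimiter", "delimiter"
    return {1: fine, 2: coarse}.get(level, "all")


# The same address tokens seen at two different levels of the hierarchy.
tokens = ["5000", "Forbes", "Ave", ",", "15213"]
fine_view = [generalize(t, 1) for t in tokens]
coarse_view = [generalize(t, 2) for t in tokens]
```

Emitting coarser symbols leaves fewer distinct observations to estimate from a small training set, at the cost of discarding detail, so the level has to be picked by some tuning procedure.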
It seems that in many cases applying hierarchy or nesting to the model structure opens up a new opportunity for improvement.