Grenager et al, ACL 2005: Unsupervised Learning of Field Segmentation Models for Information Extraction
T. Grenager, D. Klein and C. Manning. Unsupervised Learning of Field Segmentation Models for Information Extraction, Proceedings of the 43rd Annual Meeting of the ACL, pp. 371-378, Ann Arbor, June 2005.
This paper addresses the task of field segmentation, i.e. segmenting a document into fields and labeling them. Specifically, the paper investigates two domains: bibliographic citations and classified advertisements for apartment rentals. A hidden Markov model (HMM) is used as the modeling tool.
The use of supervised learning methods is limited by the diversity of domains and the lack of domain-specific labeled training data. Therefore the authors focus on unsupervised learning.
Challenge of Unsupervised Learning
Purely unsupervised learning without any constraints does not yield meaningful segmentation results. This is because documents contain multiple levels of structure: the desired field structure, as well as lower-level part-of-speech (POS) structure. Unconstrained unsupervised learning usually learns a mixture of these levels. To steer the model toward the desired level of structure, it is necessary to constrain it with prior knowledge.
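The first constraint imposed on the model is a fixed transition matrix with a dominant diagonal: every state has the same large self-transition probability, and the remaining probability mass is spread uniformly over transitions to the other states. The dominant diagonal discourages transitions between states, which results in longer fields (the expected field length grows as the self-transition probability approaches 1) and therefore makes the model better suited to learning high-level structure.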
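As a minimal sketch of such a matrix, the Python below assumes one particular parameterization: with probability sigma the chain stays in its current state, and with probability 1 - sigma it moves to a state drawn uniformly from all n_states (the current one included). The names sigma, n_states, and both function names are illustrative assumptions, not notation taken from the paper.

```python
def diagonal_transition_matrix(n_states, sigma):
    """Build a fixed transition matrix with a dominant diagonal.

    Assumed parameterization (for illustration only):
      diagonal entries     = sigma + (1 - sigma) / n_states
      off-diagonal entries = (1 - sigma) / n_states
    """
    if n_states < 2 or not 0.0 <= sigma < 1.0:
        raise ValueError("need n_states >= 2 and 0 <= sigma < 1")
    off_diagonal = (1.0 - sigma) / n_states
    return [
        [off_diagonal + (sigma if i == j else 0.0) for j in range(n_states)]
        for i in range(n_states)
    ]


def expected_field_length(transitions, state=0):
    """Mean number of consecutive tokens spent in `state`.

    The run length is geometric, so its mean is 1 / (1 - p_stay),
    where p_stay is the self-transition probability of the state.
    """
    p_stay = transitions[state][state]
    return 1.0 / (1.0 - p_stay)


matrix = diagonal_transition_matrix(n_states=4, sigma=0.9)
for row in matrix:
    print([round(p, 3) for p in row])
print("expected field length:", round(expected_field_length(matrix), 2))
```

Under this assumed form, raising sigma toward 1 lengthens the expected field, while every off-diagonal entry in a row stays equal.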
Some other undesirable phenomena observed in the output of unconstrained unsupervised learning are also addressed. One of these problems is that the emission distributions of the states are polluted by punctuation and function words devoid of content, and some states are even devoted to such non-content words. To address this problem, a "common word distribution" for non-content words is provided (either given or learned), and the emission distributions of the states are modeled as a mixture of the shared "common word distribution" and state-specific distributions (called "hierarchical mixture emission models"). The state-specific distributions can then be devoted to modeling the content words.
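The mixture emission idea can be sketched as follows. This is only an illustration: the single mixing weight alpha shared by all states, the toy distributions, and the function name are assumptions rather than details taken from the paper.

```python
def mixture_emission(word, state_dist, common_dist, alpha):
    """Emission probability as a two-component mixture.

    P(word | state) = alpha * P_common(word) + (1 - alpha) * P_state(word)

    `alpha` is the weight of the shared common-word distribution
    (assumed here to be one global value).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    p_common = common_dist.get(word, 0.0)
    p_state = state_dist.get(word, 0.0)
    return alpha * p_common + (1.0 - alpha) * p_state


# Toy distributions (hypothetical values for illustration).
common_words = {",": 0.4, ".": 0.3, "the": 0.3}
author_words = {"smith": 0.5, "jones": 0.5}

vocabulary = set(common_words) | set(author_words)
for word in sorted(vocabulary):
    print(word, mixture_emission(word, author_words, common_words, alpha=0.25))
```

Because the common component absorbs punctuation and function words, the state-specific component can put all of its mass on content words such as author names.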
Another problem with unconstrained unsupervised learning is that the model is not good at identifying field boundaries, even though they are often clearly marked by tokens such as punctuation. To achieve better accuracy at field boundaries, the model is enriched so that each field is modeled by two states: a non-final state, which has an emission distribution as before, and a final state, which emits boundary tokens according to a shared "boundary token distribution".
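The two-states-per-field construction can be sketched like this. The representation of a state as a (field, is_final) pair, the function names, and the toy distributions are assumptions made for illustration; the sketch covers only the state expansion and the tying of final-state emissions, not the transition structure.

```python
from typing import Dict, List, Tuple

State = Tuple[str, bool]  # (field name, is_final)


def expand_states(fields: List[str]) -> List[State]:
    """Give every field a non-final state and a final state."""
    return [(field, is_final) for field in fields for is_final in (False, True)]


def emission_table(
    fields: List[str],
    field_dists: Dict[str, Dict[str, float]],
    boundary_dist: Dict[str, float],
) -> Dict[State, Dict[str, float]]:
    """Map each state to its emission distribution.

    Non-final states keep their field-specific distribution; all final
    states share the same boundary-token distribution object.
    """
    table = {}
    for field, is_final in expand_states(fields):
        table[(field, is_final)] = boundary_dist if is_final else field_dists[field]
    return table


# Toy inputs (hypothetical values for illustration).
fields = ["AUTHOR", "TITLE"]
field_dists = {
    "AUTHOR": {"smith": 0.6, "jones": 0.4},
    "TITLE": {"learning": 0.5, "models": 0.5},
}
boundary_dist = {".": 0.7, ",": 0.3}

table = emission_table(fields, field_dists, boundary_dist)
for state, dist in table.items():
    print(state, dist)
```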
The paper also briefly investigates semi-supervised learning, i.e. augmenting the unlabeled training data with a small number of labeled examples.
For the bibliographic citation domain, the dataset from McCallum et al, IJCAI 1999 is used. It consists of 500 annotated citations, split into 300 training, 100 development, and 100 testing examples.
For the classified advertisement domain, the dataset consists of 8,767 classified advertisements for apartment rentals in the San Francisco Bay Area, downloaded in June 2004 from the Craigslist website, with 302 of them annotated. It is split into 102 labeled training, 100 development, 100 testing, and 8,465 unlabeled training examples.
The evaluation criterion used is per-token accuracy, following McCallum et al, IJCAI 1999. This is not a very standard criterion because "it leads to a lower penalty for boundary errors, and allows long fields to contribute more to accuracy than short ones."
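Per-token accuracy is simple to compute, and a small example shows why boundary errors are penalized lightly. The function name and the toy label sequences below are assumptions for illustration.

```python
def per_token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted field label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# A citation with a 3-token AUTHOR field followed by a 7-token TITLE field.
gold = ["AUTHOR"] * 3 + ["TITLE"] * 7
# The prediction places the field boundary one token too late.
predicted = ["AUTHOR"] * 4 + ["TITLE"] * 6

print(per_token_accuracy(gold, predicted))  # 0.9
```

Here the misplaced boundary costs only one token out of ten, and the long TITLE field accounts for most of the score.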
The main results are summarized in Table 1. The "baseline" refers to assigning all tokens to the most frequent field. "Segment and cluster" refers to segmenting the document crudely at punctuation marks and clustering the segments using unsupervised Naive Bayes.
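The two simple reference systems can be sketched roughly as follows. This covers only the most-frequent-field baseline and the crude segmentation step, not the Naive Bayes clustering. The punctuation set, the choice to keep each punctuation token at the end of its segment, and all function names are assumptions for illustration.

```python
from collections import Counter


def most_frequent_field(labeled_docs):
    """Return the field label that covers the most tokens in the labeled data."""
    counts = Counter(label for doc in labeled_docs for label in doc)
    return counts.most_common(1)[0][0]


def baseline_predict(n_tokens, field):
    """Baseline: assign every token to the same (most frequent) field."""
    return [field] * n_tokens


def segment_at_punctuation(tokens, punctuation=frozenset({",", ".", ";", ":"})):
    """Split a token sequence after each punctuation token (assumed set)."""
    segments, current = [], []
    for token in tokens:
        current.append(token)
        if token in punctuation:
            segments.append(current)
            current = []
    if current:  # trailing tokens with no closing punctuation
        segments.append(current)
    return segments


# Toy data (hypothetical, for illustration).
labeled = [["AUTHOR", "TITLE", "TITLE"], ["TITLE", "DATE"]]
field = most_frequent_field(labeled)
print(baseline_predict(4, field))

tokens = ["T.", "Grenager", ",", "Field", "Segmentation", ".", "2005"]
print(segment_at_punctuation(tokens))
```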
It can be seen that the diagonal transition matrix boosts accuracy significantly, to a level close to that of supervised training. Hierarchical mixture emission models and boundary models produce marginal improvements for the advertisement domain, but their effect on the citation domain is unclear.
The diagonal transition matrix, while encouraging the model to stay in the same state, makes the probability distribution over transitions to other states uniform. This makes the model unable to capture sequential relationships between fields such as "AUTHOR is usually followed by TITLE". Instead of using a fixed transition matrix, it may be better to impose a prior that heavily penalizes large off-diagonal elements but still keeps the matrix elements as free parameters.
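One possible way to realize such a prior, sketched here under stated assumptions, is to add pseudo-counts when re-estimating each row of the transition matrix: a large pseudo-count on the diagonal and a small one elsewhere, which corresponds to a MAP estimate under a Dirichlet prior. The pseudo-count values, the toy expected counts, and the function name are all hypothetical.

```python
def map_transition_matrix(expected_counts, diag_pseudo=50.0, off_pseudo=0.1):
    """Re-estimate transition rows from expected counts plus pseudo-counts.

    A large diagonal pseudo-count favors self-transitions, while the
    off-diagonal entries remain free to reflect the observed counts.
    """
    matrix = []
    for i, row in enumerate(expected_counts):
        smoothed = [
            count + (diag_pseudo if i == j else off_pseudo)
            for j, count in enumerate(row)
        ]
        total = sum(smoothed)
        matrix.append([value / total for value in smoothed])
    return matrix


# Toy expected transition counts for states AUTHOR, TITLE, DATE
# (hypothetical values, e.g. as an E-step might produce).
counts = [
    [5.0, 4.0, 0.0],  # AUTHOR is mostly followed by TITLE
    [0.0, 6.0, 3.0],  # TITLE is mostly followed by DATE
    [1.0, 0.0, 7.0],
]

for row in map_transition_matrix(counts):
    print([round(p, 3) for p in row])
```

In this sketch the diagonal stays dominant, yet the AUTHOR row still assigns more probability to TITLE than to DATE, which a fixed uniform off-diagonal cannot express.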