Nschneid writeup of Bellare 2009
This is Nschneid's review of Bellare (2009), "Generalized Expectation Criteria for Bootstrapping Extractors Using Record Text Alignment."
This is an interesting paper that addresses the information extraction task (for citations) as follows: Given a small database of similar records, model latent alignments from an input text (the full citation) to the field values of the (unobserved) output record. The existing database thus provides prototype outputs corresponding to possibly unknown inputs. Generalized expectation (GE) criteria are statistical regularities, or soft constraints, that guide the model in this oddly unsupervised scenario. In other words, the "training" information is just some set {y} plus features encoding prior beliefs about what a "good" (x, y) pair would look like.
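To make the setup concrete, here is a made-up illustration of the data regime (my example, not one from the paper): the known side is a database record with field values, and the input side is a raw citation string whose correspondence to the record fields is latent.

```python
# Made-up illustration of the data regime (not from the paper).
# The database contributes records like `record`; the input is a raw
# citation string; the token-to-field alignment between them is latent.
record = {                      # one known output y: field -> value
    "author":    "K. Bellare and A. McCallum",
    "title":     "Generalized Expectation Criteria for Bootstrapping Extractors",
    "booktitle": "EMNLP",
    "year":      "2009",
}
citation_text = (               # one input x, with no token-level labels
    "Bellare, K. and McCallum, A. Generalized expectation criteria for "
    "bootstrapping extractors using record text alignment. In EMNLP, 2009."
)
```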
- The structure of the graphical model resembles IBM Model 1 (Brown et al., 1993) in which each target (record) word is assigned one or more source (text) words. The alignment is generated conditioned on both the record and text sequence, and therefore supports large sets of rich and non-independent features of the sequence pairs. Our model is trained without the need for labeled word alignments by using generalized expectation (GE) criteria (Mann and McCallum, 2008) that penalize the divergence of specific model expectations from target expectations. ... One example global criterion is that “an alignment exists between two orthographically similar words 95% of the time.” ... Another criterion for extraction can be “the word ‘EMNLP’ is always [i.e. 100% of the time] aligned with the record label booktitle”.
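To unpack "penalize the divergence of specific model expectations from target expectations," this is the generic shape of a GE objective as I understand it from Mann and McCallum (2008); the notation below is mine, not the paper's:

$$
\mathcal{O}(\theta) \;=\; -\sum_k \lambda_k \, \Delta\!\Big(\tilde{g}_k,\; \mathbb{E}_{x}\,\mathbb{E}_{p_\theta(z \mid x)}\big[G_k(x, z)\big]\Big) \;-\; \frac{\lVert\theta\rVert^2}{2\sigma^2}
$$

Here $z$ is the predicted structure (an alignment or tag sequence), $G_k$ is a constraint feature (e.g. "this text word is orthographically similar to a record word and is aligned to it"), $\tilde{g}_k$ is its user-specified target expectation (0.95 or 1.0 in the examples above), $\Delta$ is a divergence such as KL or squared distance, and the last term is a Gaussian prior on the weights. No labeled $(x, z)$ pairs are needed, only the targets $\tilde{g}_k$.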
There are two CRFs: a zero–Markov-order alignment CRF (§3.1) which models these alignments, and a linear-chain extraction CRF (§3.3) which models tag bigrams given the citation. The alignment CRF is trained first, and used to compute the marginal distribution over labels for each text citation; this is used (in a not-quite-stacking sort of way) to train the extraction CRF. Training of each CRF is subject to the corresponding set of expectation criteria.
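Here is a minimal sketch of how I picture that two-stage pipeline. All function names (`train_crf_with_ge`, `label_marginals`) are hypothetical placeholders, not anything from the paper or a real library; the point is just the data flow.

```python
# Hypothetical sketch of the two-stage training described above.  None of
# these functions exist in a real library; they only make the data flow
# explicit: alignment CRF -> per-token label marginals -> extraction CRF.

def train_pipeline(citation_texts, db_records, align_constraints, extract_constraints):
    # Stage 1: train the zero-Markov-order alignment CRF with GE criteria
    # over alignments between text tokens and record fields.
    alignment_crf = train_crf_with_ge(
        examples=list(zip(citation_texts, db_records)),
        constraints=align_constraints,   # e.g. "orthographically similar words align 95% of the time"
        structure="zero-order-alignment",
    )

    # Stage 2: for each citation, marginalize over alignments to get a
    # per-token distribution over record labels (author, title, booktitle, ...).
    soft_labels = [
        label_marginals(alignment_crf, text, record)
        for text, record in zip(citation_texts, db_records)
    ]

    # Stage 3: train the linear-chain extraction CRF on the citation texts,
    # using the alignment marginals as soft targets together with the
    # extraction GE criteria (e.g. "'EMNLP' is always labeled booktitle"),
    # rather than gold token labels.
    extraction_crf = train_crf_with_ge(
        examples=citation_texts,
        soft_targets=soft_labels,
        constraints=extract_constraints,
        structure="linear-chain",
    )
    return extraction_crf
```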
Where do the GE criteria come from? Labeled data, though only the expectations (and not the data itself) are used in training:
- As is common practice (Haghighi and Klein, 2006; Mann and McCallum, 2008), we simulate user-specified expectation criteria through statistics on manually labeled citation texts. For extraction criteria, we select for each label, the top N extraction features ordered by mutual information (MI) with that label. [etc.]
In total, 150 extraction constraints and 10 alignment constraints are used.
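A rough, self-contained sketch of how I imagine the MI-based selection works (my reconstruction, not the paper's code): on the small labeled set, score each binary feature against each label, keep the top N per label, and record the empirical label proportion as that constraint's target expectation.

```python
import math
from collections import Counter, defaultdict

def select_ge_constraints(labeled_tokens, n_per_label=10):
    """labeled_tokens: iterable of (feature_set, label) pairs from a small
    hand-labeled set.  Returns {label: [(feature, target_expectation), ...]}
    with the top-N features per label.  My reconstruction of the selection
    heuristic, not code from the paper."""
    feat_counts, label_counts, joint_counts = Counter(), Counter(), Counter()
    total = 0
    for feats, label in labeled_tokens:
        total += 1
        label_counts[label] += 1
        for f in feats:
            feat_counts[f] += 1
            joint_counts[(f, label)] += 1

    def score(f, y):
        # One term of the feature/label mutual information (the co-occurrence
        # term); enough for ranking features in this sketch.
        p_f, p_y = feat_counts[f] / total, label_counts[y] / total
        p_fy = joint_counts[(f, y)] / total
        return p_fy * math.log(p_fy / (p_f * p_y)) if p_fy > 0 else float("-inf")

    constraints = defaultdict(list)
    for y in label_counts:
        ranked = sorted(feat_counts, key=lambda f: score(f, y), reverse=True)
        for f in ranked[:n_per_label]:
            # Target expectation: empirical P(label = y | feature f fires).
            constraints[y].append((f, joint_counts[(f, y)] / feat_counts[f]))
    return dict(constraints)
```

With, say, 15 labels and N = 10 per label this would give 150 extraction constraints, matching the count above.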
- I don't understand the details of how training works with GE.
- Training the alignment CRF first, then using it to train the extraction CRF, was intriguing. Is that a sort of bootstrapping? Does it bias the model to over-value information from the alignments relative to the extraction features? What if the extraction CRF were trained in the normal way, with supervised data?
- They cite the prototype-driven learning paper (Haghighi & Klein 2006), but the setup also reminded me somewhat of "Learning bilingual lexicons from monolingual corpora" (Haghighi et al. 2008), where a translation lexicon was learned in an unsupervised fashion using CCA.
Setup | Known data | Input | Latent | Output |
---|---|---|---|---|
Traditional supervised learning | {(x,y)} | x | -- | y |
Haghighi et al. 2008 | {x1}, {x2} | -- | -- | x1 ↦ x2 |
This paper | {y} | x | x ↦ y* | ŷ |
I wonder how other work fits into this analysis...
- TODO: Read more closely the details of the model/evaluation (§4)