Hwa, 1999
Citation
Rebecca Hwa. 1999. Supervised grammar induction using training data with limited constituent information. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics (ACL '99). Association for Computational Linguistics, Stroudsburg, PA, USA, 73-79.
Online version
http://acl.ldc.upenn.edu/P/P99/P99-1010.pdf
Summary
This paper aims to improve corpus-based grammar induction when the training data contain few labels. For inducing grammars from sparsely labeled training data, the paper proposes an adaptation strategy that produces grammars which parse almost as well as grammars induced from fully labeled corpora.
Since hand-parsed corpora like the Penn Treebank are not easy to build for every domain, the paper proposes adapting a grammar already trained on an old domain to the new domain. Adaptation can exploit the structural similarity between the two domains, so that less labeled data might be needed to update the grammar to reflect the structure of the new domain.
The paper tries to understand the effect of the amount and type of labeled data on the training process for both induction strategies (adaptation and direct induction). For example, how much training data needs to be hand-labeled? Must the parse tree for each sentence be fully specified? Are some linguistic constituents in the parse more informative than others?
To answer these questions, the author performed experiments that compare the parsing quality of grammars induced under different training conditions using both adaptation and direct induction. The experiments vary the number of labeled brackets and the linguistic classes of the labeled brackets. The study is conducted on both the simple Air Travel Information System (ATIS) corpus (Hemphill et al., 1990) and the more complex Wall Street Journal (WSJ) corpus (Marcus et al., 1993).
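As a rough illustration of what "varying the labeled brackets" could look like, the sketch below derives a sparsely bracketed training example from a fully bracketed one by keeping only the brackets that sit high in the tree. The nested-list tree encoding, the height-based filter, and the names `spans_with_height` and `keep_high_brackets` are assumptions made for this example; the paper selects brackets by linguistic class, and its exact procedure is not reproduced here.

```python
from typing import List, Tuple, Union

# Hypothetical encoding: a tree is either a token (str) or a list of subtrees.
Tree = Union[str, List["Tree"]]


def spans_with_height(tree: Tree, start: int = 0):
    """Return (end, height, spans) for the subtree starting at token index `start`.

    Each span is (start, end, height) with `end` exclusive. A token has
    height 0 and contributes no bracket of its own.
    """
    if isinstance(tree, str):
        return start + 1, 0, []
    pos, height, spans = start, 0, []
    for child in tree:
        pos, child_height, child_spans = spans_with_height(child, pos)
        height = max(height, child_height + 1)
        spans.extend(child_spans)
    spans.append((start, pos, height))
    return pos, height, spans


def keep_high_brackets(tree: Tree, min_height: int) -> List[Tuple[int, int]]:
    """Keep only brackets whose height is at least `min_height` (an assumed criterion)."""
    _, _, spans = spans_with_height(tree)
    return sorted((s, e) for s, e, h in spans if h >= min_height)


# Toy ATIS-like sentence: "show me the flights from boston"
example: Tree = ["show", "me", [["the", "flights"], ["from", "boston"]]]

full_brackets = keep_high_brackets(example, 1)    # every bracket in the tree
sparse_brackets = keep_high_brackets(example, 2)  # only the higher-level brackets
```

Here `full_brackets` contains all four brackets of the toy tree, while `sparse_brackets` retains only the sentence-level bracket and the complex noun phrase, which is the kind of reduced supervision the experiments compare against full labeling.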
The results show that the training examples do not need to be fully parsed for either strategy, but adaptation produces better grammars than direct induction when the training data are minimally labeled. For instance, the most informative brackets, which label constituents higher up in the parse trees and typically identify complex noun phrases and sentential clauses, account for only 17% of all constituents in ATIS and 21% in WSJ. Trained on this type of label, the adapted grammars parse better than the directly induced grammars and almost as well as those trained on fully labeled data. Training on ATIS sentences labeled with higher-level constituent brackets, a directly induced grammar parses test sentences with 66% accuracy, whereas an adapted grammar parses with 91% accuracy, which is only 2% lower than the score of a grammar induced from fully labeled training data. Training on WSJ sentences labeled with higher-level constituent brackets, a directly induced grammar parses with 70% accuracy, whereas an adapted grammar parses with 72% accuracy, which is 6% lower than the score of a grammar induced from fully labeled training data.
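To make the comparison easier to read, the small sketch below works out the figures implied by the numbers quoted in this summary. It assumes that "2% lower" and "6% lower" mean percentage points, which the summary does not state explicitly; the dictionary layout and the name `implied_scores` are made up for this example.

```python
# Accuracies (in %) as quoted in the summary, for training on
# higher-level constituent brackets only.
reported = {
    "ATIS": {"direct": 66, "adapted": 91, "gap_to_full": 2},
    "WSJ": {"direct": 70, "adapted": 72, "gap_to_full": 6},
}


def implied_scores(r: dict) -> dict:
    """Derive the implied fully labeled score and the adaptation advantage.

    Assumes the quoted gaps are percentage-point differences.
    """
    return {
        "full": r["adapted"] + r["gap_to_full"],
        "adapted_minus_direct": r["adapted"] - r["direct"],
    }


summary = {corpus: implied_scores(r) for corpus, r in reported.items()}

for corpus, s in summary.items():
    print(f"{corpus}: fully labeled ~{s['full']}%, "
          f"adaptation gain over direct induction {s['adapted_minus_direct']} points")
```

Under that reading, the fully labeled grammars would score about 93% on ATIS and 78% on WSJ, and the advantage of adaptation over direct induction is much larger on ATIS (25 points) than on WSJ (2 points).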