Hwa, 1999
Citation
Rebecca Hwa. 1999. Supervised grammar induction using training data with limited constituent information. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics (ACL '99). Association for Computational Linguistics, Stroudsburg, PA, USA, 73-79.
Online version
http://acl.ldc.upenn.edu/P/P99/P99-1010.pdf
Summary
This paper aims to improve corpus-based grammar induction when the training data contain few labels. For inducing grammars from sparsely labeled training data, the paper proposes an adaptation strategy that produces grammars that parse almost as well as grammars induced from fully labeled corpora.
Since hand-parsed corpora like the Penn Treebank are not easy to build for every domain, the paper proposes adapting a grammar already trained on an old domain to the new domain. Adaptation can exploit the structural similarity between the two domains, so that fewer labeled data may be needed to update the grammar to reflect the structure of the new domain.
The paper tries to understand how the amount and type of labeled data affect the training process for both induction strategies. For example, how much training data needs to be hand-labeled? Must the parse tree for each sentence be fully specified? Are some linguistic constituents in the parse more informative than others?
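The last question amounts to training on parses where only certain constituent classes are kept. A minimal sketch of how such sparsely labeled data could be simulated from fully bracketed trees (the function name and labels are illustrative, not from the paper):

```python
# Simulate partially labeled training data by keeping only brackets
# of selected constituent classes. Each bracket is (label, start, end).
def sparsify(brackets, keep_labels):
    """Return only the brackets whose constituent label is in keep_labels."""
    return [b for b in brackets if b[0] in keep_labels]

# Fully bracketed toy parse of a 6-word sentence.
tree = [("S", 0, 6), ("NP", 0, 2), ("VP", 2, 6), ("NP", 3, 5)]

# Keep only noun-phrase brackets, as in a "label only NPs" condition.
np_only = sparsify(tree, {"NP"})
# np_only == [("NP", 0, 2), ("NP", 3, 5)]
```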
To answer these questions, the authors performed experiments comparing the parsing quality of grammars induced under different training conditions using both adaptation and direct induction, varying the number of labeled brackets and the linguistic classes of the labeled brackets. The study is conducted on both a simple Air Travel Information System (ATIS) corpus (Hemphill et al., 1990) and the more complex Wall Street Journal (WSJ) corpus (Marcus et al., 1993).
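Parsing quality in this line of work is typically scored with PARSEVAL-style bracket precision and recall against the gold-standard parse. A minimal sketch of unlabeled bracket scoring for one sentence (a generic illustration, not the paper's exact metric):

```python
def bracket_prf(gold, pred):
    """Unlabeled bracket precision, recall, and F1 for one sentence.

    gold, pred: sets of (start, end) spans from the gold and predicted parses.
    """
    gold, pred = set(gold), set(pred)
    matched = len(gold & pred)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: predicted parse recovers 2 of 3 gold brackets.
gold = {(0, 5), (0, 2), (3, 5)}
pred = {(0, 5), (0, 3), (3, 5)}
p, r, f = bracket_prf(gold, pred)  # p == r == f == 2/3
```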