Difference between revisions of "Hwa et al, 1999"


Revision as of 21:47, 1 November 2011

Citation

Rebecca Hwa. 1999. Supervised grammar induction using training data with limited constituent information. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics (ACL '99). Association for Computational Linguistics, Stroudsburg, PA, USA, 73-79.

Online version

http://acl.ldc.upenn.edu/P/P99/P99-1010.pdf

Summary

This paper aims to improve corpus-based grammar induction when the training data has few labels. For inducing grammars from sparsely labeled training data, the paper proposes an adaptation strategy that produces grammars which parse almost as well as grammars induced from fully labeled corpora.

Since hand-parsed corpora like the Penn Treebank are not easy to build for every domain, the paper proposes adapting a grammar already trained on an old domain to the new domain. Adaptation can exploit the structural similarity between the two domains, so that less labeled data might be needed to update the grammar to reflect the structure of the new domain.

The paper tries to understand the effect of the amount and type of labeled data on the training process for both induction strategies (direct induction and adaptation). For example, how much training data needs to be hand-labeled? Must the parse tree for each sentence be fully specified? Are some linguistic constituents in the parse more informative than others?