Accurate Unlexicalized Parsing
Citation
"Accurate Unlexicalized Parsing", D. Klein and C. D. Manning, ACL 2003
Online Version
An online version of this paper is available here: http://www.ark.cs.cmu.edu/LS2/images/d/da/KleinManning2003.pdf
Summary
One concept in PCFG parsing that has become fairly standard is lexicalizing the grammar; that is, annotating each node with its head word. The authors instead focus on unlexicalized grammars and attempt to exploit structural context, aiming at improved results that could potentially be combined with lexicalization to advance the state of the art. They also apply linguistic knowledge to modify the structure of the grammar.
Lexicalized vs Unlexicalized
Some of the methods they use seem very similar to what goes on in a lexicalized PCFG. An important distinction between the two concerns content words versus function words. The argument is that linguists will often distinguish nodes by their functional heads rather than by their content heads. That is, the authors attempt to leverage the linguistic structure of a phrase rather than lexical statistics memorized from training data: phrases with different functional heads are structurally different from a linguistic standpoint, whereas annotating with specific content words is what the authors seek to avoid.
Method
They apply a series of non-lexicalized improvements to the grammar, reporting the improving F1 score after each one. They run their experiments with the standard CKY parsing algorithm, and the grammar probabilities are unsmoothed maximum likelihood estimates.
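For context, here is a minimal Viterbi CKY sketch over a binarized PCFG (my own illustration, not the authors' implementation; it assumes log probabilities, a start symbol S, and a grammar already reduced to binary rules):

```python
import math
from collections import defaultdict

def cky_viterbi(words, lexicon, binary_rules, start="S"):
    """Viterbi CKY for a binarized PCFG.
    lexicon: {(tag, word): logprob}; binary_rules: {(A, B, C): logprob}."""
    n = len(words)
    chart = defaultdict(lambda: -math.inf)  # (i, j, label) -> best logprob
    for i, w in enumerate(words):           # fill in preterminals
        for (tag, word), lp in lexicon.items():
            if word == w:
                chart[i, i + 1, tag] = max(chart[i, i + 1, tag], lp)
    for span in range(2, n + 1):            # combine adjacent spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (a, b, c), lp in binary_rules.items():
                    score = lp + chart[i, k, b] + chart[k, j, c]
                    if score > chart[i, j, a]:
                        chart[i, j, a] = score
    return chart[0, n, start]               # log prob of best parse, or -inf

# Toy grammar in which every rule has probability 1 (log prob 0):
lexicon = {("NP", "he"): 0.0, ("V", "ate"): 0.0, ("NP", "fish"): 0.0}
rules = {("S", "NP", "VP"): 0.0, ("VP", "V", "NP"): 0.0}
print(cky_viterbi("he ate fish".split(), lexicon, rules))  # 0.0
```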
Markovization
The first step that they take is to markovize the rules. In markovization, we examine the vertical and horizontal ancestors of the current node, keeping v vertical and h horizontal ancestors. The original Treebank grammar corresponds to v = 1, h = ∞: only the current node is considered vertically (no parent history is stored), and each child is conditioned on all the siblings that came before it in the rewrite rule.
In their experiments, they found v ≤ 2, h ≤ 2 to be the best choice: while it did not perform best among the settings they tried, it produced significantly fewer symbols, leaving room for further rule splitting while keeping the grammar size manageable. It is essentially the best compromise between the accuracy gains of markovization and grammar complexity.
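To make the two settings concrete, here is a minimal sketch (my own illustration; the symbol-naming conventions NP^S and @VP[...] are assumptions, not the paper's exact notation) of parent annotation (v = 2) and order-h horizontal markovization via binarization:

```python
def parent_annotate(node, parent=None):
    """v = 2 vertical markovization: each nonterminal also remembers
    its parent's label, e.g. an NP under S becomes NP^S. POS tags are
    left alone here; the paper treats those separately (TAG-PA)."""
    label, children = node
    is_preterminal = len(children) == 1 and isinstance(children[0], str)
    new_label = label if parent is None or is_preterminal else f"{label}^{parent}"
    return (new_label,
            [c if isinstance(c, str) else parent_annotate(c, label)
             for c in children])

def binarize_rule(label, kid_labels, h=2):
    """Right-binarize A -> B1 ... Bn; each intermediate symbol remembers
    at most the last h sibling labels (horizontal markovization)."""
    if len(kid_labels) <= 2:
        return [(label, tuple(kid_labels))]
    rules, prev, history = [], label, []
    for b in kid_labels[:-2]:
        history = (history + [b])[-h:]
        inter = f"@{label}[{','.join(history)}]"
        rules.append((prev, (b, inter)))
        prev = inter
    rules.append((prev, (kid_labels[-2], kid_labels[-1])))
    return rules

# VP -> VBD NP PP PP becomes three binary rules whose intermediate
# symbols forget everything but the last two generated siblings:
for rule in binarize_rule("VP", ["VBD", "NP", "PP", "PP"], h=2):
    print(rule)
# ('VP', ('VBD', '@VP[VBD]'))
# ('@VP[VBD]', ('NP', '@VP[VBD,NP]'))
# ('@VP[VBD,NP]', ('PP', 'PP'))
```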
Tag Splitting
The remaining methods involve splitting grammar categories and adding further annotations.
They added various annotations for:
- UNARY-INTERNAL: Any nonterminal with only one child
- UNARY-EXTERNAL: Any nonterminal that is an only child (discarded)
- UNARY-DT: Any determiner that is an only child (a subset of UNARY-EXTERNAL that provided a benefit)
- UNARY-RB: Any adverb that is an only child (a subset of UNARY-EXTERNAL that provided a benefit)
- TAG-PA: They annotate each POS tag with the label of its parent node
- SPLIT-IN: They split the IN tag into subcategories based on linguistic knowledge
- SPLIT-CC: They split conjunctions into but/and versus all others (the two classes have different linguistic properties)
- SPLIT-AUX: They split auxiliary verbs, distinguishing forms of "be" and "have"
- TMP-NP: Keeps the temporal tag on noun phrases
- GAPPED-S: Marks S nodes that have an empty subject
- POSS-NP: Marks possessive NPs
- SPLIT-VP: Splits the VP annotation by marking it with its head tag; they then merge all the finite forms into a single tag VBF, to avoid full lexicalization
- Comment: This seems almost like lexicalization...
- BASE-NP: Annotate non-recursive NPs
- DOMINATES-V: Mark nodes that dominate a verb
- RIGHT-REC-NP: Mark all NPs that contain an NP as their rightmost descendant
- This is an attempt to model depth
Note that none of these annotations stem from lexicalization (with the possible exception of SPLIT-VP); they are grounded in linguistic understanding. A sketch of how such annotations could be applied is shown below.
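As a rough sketch (my own code, not the paper's implementation; the "-U" and "^parent" marker syntax is hypothetical), two of the simpler annotations, UNARY-INTERNAL and TAG-PA, could be applied in one pass over a tree:

```python
def annotate(node, parent=None):
    """One pass applying two of the annotations:
    UNARY-INTERNAL marks nonterminals with a single child ("-U");
    TAG-PA marks each POS tag with its parent's label ("^parent")."""
    label, children = node
    is_preterminal = len(children) == 1 and isinstance(children[0], str)
    new_label = label
    if is_preterminal and parent is not None:
        new_label += f"^{parent}"      # TAG-PA
    elif len(children) == 1:
        new_label += "-U"              # UNARY-INTERNAL
    return (new_label,
            [c if isinstance(c, str) else annotate(c, label)
             for c in children])

# (S (NP (PRP he)) (VP (VBD ran))) ->
# ('S', [('NP-U', [('PRP^NP', ['he'])]), ('VP-U', [('VBD^VP', ['ran'])])])
tree = ("S", [("NP", [("PRP", ["he"])]), ("VP", [("VBD", ["ran"])])])
print(annotate(tree))
```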
Experiments
They evaluate parsing on the Penn Treebank, using sections 2-21 for training, section 22 for development, and section 23 as the test set. They report the F1 score as each additional annotation is added on the development set, as well as the final score on the test set.
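For reference, F1 here is the harmonic mean of labeled-bracket precision and recall; a minimal sketch (my own gloss, not the paper's evaluation code):

```python
def bracket_f1(gold_spans, test_spans):
    """Labeled bracketing F1: harmonic mean of precision and recall
    over (label, start, end) constituent spans."""
    gold, test = set(gold_spans), set(test_spans)
    matched = len(gold & test)
    if matched == 0:
        return 0.0
    precision = matched / len(test)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)

# Two of three proposed constituents match the gold tree -> F1 = 2/3.
gold = {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)}
test = {("S", 0, 3), ("NP", 0, 1), ("NP", 2, 3)}
print(round(bracket_f1(gold, test), 3))  # 0.667
```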
In this case, the baseline results are obtained with a markovization of v ≤ 2, h ≤ 2. The changes in F1 are given with each annotation applied both cumulatively and in isolation.
Results on the development set:
Annotation | Size | F1 | ΔF1 (total) | ΔF1 (individual) |
---|---|---|---|---|
Baseline | 7619 | 77.77 | N/A | N/A |
UNARY-INTERNAL | 8065 | 78.32 | 0.55 | 0.55 |
UNARY-DT | 8066 | 78.48 | 0.71 | 0.17 |
UNARY-RB | 8069 | 78.86 | 1.09 | 0.43 |
TAG-PA | 8520 | 80.62 | 2.85 | 2.52 |
SPLIT-IN | 8541 | 81.19 | 3.42 | 2.12 |
SPLIT-AUX | 9034 | 81.66 | 3.89 | 0.57 |
SPLIT-CC | 9190 | 81.69 | 3.92 | 0.12 |
SPLIT-% | 9255 | 81.81 | 4.04 | 0.14 |
TMP-NP | 9594 | 82.25 | 4.48 | 1.07 |
GAPPED-S | 9741 | 82.28 | 4.51 | 0.17 |
POSS-NP | 9820 | 83.06 | 5.29 | 0.28 |
SPLIT-VP | 10499 | 85.72 | 7.95 | 1.36 |
BASE-NP | 11660 | 86.04 | 8.27 | 0.73 |
DOMINATES-V | 14097 | 86.91 | 9.14 | 1.42 |
RIGHT-REC-NP | 15276 | 87.04 | 9.27 | 1.94 |
They get a final test set F1 of 86.32%, which is close to that of early lexicalized parsers, but somewhat lower than more recent lexicalized parsers.
Conclusion
Using only grammar-structure manipulations, they achieved a vast improvement over the baseline. There may well be an intelligent way to combine the linguistic structure they exploit with lexical features.
Related Work
Many strides have been made in unlexicalized parsing since this paper appeared, many of them by its authors.
- Improved Inference for Unlexicalized Parsing, Slav Petrov and Dan Klein, HLT-NAACL 2007 - Improves the inference procedure for unlexicalized parsing
- Learning Accurate, Compact, and Interpretable Tree Annotation, S. Petrov, L. Barrett, R. Thibaux, D. Klein, ACL 2006 - Learns tree annotations automatically, producing results that surpass lexicalized parsers
- Efficient, Feature-based, Conditional Random Field Parsing, J. R. Finkel, A. Kleeman, and C. D. Manning, ACL 2008 - A similar line of improvement that instead uses a feature-based CRF approach to improve over the fully lexicalized case