Accurate Unlexicalized Parsing

From Cohen Courses

Revision as of 11:35, 1 November 2011

This paper is a work in progress by Francis Keith

Citation

"Accurate Unlexicalized Parsing", D. Klein and C. D. Manning, ACL 2003

Online Version

An online version of this paper is available here [1]

Summary

One concept in PCFG parsing that has become fairly standard is lexicalizing the grammar; that is, annotating each node with a head word. The authors of the paper focus instead on unlexicalized grammars, and attempt to exploit structural context in an effort to produce improved results that can potentially be combined with lexicalization to improve the state of the art.

Lexicalized vs Unlexicalized

Some of the methods they use seem very similar to what goes on in a lexicalized PCFG. An important distinction between the two concerns content words versus function words. The argument is made that linguists will often annotate nodes that have a different functional head, as opposed to a different content head. That is, the authors are attempting to leverage the linguistic structure of a phrase, not information memorized from training data. Annotating a node by its functional head (for example, distinguishing a finite verb phrase from an infinitival one) marks a genuine structural difference from a linguistic standpoint; annotating using content heads is what the authors are seeking to avoid.

Method

They apply an iterative series of non-lexicalized improvements to the grammar and report the improving F1 scores. They use the standard CKY parsing algorithm for their experiments, and the grammar probabilities are given as unsmoothed maximum likelihood estimates.
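Since the probabilities are unsmoothed relative frequencies, that estimation step can be sketched in a few lines (the rule instances below are toy examples, not Treebank counts):

```python
from collections import Counter

def mle_rule_probs(rules):
    """Unsmoothed maximum likelihood estimate of PCFG rule probabilities:
    P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {(lhs, rhs): count / lhs_counts[lhs]
            for (lhs, rhs), count in rule_counts.items()}

# Toy rule instances as they might be read off treebank trees.
observed = [("NP", ("DT", "NN")), ("NP", ("DT", "NN")), ("NP", ("NNP",))]
probs = mle_rule_probs(observed)
# P(NP -> DT NN) = 2/3, P(NP -> NNP) = 1/3
```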

Markovization

The first step that they take is to markovize the rules. In markovization, we examine the vertical and horizontal ancestors of the current node, conditioning each node on <math>v</math> vertical ancestors and <math>h</math> horizontal ancestors. In the original Treebank grammar, <math>v = 1, h = \infty</math>, because only the current node is considered vertically (i.e., no parent node history is stored), and each node is conditioned on all the nodes in the rewrite rule that came before it.
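A minimal sketch of what the two markovization knobs do to symbols and rules, assuming a simple string-based naming scheme (the @NP[...] and ^ notation here is illustrative, not the paper's exact notation):

```python
def parent_annotate(label, ancestors, v=2):
    """Vertical markovization: annotate a node with its v-1 nearest
    vertical ancestors; v=1 keeps the bare Treebank label."""
    return "^".join([label] + ancestors[: v - 1])

def markovize_rule(lhs, rhs, h=2):
    """Horizontal markovization: binarize lhs -> rhs left to right,
    remembering only the last h already-generated children in each
    intermediate symbol (h = len(rhs) recovers the full history)."""
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    rules, parent = [], lhs
    for i in range(len(rhs) - 2):
        history = rhs[max(0, i + 1 - h): i + 1]  # last h generated children
        new_sym = "@%s[%s]" % (lhs, "_".join(history))
        rules.append((parent, (rhs[i], new_sym)))
        parent = new_sym
    rules.append((parent, (rhs[-2], rhs[-1])))
    return rules

# v=2 parent annotation of an NP under S, and h=2 binarization of a flat NP.
print(parent_annotate("NP", ["S"]))                        # NP^S
print(markovize_rule("NP", ["DT", "JJ", "JJ", "NN"], h=2))
```

Lowering <math>h</math> makes the intermediate symbols collapse across rules (with h=1, every binarization step whose last generated child is JJ shares the symbol @NP[JJ]), which is exactly the symbol-count reduction the authors trade against accuracy.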


In their experiments, they found that <math>v \leq 2, h \leq 2</math> was optimal: while it did not perform the best of the settings they tried, it had significantly fewer symbols, which would allow them to do further rule splitting while still keeping the grammar size manageable. It is essentially the best compromise between the improved performance from markovization and grammar complexity.

Tag Splitting

The remainder of the methods involve splitting the grammar up and adding additional annotations.

Unary

They added annotations to:

  • Unary-Internal: Any nonterminal with only one child
  • Unary-External: Any nonterminal that is an only child (discarded)
  • Unary-DT: Any determiner that is an only child (subset of Unary-External that provided a benefit)
  • Unary-RB: Any adverb that is an only child (subset of Unary-External that provided a benefit)
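A sketch of how the annotations that were kept (Unary-Internal, Unary-DT, Unary-RB) could be applied to a bracketed tree; the -U and ^U suffix names are illustrative, not the paper's:

```python
def annotate_unaries(tree, is_only_child=False):
    """tree = (label, children), where children are subtrees or word strings.
    Adds -U to any nonterminal with a single subtree child (Unary-Internal)
    and ^U to DT/RB preterminals that are only children (Unary-DT/Unary-RB)."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], tuple):
        label += "-U"                       # Unary-Internal
    if is_only_child and label in ("DT", "RB"):
        label += "^U"                       # Unary-DT / Unary-RB
    only = len(children) == 1
    return (label, [annotate_unaries(c, only) if isinstance(c, tuple) else c
                    for c in children])

# "this" as a one-word NP: every node on the unary chain gets annotated.
print(annotate_unaries(("S", [("NP", [("DT", ["this"])])])))
```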

Experiments

They do the parsing evaluation using the Penn Treebank, with sections 2-21 for training, section 22 for development, and section 23 for evaluation.