Better Informed Training of Latent Syntactic Features

This paper can be found at: [1]

==Citation==

Markus Dreyer and Jason Eisner. Better informed training of latent syntactic features. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 317–326, Sydney, Australia, 2006. DOI: 10.3115/1610075.1610120

==Summary==

The authors wanted to improve parsing accuracy by introducing hidden features. Linguistic features (such as number and gender) should help constrain parse trees, but the treebank does not annotate them. So the authors used a modified EM algorithm with simulated annealing to induce the features and then construct rules that take advantage of them. The main contribution is an attempt to improve upon the work of Matsuzaki et al. by reducing the number of degrees of freedom to learn, so that the syntactic features can take on a greater range of values. The new method allows less freedom in learning the transfer probabilities. The end result was no improvement over the previous work.

==Previous work==
 
As the paper says, "treebanks never contain enough information". A lot of parsing work had been done on splitting nonterminals so that the important nonterminals could be trained better, but the splitting was mostly done in an ad-hoc fashion until Matsuzaki et al. (2005) introduced [[UsesMethod::PCFGs|PCFG]]-LA (probabilistic context-free grammar with [[latent]] annotations). The authors wanted to incorporate features that propagate only in "linguistically-motivated ways"; by restricting propagation in this way, they avoided having to learn far too many weights. The system works the same way as a PCFG, but there are many more rules. A regular PCFG rule <math>X \rightarrow Y Z</math> is represented in PCFG-LA as <math>X[\alpha] \rightarrow Y[\beta] Z[\gamma]</math>, where <math>\alpha</math>, <math>\beta</math>, and <math>\gamma</math> are feature values (integers). In general, any integer up to <math>L</math> is allowed as a feature value; changing <math>L</math> changes the number of probabilities that must be trained.
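
To make the blow-up concrete, here is a minimal sketch (not from the paper) that enumerates the annotated versions PCFG-LA would derive from a single backbone rule; the function name <code>split_rule</code> and the example rule and value of <math>L</math> are just illustrations.

<pre>
from itertools import product

# Minimal sketch: enumerate the annotated rules that PCFG-LA derives from one
# backbone rule X -> Y Z when each nonterminal carries a latent value in 1..L.
def split_rule(lhs, rhs, L):
    annotated = []
    for values in product(range(1, L + 1), repeat=1 + len(rhs)):
        lhs_ann = "%s[%d]" % (lhs, values[0])
        rhs_ann = ["%s[%d]" % (sym, v) for sym, v in zip(rhs, values[1:])]
        annotated.append((lhs_ann, rhs_ann))
    return annotated

print(len(split_rule("NP", ["DT", "NN"], L=8)))  # 8^3 = 512 annotated rules
</pre>

Every one of those annotated rules needs its own probability, which is why the number of parameters grows so quickly with <math>L</math>.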
  
 
==Improvements in Paper==
 
The main idea in the paper was to reduce the number of weights that need to be learned. This is unequivocally a good thing, since it means that either the model is smaller or more weights can be trained for the same price. The authors decided that there were too many transfer probabilities. Instead of estimating a separate weight for every combination of <math>\alpha</math> with <math>\beta</math> and <math>\gamma</math>, the probabilities they estimated were <math>P(\textrm{pass to head} \mid \textrm{rule})</math>, <math>P(\textrm{pass to both} \mid \textrm{rule})</math>, and <math>P_{ann}(\textrm{feature value} \mid \textrm{nonterminal})</math>. Here <math>P_{ann}</math> is the probability of annotating the nonterminal with some feature value given that it did not inherit the feature value from its parent. This greatly reduces the number of probabilities to estimate, since many of the <math>\alpha, \beta, \gamma</math> combinations are tied together.
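
As a rough sketch of how this factorization works (my own simplification, following the description above rather than the paper's exact parameterization), the probability of an annotated binary rule can be assembled from the pass-to-head and pass-to-both choices plus <math>P_{ann}</math> for a child that does not inherit:

<pre>
# Rough sketch of an INHERIT-style factorization for X[a] -> Y[b] Z[c], where
# Y is the head child.  Simplifying assumption (mine): the only two choices
# are "pass to head only" and "pass to both", with probabilities summing to 1.
def annotated_rule_prob(a, b, c, p_pass_head, p_pass_both, p_ann_Z):
    prob = 0.0
    if b == a:                            # head child inherits parent's value
        prob += p_pass_head * p_ann_Z[c]  # non-head child is annotated freshly
    if b == a and c == a:                 # both children inherit
        prob += p_pass_both
    return prob

# Per backbone rule this needs only the two pass probabilities; the p_ann
# table (one entry per feature value) is shared across rules.
p_ann_Z = {1: 0.5, 2: 0.3, 3: 0.2}
print(annotated_rule_prob(2, 2, 3, p_pass_head=0.6, p_pass_both=0.4,
                          p_ann_Z=p_ann_Z))  # 0.6 * 0.2 = 0.12
</pre>

Summing over all values of <math>b</math> and <math>c</math> gives 1 whenever the two pass probabilities sum to 1, so the sketch is a proper distribution over the children's feature values while using far fewer parameters than a full table over <math>\alpha, \beta, \gamma</math>.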
One problem was that this latent-feature approach is a non-convex optimization problem. As in other work on non-convex optimization, the authors started from a point derived from an easier-to-train model and improved it from there.
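
One common way to realize this kind of initialization (illustrative only; the paper's actual initialization and annealing schedule may differ) is to copy each parameter of the simpler, unannotated model to all of its annotated versions and add a little noise to break the symmetry between latent values:

<pre>
import random

# Illustrative initialization for a latent-variable grammar: start from the
# parameters of a plain (unannotated) model and perturb each annotated copy
# slightly so that EM can tell the latent values apart.
def expand_parameters(base_probs, L, noise=0.01, seed=0):
    rng = random.Random(seed)
    expanded = {}
    for rule, p in base_probs.items():
        for value in range(1, L + 1):
            expanded[(rule, value)] = p * (1.0 + rng.uniform(-noise, noise))
    return expanded
</pre>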
What about multiple features? Linguistically, it makes sense to include more than one feature that can take multiple values: for example, we may want both a number feature and a gender feature, which a single latent value does not represent well. The authors implemented multiple features by treating each feature independently. Not much was said about the details of this model, but it did not perform well enough.
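
Under that independence assumption, the joint annotation probability is just the product of one single-feature model per feature. A small sketch (the feature names and numbers below are made up for illustration):

<pre>
# Each feature gets its own annotation model; the joint probability of a full
# annotation is the product of the per-feature probabilities (independence).
def joint_annotation_prob(feature_values, per_feature_models, nonterminal):
    prob = 1.0
    for name, value in feature_values.items():
        prob *= per_feature_models[name][nonterminal][value]
    return prob

models = {
    "number": {"NN": {"sg": 0.7, "pl": 0.3}},
    "gender": {"NN": {"masc": 0.5, "fem": 0.5}},
}
print(joint_annotation_prob({"number": "pl", "gender": "fem"}, models, "NN"))  # 0.15
</pre>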
  
 
==Experimental Results==
 
The authors implemented PCFG-LA and compared it to their model, INHERIT. INHERIT performed practically the same as PCFG-LA, although it did use fewer weights. There were some successes: the model was able to differentiate somewhat between singular and plural nouns, which indicates that the latent features captured at least some useful distinctions. Future work would involve using a conditional model to learn the probabilities in order to train the weights better.
  
  
 
[[RelatedPaper::Matsuzaki et al, ACL 2005]]
 
Researchers who tried to get more information from treebank parses:
* [[Charniak, NCAI 1996]]
* [[Eisner, COLING 1996]]
