Difference between revisions of "Chiang 2005"
(9 intermediate revisions by the same user not shown) | |||
Line 9: | Line 9: | ||
== Summary == | == Summary == | ||
− | This [[Category::paper]] introduces a [[UsesMethod::Hierarchical phrase-based translation|hierarchical phrase-based]] model for statistical [[AddressesProblem::Machine Translation|machine translation]]. It describes a training method to extract the synchronous context-free grammar rules from parallel text and also an efficient decoding algorithm to apply | + | This [[Category::paper]] introduces a [[UsesMethod::Hierarchical phrase-based translation|hierarchical phrase-based]] model for statistical [[AddressesProblem::Machine Translation|machine translation]]. It describes a training method to extract the synchronous context-free grammar (synchronous CFG) rules from parallel text and also an efficient decoding algorithm to apply them. Using BLEU as a metric, the model is shown to outperform previous state-of-the-art phrase-based systems. |
=== Rule features === | === Rule features === | ||
− | According to the proposed [[UsesMethod::Hierarchical phrase-based translation#Definition|hierarchical model]], the weights of the rules: <math> X \rightarrow \left \langle \gamma , \alpha \right \rangle </math> are determined by a log-linear model: | + | According to the proposed [[UsesMethod::Hierarchical phrase-based translation#Definition|hierarchical model]], the weights of the synchronous CFG rules: <math> X \rightarrow \left \langle \gamma , \alpha \right \rangle </math> are determined by a log-linear model: |
[[File:f2.png]] | [[File:f2.png]] | ||
− | The <math>\, \phi_{i} \,</math> features used | + | The <math>\, \phi_{i} \,</math> features used in this paper were taken from analogous features of conventional phrase-based systems: |
* <math>\, P(\gamma | \alpha) \,</math>, from the noisy-channel model | * <math>\, P(\gamma | \alpha) \,</math>, from the noisy-channel model | ||
* <math>\, P(\alpha | \gamma) \,</math>, proven to be a helpful feature | * <math>\, P(\alpha | \gamma) \,</math>, proven to be a helpful feature | ||
Line 25: | Line 25: | ||
=== Training === | === Training === | ||
− | The training process starts with a word-aligned corpus and produces "initial phrase pairs" using conventional phrase-based methods (from [[RelatedPaper::Koehn et. al. 2003]] and [[RelatedPaper::Och and Ney 2004]]). Then, it | + | The training process starts with a word-aligned corpus and produces "initial phrase pairs" using conventional phrase-based methods (from [[RelatedPaper::Koehn et. al. 2003]] and [[RelatedPaper::Och and Ney 2004]]). Then, it extracts a set of rules from these phrase pairs, formally defined as the smallest set satisfying the following: |
1. If <math> \left \langle f , e \right \rangle </math> is an initial phrase pair, then <math> X \rightarrow \left \langle f , e \right \rangle </math> is a rule. | 1. If <math> \left \langle f , e \right \rangle </math> is an initial phrase pair, then <math> X \rightarrow \left \langle f , e \right \rangle </math> is a rule. | ||
Line 31: | Line 31: | ||
2. If <math>r = X \rightarrow \left \langle \gamma , \alpha \right \rangle </math> is a rule and <math> \left \langle f , e \right \rangle </math> is an initial phrase pair such that <math>\, \gamma = \gamma_{1} \, f \, \gamma_{2} \,</math> and <math>\, \alpha = \alpha_{1} \, e \, \alpha_{2} \,</math>, then <math> X \rightarrow \left \langle \, \gamma_{1} \, X_{k} \, \gamma_{2} \, , \, \alpha_{1} \, X_{k} \, \alpha_{2} \, \right \rangle </math> is a rule. | 2. If <math>r = X \rightarrow \left \langle \gamma , \alpha \right \rangle </math> is a rule and <math> \left \langle f , e \right \rangle </math> is an initial phrase pair such that <math>\, \gamma = \gamma_{1} \, f \, \gamma_{2} \,</math> and <math>\, \alpha = \alpha_{1} \, e \, \alpha_{2} \,</math>, then <math> X \rightarrow \left \langle \, \gamma_{1} \, X_{k} \, \gamma_{2} \, , \, \alpha_{1} \, X_{k} \, \alpha_{2} \, \right \rangle </math> is a rule. | ||
− | This procedure generates too many rules, making training and decoding very slow and creating spurious ambiguity. | + | This procedure generates too many rules, making training and decoding very slow and creating spurious ambiguity. Therefore, the grammar is filtered according to a set of principles designed to balance grammar size and performance on a development set: |
+ | |||
+ | * keep only the smallest initial phrase pairs containing the same set of alignment points | ||
+ | * limit initial phrases to a length of 10 | ||
+ | * limit rules to a length of 5 (terminals plus non-terminals) on the source-language right-hand side | ||
+ | * discard rules with more than 2 non-terminals | ||
+ | * discard rules with adjacent non-terminals in the right-hand side | ||
+ | * keep only rules with at least one pair of aligned words | ||
Regarding the rule weights, since the training process extracts many rules from a single initial phrase pair, it distributes weight equally among intial phrase pairs but distribute that weight equally among the related rules. | Regarding the rule weights, since the training process extracts many rules from a single initial phrase pair, it distributes weight equally among intial phrase pairs but distribute that weight equally among the related rules. | ||
Line 43: | Line 50: | ||
== Experimental results == | == Experimental results == | ||
− | The model was tested on Mandarin-to-English translation, using the [[UsesDataset::FBIS corpus]] for the translation model, the 2002 [[UsesDataset::NIST MT]] evaluation dataset as the development set and the 2003 test set as the test set. Three different systems were compared: the baseline system (Pharaoh, the current state-of-the-art phrase-based system), the hierarchical model and an "enhanced" hierarchical model using a constituent feature | + | The model was tested on Mandarin-to-English translation, using the [[UsesDataset::FBIS corpus]] for the translation model, the 2002 [[UsesDataset::NIST MT]] evaluation dataset as the development set and the 2003 test set as the test set. The training process obtained a grammar of 24M rules in the first step, which were filtered and reduced to 2.2M rules after using the development set. The following figure shows a selection of extracted rules and their ranks: |
+ | |||
+ | [[File:f7.png]] | ||
+ | |||
+ | Three different systems were compared: the baseline system (Pharaoh, the current state-of-the-art phrase-based system), the hierarchical model and an "enhanced" hierarchical model using a constituent feature. The following table shows the results: | ||
[[File:f6.png]] | [[File:f6.png]] | ||
− | The | + | The paper concludes that hierarchical phrase pairs significantly improve translation accuracy compared to conventional phrase-based systems. They also make easier the incorporation of syntactic information, however, the experiment of adding a constituent feature did not provide a statistically significant gain. |
− | |||
− | |||
== Related papers == | == Related papers == | ||
Line 55: | Line 64: | ||
A follow-up paper, [[RelatedPaper::Chiang 2007]], expands this one with a detailed description and experimental results of the decoding algorithm ("cube pruning"). | A follow-up paper, [[RelatedPaper::Chiang 2007]], expands this one with a detailed description and experimental results of the decoding algorithm ("cube pruning"). | ||
− | Enhanced versions of this model have been described in several papers, such as [[RelatedPaper::Watanabe et al., EMNLP 2007. Online Large-Margin Training for Statistical Machine Translation]] and [[RelatedPaper::A Discriminative Latent Variable Model for SMT]]. | + | Enhanced versions of this model have been described in several papers, such as [[RelatedPaper::Watanabe et al., EMNLP 2007. Online Large-Margin Training for Statistical Machine Translation]] and [[RelatedPaper::A Discriminative Latent Variable Model for SMT]], where they incorporate new sets of features for the hierarchical model. |
Latest revision as of 02:29, 2 November 2011
Contents
Citation
Chiang, D. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 263–270, Ann Arbor. Association for Computational Linguistics.
Online version
Information Sciences Institute, University of Southern California
Summary
This paper introduces a hierarchical phrase-based model for statistical machine translation. It describes a training method to extract the synchronous context-free grammar (synchronous CFG) rules from parallel text and also an efficient decoding algorithm to apply them. Using BLEU as a metric, the model is shown to outperform previous state-of-the-art phrase-based systems.
Rule features
According to the proposed hierarchical model, the weights of the synchronous CFG rules: are determined by a log-linear model:
The features used in this paper were taken from analogous features of conventional phrase-based systems:
- , from the noisy-channel model
- , proven to be a helpful feature
- lexical weights which estimate how well the words in translate the words in
- a phrase penalty to allow the model assign preferences for longer or shorter derivations.
Training
The training process starts with a word-aligned corpus and produces "initial phrase pairs" using conventional phrase-based methods (from Koehn et. al. 2003 and Och and Ney 2004). Then, it extracts a set of rules from these phrase pairs, formally defined as the smallest set satisfying the following:
1. If is an initial phrase pair, then is a rule.
2. If is a rule and is an initial phrase pair such that and , then is a rule.
This procedure generates too many rules, making training and decoding very slow and creating spurious ambiguity. Therefore, the grammar is filtered according to a set of principles designed to balance grammar size and performance on a development set:
- keep only the smallest initial phrase pairs containing the same set of alignment points
- limit initial phrases to a length of 10
- limit rules to a length of 5 (terminals plus non-terminals) on the source-language right-hand side
- discard rules with more than 2 non-terminals
- discard rules with adjacent non-terminals in the right-hand side
- keep only rules with at least one pair of aligned words
Regarding the rule weights, since the training process extracts many rules from a single initial phrase pair, it distributes weight equally among intial phrase pairs but distribute that weight equally among the related rules.
Decoding
The decoder process uses a CKY parser with beam search together with a postprocessor for mapping source-language derivations to target-language derivations. Given a sentence in the source language , it finds the best derivation (or N best derivations) that generates for some .
The search space is pruned in several ways: an item that has a score worse than times the best score in the same cell is discarded; and an item that is worse than the -th best item in the same cell is also discarded. The values of and are chosen to balance speed and performance in a development set.
Experimental results
The model was tested on Mandarin-to-English translation, using the FBIS corpus for the translation model, the 2002 NIST MT evaluation dataset as the development set and the 2003 test set as the test set. The training process obtained a grammar of 24M rules in the first step, which were filtered and reduced to 2.2M rules after using the development set. The following figure shows a selection of extracted rules and their ranks:
Three different systems were compared: the baseline system (Pharaoh, the current state-of-the-art phrase-based system), the hierarchical model and an "enhanced" hierarchical model using a constituent feature. The following table shows the results:
The paper concludes that hierarchical phrase pairs significantly improve translation accuracy compared to conventional phrase-based systems. They also make easier the incorporation of syntactic information, however, the experiment of adding a constituent feature did not provide a statistically significant gain.
Related papers
A follow-up paper, Chiang 2007, expands this one with a detailed description and experimental results of the decoding algorithm ("cube pruning").
Enhanced versions of this model have been described in several papers, such as Watanabe et al., EMNLP 2007. Online Large-Margin Training for Statistical Machine Translation and A Discriminative Latent Variable Model for SMT, where they incorporate new sets of features for the hierarchical model.