Difference between revisions of "Chiang 2005"

Revision as of 01:07, 2 November 2011

Citation

Chiang, D. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 263–270, Ann Arbor. Association for Computational Linguistics.

Online version

Information Sciences Institute, University of Southern California

Summary

This paper presents a statistical phrase-based machine translation model that uses hierarchical phrases (phrases that contain subphrases). The model is formally syntax-based because it uses Synchronous Context-Free Grammars (synchronous CFG) but not linguistically syntax-based because the grammar is learned from a parallel text without using any linguistic annotations or assumptions. Using BLEU as a metric, it is shown to outperform previous state-of-the-art phrase-based systems.

Motivation

The hierarchical model is motivated by the inability of conventional phrase-based models to learn reorderings of phrases (they only learn local reorderings of words). For example, consider the following Mandarin sentence:

Aozhou    shi yu   Bei   Han   you  bangjiao             de   shaoshu guojia    zhiyi
Australia is  with North Korea have diplomatic relations that few     countries one of

(Australia is one of the few countries that have diplomatic relations with North Korea)

the typical output of a conventional phrase-based system would be:

Australia is diplomatic relations with North Korea is one of the few countries

because it is able to do the local reorderings of "diplomatic ... Korea" and "one ... countries" but fails to perform the inversion of the two groups.

To solve this problem, the paper proposes pairs of hierarchical phrases that consist of both words and subphrases. These pairs are formally defined as productions of a synchronous CFG. For the example above, the following productions are sufficient to translate the sentence correctly:
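As a sketch of what such productions look like, Chiang (2005) gives rules of the following form for this sentence. Linked non-terminals are written here with numeric subscripts, which is a notational choice of this summary rather than the paper's own typography:

```latex
\begin{align*}
X &\rightarrow \langle\, \text{yu } X_1 \text{ you } X_2,\ \text{have } X_2 \text{ with } X_1 \,\rangle \\
X &\rightarrow \langle\, X_1 \text{ de } X_2,\ \text{the } X_2 \text{ that } X_1 \,\rangle \\
X &\rightarrow \langle\, X_1 \text{ zhiyi},\ \text{one of } X_1 \,\rangle
\end{align*}
```

The first rule swaps the two subphrases around "have ... with", the second inverts the relative clause marked by "de", and the third moves "zhiyi" ("one of") to the front of its phrase.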

The synchronous CFG model

Following the definition of synchronous CFGs, the basic elements of the model are weighted rewrite rules with aligned pairs of right-hand sides, of the form $X\rightarrow \left\langle \gamma ,\alpha ,\sim \right\rangle$, where $X$ is a non-terminal, $\gamma$ and $\alpha$ are strings of terminals and non-terminals (one in the source language and the other in the target language), and $\sim$ is a one-to-one correspondence between non-terminal occurrences in $\gamma$ and $\alpha$. The weight of each rule is determined by a log-linear model with feature weights $\lambda _{i}$:

$w(X\rightarrow \left\langle \gamma ,\alpha \right\rangle )=\prod _{i}\phi _{i}(X\rightarrow \left\langle \gamma ,\alpha \right\rangle )^{\lambda _{i}}$

where $\phi _{i}$ are features defined on rules, including noisy-channel model features, lexical weights that estimate how well the words in $\alpha$ translate the words in $\gamma$, and a phrase penalty that lets the model learn a preference for longer or shorter derivations. Additionally, the model uses two special "glue" rules, which allow it to build only partial translations with hierarchical phrases and then combine them serially:
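The two glue rules can be written as follows. This is a reconstruction in the notation used in this section, with $S$ assumed to be the start symbol and subscripts marking linked non-terminals:

```latex
\begin{align*}
S &\rightarrow \langle\, S_1\, X_2,\ S_1\, X_2 \,\rangle \\
S &\rightarrow \langle\, X_1,\ X_1 \,\rangle
\end{align*}
```

Neither rule reorders anything: the first appends one more hierarchical phrase to a left-to-right sequence, and the second starts the sequence.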

The weight of the first glue rule is $\exp(-\lambda _{g})$, which controls the preference for hierarchical phrases over serial combination of phrases; the weight of the second is always one. The following partial derivation of a synchronous CFG shows how the "glue" rules and the standard ones are combined together:
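The rule weights described in this section can be sketched numerically. The snippet below is a minimal illustration, not the paper's implementation: the function names are hypothetical, feature values are assumed to be positive, and the derivation weight is simplified to a plain product of rule weights, leaving out the language model and any other sentence-level terms.

```python
import math

def rule_weight(features, lambdas):
    """Log-linear rule weight: product over i of phi_i ** lambda_i.

    `features` holds the feature values phi_i of one rule and `lambdas`
    the matching feature weights (both hypothetical inputs).
    """
    return math.prod(phi ** lam for phi, lam in zip(features, lambdas))

def glue_weight(lambda_g):
    """Weight of the first glue rule, exp(-lambda_g)."""
    return math.exp(-lambda_g)

def derivation_weight(rule_weights):
    """Simplified derivation weight: product of the weights of the rules used."""
    return math.prod(rule_weights)

# A derivation using one ordinary rule, one application of the first
# glue rule, and the second glue rule (whose weight is always one).
w = derivation_weight([rule_weight([0.5, 0.25], [1.0, 0.5]), glue_weight(0.7), 1.0])
```

A larger $\lambda _{g}$ makes each serial combination more costly, so derivations that cover more of the sentence with hierarchical rules score comparatively better.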

Training

The training process starts with a word-aligned corpus and produces "initial phrase pairs" using conventional phrase-based methods (from Koehn et al. 2003 and Och and Ney 2004). Then, it forms all possible differences of phrase pairs, defining the set of rules to be extracted as the smallest set satisfying the following:

1. If $\left\langle f,e\right\rangle$ is an initial phrase pair, then $X\rightarrow \left\langle f,e\right\rangle$ is a rule.

2. If $r=X\rightarrow \left\langle \gamma ,\alpha \right\rangle$ is a rule and $\left\langle f,e\right\rangle$ is an initial phrase pair such that $\gamma =\gamma _{1}\,f\,\gamma _{2}$ and $\alpha =\alpha _{1}\,e\,\alpha _{2}$, then $X\rightarrow \left\langle \gamma _{1}\,X_{k}\,\gamma _{2},\,\alpha _{1}\,X_{k}\,\alpha _{2}\right\rangle$ is a rule, where $k$ is an index not used in $r$.
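The second step above can be sketched as a "subtraction" of one phrase pair from a rule. This is a simplified illustration under stated assumptions: `find_sub` and `subtract` are hypothetical helpers, and phrases are matched by token identity, whereas an actual extractor would work on aligned spans of a specific sentence pair.

```python
def find_sub(seq, sub):
    """Return every start index at which `sub` occurs contiguously in `seq`."""
    seq, sub = tuple(seq), tuple(sub)
    n, m = len(seq), len(sub)
    return [i for i in range(n - m + 1) if seq[i:i + m] == sub]

def subtract(rule, phrase_pair, k):
    """Apply step 2 once: replace f in gamma and e in alpha by the linked X_k.

    `rule` is (gamma, alpha) and `phrase_pair` is (f, e), all tuples of
    tokens. `k` must be an index not already used in the rule.
    """
    gamma, alpha = rule
    f, e = phrase_pair
    nt = f"X{k}"
    new_rules = []
    for i in find_sub(gamma, f):
        for j in find_sub(alpha, e):
            new_gamma = tuple(gamma[:i]) + (nt,) + tuple(gamma[i + len(f):])
            new_alpha = tuple(alpha[:j]) + (nt,) + tuple(alpha[j + len(e):])
            new_rules.append((new_gamma, new_alpha))
    return new_rules

# Example with a fragment of the sentence from the Motivation section.
rule = (("yu", "Bei", "Han", "you", "bangjiao"),
        ("have", "diplomatic", "relations", "with", "North", "Korea"))
pair = (("Bei", "Han"), ("North", "Korea"))
result = subtract(rule, pair, 1)
```

Applying `subtract` again to the result, with a second phrase pair and `k=2`, would yield a rule with two linked non-terminals.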

This procedure generates too many rules, which makes training and decoding very slow and creates spurious ambiguity. The grammar is therefore filtered according to principles chosen to balance grammar size against performance on a development set: keep only the smallest initial phrase pair among those containing the same set of alignment points; limit initial phrases to a length of 10 and rules to 5 symbols (terminals plus non-terminals) on the source-language right-hand side; discard rules with more than 2 non-terminals and rules with adjacent non-terminals on the source-language right-hand side; and keep only rules with at least one pair of aligned words.

Regarding the rule weights, since the training process extracts many rules from a single initial phrase pair, it assigns equal weight to each initial phrase pair and then distributes that weight equally among the rules extracted from it.
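The weighting scheme described above can be sketched as fractional counting. In this minimal illustration each initial phrase pair is assumed to carry a weight of one, and `distribute_counts` is a hypothetical helper name, not something taken from the paper.

```python
from collections import defaultdict

def distribute_counts(extractions):
    """Split a unit weight per initial phrase pair equally among its rules.

    `extractions` is a list with one entry per initial phrase pair; each
    entry is the list of rules extracted from that pair. Returns a dict
    mapping each rule to its accumulated fractional count.
    """
    counts = defaultdict(float)
    for rules in extractions:
        if not rules:
            continue  # nothing was extracted from this phrase pair
        share = 1.0 / len(rules)
        for rule in rules:
            counts[rule] += share
    return dict(counts)

# Two phrase pairs: the first yields rules "r1" and "r2", the second only "r1".
counts = distribute_counts([["r1", "r2"], ["r1"]])
```

Here "r1" accumulates half a count from the first phrase pair and a full count from the second, while "r2" receives only half a count.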

Decoding

The decoder uses a CKY parser with beam search, together with a postprocessor that maps source-language derivations to target-language derivations. Given a source-language sentence $f$, it finds the best derivation (or the $N$ best derivations) that generates $\left\langle f,e\right\rangle$ for some $e$.

The search space is pruned in two ways: an item whose score is worse than $\beta$ times the best score in the same cell is discarded, and an item that is worse than the $b$-th best item in the same cell is also discarded. The values of $\beta$ and $b$ are chosen to balance speed and performance on a development set.
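The two pruning criteria can be sketched for a single chart cell as follows. This is an illustration under assumptions that the summary does not state: scores are positive and higher is better (as with probabilities), $\beta$ lies between 0 and 1, and `prune_cell` is a hypothetical function name.

```python
def prune_cell(items, beta, b):
    """Prune one chart cell by relative threshold and by rank.

    `items` is a list of (label, score) pairs with positive scores where
    higher is better. An item is kept only if its score is at least
    `beta` times the best score in the cell and it is among the `b`
    best items. Ties at rank `b` are broken by sort order.
    """
    if not items:
        return []
    ranked = sorted(items, key=lambda item: item[1], reverse=True)
    best_score = ranked[0][1]
    within_threshold = [item for item in ranked if item[1] >= beta * best_score]
    return within_threshold[:b]

cell = [("a", 1.0), ("b", 0.5), ("c", 0.05), ("d", 0.4)]
kept = prune_cell(cell, beta=0.1, b=2)
```

In this example item "c" falls below the threshold of 0.1 times the best score, and item "d" is cut by the rank limit, so only "a" and "b" survive.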

Experimental results

The model was tested on Mandarin-to-English translation, using the FBIS corpus for the translation model, the 2002 NIST MT evaluation dataset as the development set and the 2003 dataset as the test set. Three systems were compared: the baseline system (Pharaoh, the state-of-the-art phrase-based system at the time), the hierarchical model, and an "enhanced" hierarchical model using a constituent feature; the following table shows the results:

The following figure shows a selection of extracted rules, with ranks after filtering for the development set:

Related papers

Enhanced versions of this model have been described in several papers, such as Watanabe et al. (EMNLP 2007), "Online Large-Margin Training for Statistical Machine Translation", and "A Discriminative Latent Variable Model for SMT".

@@ Line 60: / Line 60: @@
 == Experimental results ==
-The model was tested on Mandarin-to-English translation, using the [[UsesDataset::FBIS corpus]] for the translation model, the Xinhua portion of the [[UsesDataset::Gigaword corpus]] to build the English language model, the 2002 [[UsesDataset::NIST MT]] evaluation dataset as the development set and the dataset from 2003 as the test set. Three different systems were compared: the baseline system (Pharaoh, the current state-of-the-art phrase-based system), the hierarchical model and an "enhanced" hierarchical model using a constituent feature; the following table shows the results:
+The model was tested on Mandarin-to-English translation, using the [[UsesDataset::FBIS corpus]] for the translation model, the 2002 [[UsesDataset::NIST MT]] evaluation dataset as the development set and the 2003 test set as the test set. Three different systems were compared: the baseline system (Pharaoh, the current state-of-the-art phrase-based system), the hierarchical model and an "enhanced" hierarchical model using a constituent feature; the following table shows the results:
 [[File:f6.png]]
