Chiang 2005
Revision as of 21:47, 1 November 2011

Citation

Chiang, D. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proceedings of the 43rd Annual Meeting of the ACL, pp. 263–270, Ann Arbor. Association for Computational Linguistics.

Online version

Information Sciences Institute, University of Southern California

Summary

This paper presents a statistical phrase-based machine translation model that uses hierarchical phrases (phrases that contain subphrases). The model is formally syntax-based because it uses synchronous context-free grammars (synchronous CFGs), but not linguistically syntax-based, because the grammar is learned from a parallel text without using any linguistic annotations or assumptions. Using BLEU as a metric, the model is shown to outperform previous state-of-the-art phrase-based systems.

Motivation

The hierarchical model is motivated by the inability of conventional phrase-based models to learn reorderings of phrases (they only learn local reorderings of words). For example, consider the following Mandarin sentence:

Aozhou    shi yu   Bei   Han   you  bangjiao             de   shaoshu guojia    zhiyi
Australia is  with North Korea have diplomatic relations that few     countries one of

(Australia is one of the few countries that have diplomatic relations with North Korea)

the typical output of a conventional phrase-based system would be:

Australia is diplomatic relations with North Korea is one of the few countries

because it is able to do the local reorderings of "diplomatic ... Korea" and "one ... countries" but fails to perform the inversion of the two groups.

To solve this problem, the proposal is to use pairs of hierarchical phrases that consist of both words and subphrases. These pairs are formally defined as productions of a synchronous CFG. For the example above, the following productions are sufficient to translate the sentence correctly:

<math> X \rightarrow \left \langle \text{yu} \ X_1 \ \text{you} \ X_2 ,\ \text{have} \ X_2 \ \text{with} \ X_1 \right \rangle </math>

<math> X \rightarrow \left \langle X_1 \ \text{de} \ X_2 ,\ \text{the} \ X_2 \ \text{that} \ X_1 \right \rangle </math>

<math> X \rightarrow \left \langle X_1 \ \text{zhiyi} ,\ \text{one of} \ X_1 \right \rangle </math>
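A minimal sketch (not the paper's decoder) of how such productions work: each rule rewrites one nonterminal on the source and target sides simultaneously, so a reordering learned for a subphrase composes with the reorderings around it. The rule spellings follow the glosses in the text; the interpreter itself is illustrative.

```python
# Toy synchronous CFG rewriting: nonterminals are ("X", index) tuples, and
# an index ties together the source-side and target-side occurrence that
# must be rewritten jointly.

def substitute(side, idx, replacement):
    """Replace the nonterminal carrying index `idx` with `replacement` tokens."""
    out = []
    for tok in side:
        if tok == ("X", idx):
            out.extend(replacement)
        else:
            out.append(tok)
    return out

def apply_rule(pair, idx, rule):
    """Rewrite nonterminal `idx` synchronously on both sides of `pair`."""
    return (substitute(pair[0], idx, rule[0]),
            substitute(pair[1], idx, rule[1]))

# X -> <yu X4 you X5, have X5 with X4>  (note the inversion of X4 and X5)
r_yu = (["yu", ("X", 4), "you", ("X", 5)],
        ["have", ("X", 5), "with", ("X", 4)])
# X -> <X2 de X3, the X3 that X2>
r_de = ([("X", 2), "de", ("X", 3)],
        ["the", ("X", 3), "that", ("X", 2)])
# X -> <X1 zhiyi, one of X1>
r_zhiyi = ([("X", 1), "zhiyi"],
           ["one", "of", ("X", 1)])
# purely lexical phrase pairs
r_bh = (["Bei", "Han"], ["North", "Korea"])
r_bj = (["bangjiao"], ["diplomatic", "relations"])
r_sg = (["shaoshu", "guojia"], ["few", "countries"])

pair = ([("X", 1)], [("X", 1)])       # start from a single X
pair = apply_rule(pair, 1, r_zhiyi)
pair = apply_rule(pair, 1, r_de)
pair = apply_rule(pair, 2, r_yu)
pair = apply_rule(pair, 3, r_sg)
pair = apply_rule(pair, 4, r_bh)
pair = apply_rule(pair, 5, r_bj)

print(" ".join(pair[0]))  # yu Bei Han you bangjiao de shaoshu guojia zhiyi
print(" ".join(pair[1]))  # one of the few countries that have diplomatic relations with North Korea
```

Because the two sides are rewritten in lockstep, the inversion inside `r_yu` and the inversion inside `r_de` nest correctly, which is exactly the long-distance reordering the conventional phrase-based system missed.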

The synchronous CFG model

Based on the definition of synchronous CFGs, the basic elements of the model are weighted rewrite rules with aligned pairs of right-hand sides, of the form <math> X \rightarrow \left \langle \gamma , \alpha \right \rangle </math>, where <math>X</math> is a non-terminal and <math>\gamma</math> and <math>\alpha</math> are strings of terminals and non-terminals (one in the source language and the other in the target language). The weight of each rule is determined by a log-linear model:

<math> w \left ( X \rightarrow \left \langle \gamma , \alpha \right \rangle \right ) = \prod_i \phi_i \left ( X \rightarrow \left \langle \gamma , \alpha \right \rangle \right )^{\lambda_i} </math>

where the <math>\phi_i</math> are features defined on rules, including noisy-channel model features, lexical weights (which estimate how well the words in <math>\gamma</math> translate to the words in <math>\alpha</math>), and a phrase penalty that lets the model prefer longer or shorter derivations. Additionally, the model uses two special "glue" rules, which enable it to build only partial translations with hierarchical phrases and then combine them serially:

<math> S \rightarrow \left \langle S_1 \ X_2 ,\ S_1 \ X_2 \right \rangle </math>

<math> S \rightarrow \left \langle X_1 ,\ X_1 \right \rangle </math>

The weight of the first glue rule is <math> \exp ( - \lambda_g ) </math>, which controls the model's preference for hierarchical phrases over the serial combination of phrases; the weight of the second is always one. The following partial derivation of a synchronous CFG shows how the "glue" rules and the standard ones are combined together:

<math> \left \langle S_1 , S_1 \right \rangle \Rightarrow \left \langle S_2 \ X_3 ,\ S_2 \ X_3 \right \rangle \Rightarrow \left \langle S_4 \ X_5 \ X_3 ,\ S_4 \ X_5 \ X_3 \right \rangle \Rightarrow \left \langle X_6 \ X_5 \ X_3 ,\ X_6 \ X_5 \ X_3 \right \rangle \Rightarrow \left \langle \text{Aozhou} \ X_5 \ X_3 ,\ \text{Australia} \ X_5 \ X_3 \right \rangle \Rightarrow \left \langle \text{Aozhou shi} \ X_3 ,\ \text{Australia is} \ X_3 \right \rangle \Rightarrow \cdots </math>
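The combination can be sketched as follows: a derivation's score is the product of its rules' weights, the glue rules concatenate finished chunks left to right with no reordering, and every application of the first glue rule multiplies in a factor of exp(−λ_g). The chunk weights and the value of λ_g below are illustrative, not from the paper.

```python
import math

LAMBDA_G = 0.5  # illustrative glue-rule penalty weight

def glue(chunks):
    """Serially combine (source, target, weight) chunks with the glue rules."""
    src, tgt, weight = [], [], 1.0
    for i, (c_src, c_tgt, c_w) in enumerate(chunks):
        src += c_src                        # monotone concatenation: no reordering
        tgt += c_tgt
        weight *= c_w
        if i > 0:                           # each S -> <S X, S X> application
            weight *= math.exp(-LAMBDA_G)
    return src, tgt, weight                 # S -> <X, X> itself has weight one

chunks = [
    (["Aozhou"], ["Australia"], 0.9),
    (["shi"], ["is"], 0.8),
    # this chunk's internal reordering came from hierarchical rules;
    # 0.3 stands in for the weight of that whole sub-derivation
    (["yu", "Bei", "Han", "you", "bangjiao", "de", "shaoshu", "guojia", "zhiyi"],
     ["one", "of", "the", "few", "countries", "that", "have",
      "diplomatic", "relations", "with", "North", "Korea"],
     0.3),
]
src, tgt, w = glue(chunks)
print(" ".join(tgt))
```

Raising λ_g makes serial combination more expensive, pushing the decoder toward covering more of the sentence with a single hierarchical phrase.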

Training

The training process starts from a word-aligned corpus produced by standard methods and then uses heuristics to hypothesize a distribution over the possible derivations of each training example; the phrase translation parameters are then estimated from this distribution. The set of extracted rules is the smallest set satisfying the following:

1. Each aligned phrase pair <math> \left \langle f , e \right \rangle </math> learned by conventional methods is taken as an "initial phrase pair", and <math> X \rightarrow \left \langle f , e \right \rangle </math> is a rule.

2. If <math> X \rightarrow \left \langle \gamma , \alpha \right \rangle </math> is a rule and <math> \left \langle f , e \right \rangle </math> is an initial phrase pair such that <math> \gamma = \gamma_1 f \gamma_2 </math> and <math> \alpha = \alpha_1 e \alpha_2 </math>, then <math> X \rightarrow \left \langle \gamma_1 X_k \gamma_2 ,\ \alpha_1 X_k \alpha_2 \right \rangle </math> is a rule, where <math> k </math> is an index not already used in the rule.
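The second condition can be sketched over plain token lists (ignoring the word-alignment consistency checks real extraction performs): carving a smaller initial phrase pair out of a larger rule leaves a fresh co-indexed nonterminal on both sides. The phrase pairs used here follow the running example; the code is illustrative, not the paper's extractor.

```python
def find_sub(seq, sub):
    """Index of the first occurrence of list `sub` inside list `seq`, or -1."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1

def subtract(rule, initial):
    """Replace initial phrase pair `initial` inside `rule` with a fresh
    co-indexed nonterminal (the second extraction condition)."""
    (F, E), (f, e) = rule, initial
    i, j = find_sub(F, f), find_sub(E, e)
    if i < 0 or j < 0:
        return None
    k = 1 + sum(t.startswith("X") for t in F)  # toy convention for a fresh index
    X = "X%d" % k
    return (F[:i] + [X] + F[i + len(f):],
            E[:j] + [X] + E[j + len(e):])

# condition 1: an initial phrase pair is itself a rule
rule = (["yu", "Bei", "Han", "you", "bangjiao"],
        ["have", "diplomatic", "relations", "with", "North", "Korea"])
# condition 2, applied twice, yields a rule with two gaps
rule = subtract(rule, (["Bei", "Han"], ["North", "Korea"]))
rule = subtract(rule, (["bangjiao"], ["diplomatic", "relations"]))
print(rule)  # (['yu', 'X1', 'you', 'X2'], ['have', 'X2', 'with', 'X1'])
```

Note that the two subtractions recover exactly the reordering rule used in the motivation example, with X1 and X2 swapped between the source and target sides.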

Experimental results

Related papers