Machine Transliteration

This [[Category::Paper|paper]] is a work in progress by [[User:Fkeith|Francis Keith]]
 
 
 
== Citation ==

"Machine Transliteration", K. Knight and J. Graehl, CL 1998
 
== Online Version ==

An online version of the paper is available [http://aclweb.org/anthology-new/J/J98/J98-4003.pdf here].
 
== Summary ==

This paper examines using FSTs to solve the problem of transliteration in machine translation. Transliteration is the process of translating proper names and technical terms into another writing system. In some cases this is easier than in others; the paper specifically examines Japanese-English transliteration.
 
== The Problem ==

Japanese employs a very different phonetic alphabet from English, so transliterating a proper name means converting the English name into a Japanese-style pronunciation. For example, Japanese does not differentiate between 'L' and 'R', or between 'F' and 'H'. While this mapping is relatively easy in the English-to-Japanese direction, Japanese-to-English transliteration is significantly more difficult and less forgiving. The authors also deal with handwritten text, so [[AddressesProblem::OCR|OCR]] errors are introduced.
  
 
== The Method ==

They divide the problem into five steps, each of which can be defined as a probabilistic model:

* The English text is written. This is modeled by <math>P(w)</math>.
* The English text is pronounced (transliteration operates on pronunciations). This is the probability of a pronunciation given the word, or <math>P(e|w)</math>.
* The pronunciation is converted into Japanese pronunciation sounds. This is the probability of a Japanese sound sequence given the English one, or <math>P(j|e)</math>.
* The sounds are converted to katakana (the alphabet used for foreign or technical words). This is the probability of the katakana text given the sounds, or <math>P(k|j)</math>.
* The katakana is written by hand. This is the probability of an observed handwritten katakana phrase given the intended phrase, or <math>P(o|k)</math>.

<math>P(w)</math> is given as a standard WFSA, while the other distributions are given by WFSTs. These models are then composed together to produce one large WFST for transliteration: given an observed string <math>o</math>, the system searches for the English word sequence <math>w</math> that maximizes the product <math>P(w)P(e|w)P(j|e)P(k|j)P(o|k)</math>, as illustrated in the sketch below.
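Below is a minimal sketch of this factorization, assuming invented toy tables in place of the real WFSA/WFSTs; actual decoding composes the automata and runs a shortest-path search rather than the brute-force enumeration shown here. Every word, sound, and probability in this example is made up for illustration.

<pre>
# Toy stand-ins for the five distributions; all entries are invented.
prior = {"knight": 0.6, "night": 0.4}                      # P(w), unigram prior
p_e_given_w = {"knight": {"N AY T": 1.0},                  # P(e|w), word -> English sounds
               "night":  {"N AY T": 1.0}}
p_j_given_e = {"N AY T": {"n a i t o": 1.0}}               # P(j|e), English -> Japanese sounds
p_k_given_j = {"n a i t o": {"ナイト": 1.0}}               # P(k|j), sounds -> katakana
p_o_given_k = {"ナイト": {"ナイト": 0.95, "ナイド": 0.05}}  # P(o|k), OCR noise

def decode(observed):
    """Return the English word w maximizing P(w) P(e|w) P(j|e) P(k|j) P(o|k)."""
    best = None
    for w, pw in prior.items():
        for e, pe in p_e_given_w[w].items():
            for j, pj in p_j_given_e.get(e, {}).items():
                for k, pk in p_k_given_j.get(j, {}).items():
                    score = pw * pe * pj * pk * p_o_given_k.get(k, {}).get(observed, 0.0)
                    if score and (best is None or score > best[1]):
                        best = (w, score)
    return best

print(decode("ナイト"))  # ('knight', ~0.57): the prior P(w) breaks the pronunciation tie
</pre>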
=== <math>P(w)</math> - English Word Sequences ===

They built this model from simple unigram probabilities. They also built a separate model for personal names alone.
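As a sketch of the kind of estimate this involves (the paper's actual corpora and any smoothing are not shown), a unigram model is just normalized counts; the name list below is invented:

<pre>
from collections import Counter

names = ["clinton", "gore", "clinton", "dole"]   # invented stand-in corpus
counts = Counter(names)
total = sum(counts.values())
p_w = {w: c / total for w, c in counts.items()}  # P(w) = count(w) / N

print(p_w["clinton"])  # 0.5
</pre>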
=== <math>P(e|w)</math> - English Words to English Sounds ===

They built this WFST from the CMU Pronouncing Dictionary.
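For illustration, here is one way to read the CMU dictionary's plain-text format into a word-to-pronunciations table, a flat stand-in for the paper's WFST; the file name is hypothetical, and lines beginning with ";;;" are comments in the dictionary:

<pre>
def load_cmudict(path="cmudict.dict"):          # hypothetical local file name
    """Map each word to its list of phoneme sequences."""
    pron = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            if not line.strip() or line.startswith(";;;"):  # skip blanks/comments
                continue
            word, *phones = line.split()
            word = word.split("(")[0].lower()   # fold variant entries like WORD(2)
            pron.setdefault(word, []).append(phones)
    return pron

# e.g. load_cmudict()["knight"] -> [['N', 'AY1', 'T']]
</pre>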
=== <math>P(j|e)</math> - English Sounds to Japanese Sounds ===

This WFST was built using a dictionary of 8,000 English-Japanese pronunciation pairs. They used [[UsesMethod::Expectation Maximization|EM]] to train alignments from each English pronunciation symbol to one or more Japanese pronunciation symbols. They specifically disallowed alignments in which an English symbol aligned to no Japanese symbols, because allowing them increased computation time significantly and introduced more potentially harmful alignments.
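As a simplified illustration of the EM loop (an IBM-Model-1-style symbol-translation model, not the authors' exact alignment model, which maps each English sound to one or more Japanese sounds), with invented pronunciation pairs:

<pre>
from collections import defaultdict

# Invented (English sounds, Japanese sounds) pronunciation pairs.
pairs = [("N AY T".split(), "n a i t o".split()),
         ("G OW".split(),   "g o o".split())]

t = defaultdict(lambda: 1.0)                 # t(j|e), effectively uniform start

for _ in range(10):                          # EM iterations
    counts = defaultdict(float)
    totals = defaultdict(float)
    for e_seq, j_seq in pairs:
        for j in j_seq:                      # E-step: fractional counts
            norm = sum(t[(j, e)] for e in e_seq)
            for e in e_seq:
                c = t[(j, e)] / norm
                counts[(j, e)] += c
                totals[e] += c
    for (j, e), c in counts.items():         # M-step: renormalize
        t[(j, e)] = c / totals[e]

print({j: round(t[(j, "T")], 2) for j in "n a i t o".split()})  # learned t(j|"T")
</pre>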
=== <math>P(k|j)</math> - Japanese Sounds to Katakana Words ===

This was a manually produced WFST, based on both corpus knowledge and knowledge from a Japanese textbook.
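The hand-built rules map one or more Japanese sounds to each katakana symbol; a toy greedy longest-match version, with a made-up rule table, might look like:

<pre>
# Invented rule table; the real transducer encodes far more (and contextual) rules.
sound_to_kana = {"n a": "ナ", "i": "イ", "t o": "ト", "g o": "ゴ", "o": "オ"}

def to_katakana(sounds):
    """Greedy longest-match conversion of a sound sequence to katakana."""
    syms, out, i = sounds.split(), "", 0
    while i < len(syms):
        for span in (2, 1):                       # prefer two-sound rules
            chunk = " ".join(syms[i:i + span])
            if chunk in sound_to_kana:
                out += sound_to_kana[chunk]
                i += span
                break
        else:
            raise ValueError("no rule for " + syms[i])
    return out

print(to_katakana("n a i t o"))  # ナイト
</pre>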
=== <math>P(o|k)</math> - Katakana Text to Handwritten Katakana ===

This was trained, again using [[UsesMethod::Expectation Maximization|EM]], on 19,500 instances of handwritten katakana characters paired with their output from an OCR system.
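As a sketch of what such a channel model can look like, here is a per-character confusion table estimated from (intended, OCR output) pairs; the pairs below are invented, and a real model would also handle insertions and deletions:

<pre>
from collections import Counter, defaultdict

# Invented (intended katakana, OCR output) training pairs.
pairs = [("ナイト", "ナイト"), ("ナイト", "ナイド"), ("ゴア", "ゴア")]

conf = defaultdict(Counter)
for truth, ocr in pairs:
    for k_ch, o_ch in zip(truth, ocr):     # assumes a same-length alignment
        conf[k_ch][o_ch] += 1

p_o_given_k = {k: {o: c / sum(cs.values()) for o, c in cs.items()}
               for k, cs in conf.items()}
print(p_o_given_k["ト"])  # {'ト': 0.5, 'ド': 0.5}
</pre>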
== Experiments ==

They ran the composed model in two experiments. The first (for which they do not report results) used a set of 222 phrases that were missing from a bilingual dictionary. The second task ignored the OCR stage and used only the personal-name <math>P(w)</math> WFSA: they produced the katakana transliterations of the names of 100 U.S. politicians (the English-to-Japanese direction) and then tested the system on transliterating them back to English. They compared the system against four politically aware native English speakers.
{| class="wikitable" border="1"
|-
!
! Human
! System
|-
! Correct
| 27%
| 64%
|-
! Phonetically correct (misspelled)
| 7%
| 12%
|-
! Incorrect
| 66%
| 24%
|}
The system vastly outperformed the humans. The authors also surmised that improving the language model <math>P(w)</math> would fix many of the errors they were seeing.
== Related Work ==

This was one of the earlier applications of WFSTs to machine translation.

Some other work:

* [[RelatedPaper::Training Tree Transducers, J. Graehl, K. Knight, NAACL-HLT, 2004]] - This paper goes beyond flat finite-state transducers, exploiting a tree structure to allow for reordering.
* [[RelatedPaper::Graphical Models over Multiple Strings, M. Dreyer and J. Eisner, EMNLP 2009]] - Not an MT paper, but it uses WFSTs as factors within a Markov random field, which could be useful as a model for translation.
* [[RelatedPaper::Parameter Estimation for Probabilistic Finite-State Transducers, J. Eisner, ACL 2002]] - A more general treatment of WFST parameter estimation.
