Difference between revisions of "Machine Transliteration"

From Cohen Courses
 
* [[RelatedWork::Training Tree Transducers, J. Graehl, K. Knight, NAACL-HLT, 2004]] - This paper moves beyond finite-state transducers, exploiting a tree structure to allow for reordering
 
 
* [[RelatedWork::Graphical Models over Multiple Strings, M. Dreyer and J. Eisner, EMNLP 2009]] - This paper is not an MT paper, but it does involve using WFSTs as factors within a Markov Random Field. This could be useful as a model for translation.
 
* [[RelatedWork::Parameter Estimation for Probabilistic Finite-State Transducers, J. Eisner, ACL 2002]] - A more general WFST description

Revision as of 23:14, 31 October 2011

This paper is a work in progress by Francis Keith

Citation

"Machine Transliteration", K. Knight and J. Graehl, CL 1998

Online Version

An online version of the paper is available here [1]

Summary

This paper examines the use of finite-state transducers (FSTs) for transliteration in machine translation. Transliteration is the process of rendering proper names and technical terms in another language's writing system by approximating their sounds. Some language pairs make this easier than others. The paper specifically examines Japanese-English transliteration.

The Problem

Japanese uses a very different sound inventory and writing system from English. For proper names, this means the English name is converted into an approximation that fits Japanese pronunciation. For example, Japanese does not distinguish between 'L' and 'R', or between 'F' and 'H'. Because this conversion loses information, English-to-Japanese transliteration is relatively easy, while Japanese-to-English transliteration is significantly more difficult and less forgiving. The authors also deal with handwritten text, so OCR errors are introduced.

The Method

They divide the problem into five steps, each of which can be defined as a probabilistic model:

  • The English text is written. This is the probability of the English word sequence, or P(w)
  • The English text is pronounced (transliteration maps sounds rather than spellings). This is the probability of a pronunciation given the word, or P(e|w)
  • The pronunciation is changed to use Japanese sounds. This is the probability of a sequence of Japanese sounds given the English sounds, or P(j|e)
  • The sounds are converted to katakana (the script used for foreign or technical words). This is the probability of the katakana text given the sounds, or P(k|j)
  • The katakana is written. This is the probability of the observed handwritten phrase in katakana given the intended phrase, or P(o|k)

P(w) is given as a standard WFSA (weighted finite-state acceptor), while the other probabilities are given by WFSTs (weighted finite-state transducers). These models are then composed together to produce a large WFST for doing transliteration.
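
As a rough illustration of how such a cascade can be used for back-transliteration, the sketch below writes the models as P(w), P(e|w), P(j|e) and P(k|j) and stores each one as a plain lookup table rather than a real WFSA/WFST. The OCR stage is left out for brevity. All probabilities, the table names and the helper functions `chain` and `decode` are made up for this example and do not come from the paper; the brute-force enumeration stands in for transducer composition and only works for tiny tables.

```python
from collections import defaultdict

# Toy, invented probabilities; none of these numbers come from the paper.
P_w = {"golf": 0.6, "gulf": 0.4}            # P(w): prior over English words
P_e_given_w = {                             # P(e|w): word -> English sounds
    "golf": {"G AA L F": 1.0},
    "gulf": {"G AH L F": 1.0},
}
P_j_given_e = {                             # P(j|e): English -> Japanese sounds
    "G AA L F": {"g o r u h u": 0.8, "g a r u h u": 0.2},
    "G AH L F": {"g a r u h u": 0.7, "g o r u h u": 0.3},
}
P_k_given_j = {                             # P(k|j): Japanese sounds -> katakana
    "g o r u h u": {"ゴルフ": 1.0},
    "g a r u h u": {"ガルフ": 1.0},
}


def chain(prior, *stages):
    """Push the prior through each conditional table in turn, keeping the
    joint probability of (source word, output of the latest stage)."""
    joint = {(w, w): p for w, p in prior.items()}
    for stage in stages:
        nxt = defaultdict(float)
        for (w, x), p in joint.items():
            for y, q in stage.get(x, {}).items():
                nxt[(w, y)] += p * q        # sum over intermediate strings
        joint = dict(nxt)
    return joint


def decode(observed, prior, *stages):
    """Posterior over source words given the observed final-stage string."""
    joint = chain(prior, *stages)
    scores = {w: p for (w, y), p in joint.items() if y == observed}
    total = sum(scores.values())
    return {w: p / total for w, p in scores.items()} if total else {}


posterior = decode("ゴルフ", P_w, P_e_given_w, P_j_given_e, P_k_given_j)
best_word = max(posterior, key=posterior.get)
```

With these toy numbers the posterior favours "golf" over "gulf" (about 0.8 against 0.2), because the prior and the sound mapping both point the same way. A real system would replace the dictionaries with composed weighted transducers and search for the best path instead of enumerating every string.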

P(w) - English Word Sequences

They built this model from simple unigram word probabilities. They also had a separate model for personal names alone.
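
A unigram model of this kind can be sketched in a few lines. The function names `unigram_model` and `sequence_prob` and the tiny corpus are hypothetical, chosen only for illustration; the summary does not say what corpus or smoothing the authors used, so unseen words simply get probability zero here.

```python
from collections import Counter


def unigram_model(tokens):
    """Relative-frequency estimate of P(word) from a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def sequence_prob(model, words):
    """Probability of a word sequence under the unigram assumption:
    the product of the individual word probabilities."""
    prob = 1.0
    for word in words:
        prob *= model.get(word, 0.0)   # unseen words get zero (no smoothing)
    return prob


model = unigram_model("ice cream ice tea".split())
p = sequence_prob(model, ["ice", "cream"])   # 0.5 * 0.25
```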

P(e|w) - English Words to English Sounds

They built this WFST from the CMU pronunciation dictionary.
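
One way to turn a pronunciation dictionary into a word-to-sounds table is sketched below. It assumes the plain-text CMUdict layout (comment lines starting with `;;;`, one entry per line, alternate pronunciations marked as `WORD(1)`). The sample entries are illustrative, and both stripping the stress digits and splitting probability evenly across variants are assumptions of this sketch, not details taken from the paper.

```python
import re
from collections import defaultdict

# Illustrative entries in CMUdict-style layout (not a real excerpt).
SAMPLE = """\
;;; comment lines start with three semicolons
GOLF  G AA1 L F
GOLF(1)  G AO1 L F
BAG  B AE1 G
"""


def load_pronunciations(text):
    """Map each word to {pronunciation: probability}, uniform over variants."""
    variants = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue                                   # skip blanks and comments
        head, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", head)           # drop the "(1)" variant marker
        sounds = tuple(re.sub(r"\d", "", p) for p in phones)  # drop stress digits
        if sounds not in variants[word]:
            variants[word].append(sounds)
    return {w: {s: 1.0 / len(v) for s in v} for w, v in variants.items()}


table = load_pronunciations(SAMPLE)
```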

P(j|e) - English Sounds to Japanese Sounds

This WFST was built using a dictionary of 8,000 English-Japanese pronunciation pairs. They used EM to learn alignments from each English pronunciation symbol to one or more Japanese pronunciation symbols. They specifically excluded alignments in which an English symbol maps to no Japanese symbol, because allowing them increased computation time significantly and introduced more potentially harmful alignments.
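
The EM training described above can be sketched as follows, under the assumptions that alignments are monotonic and that every English symbol covers one or more contiguous Japanese symbols (never zero). The function names and the two training pairs are invented for the example, and the brute-force enumeration of alignments is only practical for very short sequences; it is not meant to reproduce the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations


def segmentations(j_seq, n):
    """All ways to split j_seq into n non-empty contiguous chunks."""
    for cuts in combinations(range(1, len(j_seq)), n - 1):
        bounds = (0, *cuts, len(j_seq))
        yield [tuple(j_seq[a:b]) for a, b in zip(bounds, bounds[1:])]


def train_em(pairs, iterations=5):
    """Estimate P(japanese chunk | english symbol) from unaligned pairs."""
    prob = defaultdict(lambda: 1.0)          # start: all alignments equally weighted
    for _ in range(iterations):
        counts = defaultdict(float)
        for e_seq, j_seq in pairs:
            aligns = list(segmentations(j_seq, len(e_seq)))
            weights = []
            for chunks in aligns:            # E-step: score each alignment
                w = 1.0
                for e, c in zip(e_seq, chunks):
                    w *= prob[(e, c)]
                weights.append(w)
            z = sum(weights)
            if z == 0:
                continue                     # no usable alignment for this pair
            for chunks, w in zip(aligns, weights):
                for e, c in zip(e_seq, chunks):
                    counts[(e, c)] += w / z  # fractional counts
        totals = defaultdict(float)
        for (e, _), v in counts.items():
            totals[e] += v
        # M-step: renormalise the counts per English symbol
        prob = defaultdict(float, {k: v / totals[k[0]] for k, v in counts.items()})
    return dict(prob)


# Invented toy data: English sound sequences paired with Japanese sound sequences.
pairs = [
    (["B", "AA", "L"], ["b", "o", "o", "r", "u"]),
    (["L"], ["r", "u"]),
]
probs = train_em(pairs)
best_for_L = max((p, c) for (e, c), p in probs.items() if e == "L")[1]
```

On this toy data the most probable mapping for "L" comes out as the chunk ("r", "u"), since the second pair pins that alignment down and EM carries the evidence over to the first pair.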

P(k|j) - Japanese Sounds to Katakana Words

This was a manually produced WFST, based on both corpus knowledge and knowledge from a Japanese textbook.

P(o|k) - Katakana Text to Handwritten Katakana

This was trained, again using EM, on 19,500 handwritten katakana characters paired with their output from an OCR system.

Experiments

They evaluate the composed model in two experiments. The first (for which they give no results) uses a set of 222 phrases that were missing from a bilingual dictionary. The second ignores the OCR component and uses only the personal-name WFSA. They produced katakana transliterations of the names of 100 U.S. politicians (i.e. the English-to-Japanese direction), and then tested their system on transliterating them back into English. They compared the system's output to that of 4 native English speakers (who were politically aware).

                                    Human   System
Correct                             27%     64%
Phonetically correct (misspelled)   7%      12%
Incorrect                           66%     24%

The system substantially outperformed the humans, producing the correct name 64% of the time against 27%. In addition, the authors surmised that improving the language model (P(w)) would fix many of the errors they were seeing.

Related Work

This was one of the earlier applications of WFSTs to machine translation.

Some other work: