Machine Transliteration

This summary is a work in progress by Francis Keith

Citation

"Machine Transliteration", K. Knight and J. Graehl, CL 1998

Online Version

An online version of the paper is available here [1]

Summary

This paper examines the use of finite-state transducers (FSTs) to solve the transliteration problem in machine translation. Transliteration is the translation of proper names and technical terms based on how the words sound rather than what they mean. The difficulty varies with the language pair and the direction of translation; the paper specifically examines Japanese-English transliteration.

The Problem

Japanese uses a sound inventory and writing system very different from English. Proper names are borrowed by approximating their pronunciation with Japanese sounds: Japanese does not distinguish 'L' from 'R' or 'F' from 'H', so, for example, "golf bag" surfaces as "goruhubaggu". Because this mapping loses information, English-to-Japanese transliteration is comparatively easy, while the reverse direction, recovering the original English from the katakana, is significantly harder and less forgiving. The authors also deal with hand-written text, which introduces OCR errors.

The Method

They divide the problem into five steps, each of which can be defined as a probabilistic model:

  • An English word sequence w is written. This is modeled by the probability of the word sequence itself, P(w).
  • The English text is pronounced (transliteration is a pronunciation mapping). This is the probability of a pronunciation given the word, P(e|w).
  • The pronunciation is converted into Japanese pronunciation sounds. This is the probability of a sequence of Japanese sounds given the English sounds, P(j|e).
  • The sounds are converted to katakana (the alphabet used for foreign or technical words). This is the probability of the katakana text given the sounds, P(k|j).
  • The katakana is written by hand. This is the probability of the handwritten (OCR-observed) katakana phrase given the proper phrase, P(o|k).

P(w) is given as a standard WFSA, while the other probabilities are given by WFSTs. These models are then composed together to produce a large WFST for doing transliteration: given an observed string o, the system searches for the English sequence that maximizes the product P(w) · P(e|w) · P(j|e) · P(k|j) · P(o|k) over the intermediate sequences.
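
To make the cascade concrete, here is a minimal Python sketch of how the five probabilities multiply to score a single derivation. All tables, sound sequences, and numbers in it are invented for illustration; in the actual system each distribution is a weighted machine, and decoding composes the machines and runs a best-path search rather than scoring one fixed derivation.

  # Toy sketch of scoring one derivation through the five-model cascade.
  # Every table entry and number below is invented for illustration.
  P_w         = {"golf": 0.001}                   # P(w): language model
  P_e_given_w = {("G AA L F", "golf"): 0.95}      # P(e|w): word -> phonemes
  P_j_given_e = {("go ru hu", "G AA L F"): 0.12}  # P(j|e): English -> Japanese sounds
  P_k_given_j = {("ゴルフ", "go ru hu"): 0.80}     # P(k|j): sounds -> katakana
  P_o_given_k = {("ゴルフ", "ゴルフ"): 0.90}        # P(o|k): OCR noise

  def score(w, e, j, k, o):
      """Joint probability P(w, e, j, k, o) of one derivation."""
      return (P_w.get(w, 0.0)
              * P_e_given_w.get((e, w), 0.0)
              * P_j_given_e.get((j, e), 0.0)
              * P_k_given_j.get((k, j), 0.0)
              * P_o_given_k.get((o, k), 0.0))

  print(score("golf", "G AA L F", "go ru hu", "ゴルフ", "ゴルフ"))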

- English Word Sequences

They built this model from simple unigram word probabilities, estimated from corpus frequencies. They also built a separate model restricted to personal names.
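
A unigram model of this kind simply turns corpus counts into probabilities. A minimal sketch, using an invented toy corpus rather than the paper's data:

  from collections import Counter

  # Hypothetical token list standing in for a large English corpus; the
  # separate personal-name model would be built the same way from a
  # names-only corpus.
  tokens = ["smith", "john", "smith", "golf", "angela", "john"]

  counts = Counter(tokens)
  total = sum(counts.values())
  P_w = {w: c / total for w, c in counts.items()}

  print(P_w["smith"])  # 2/6 ≈ 0.333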

- English Words to English Sounds

They built this WFST from the CMU pronunciation dictionary.
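
The shape of that dictionary is easy to mimic. Here is a sketch that parses CMU-dictionary-style entries into a word-to-pronunciations table; the two sample lines follow the real cmudict format, and turning the table into WFST arcs is left implicit:

  # Load CMU-dictionary-style lines ("WORD  PH1 PH2 ...") into a
  # word -> list-of-phoneme-sequences table. The paper encodes this
  # table as a WFST with one path per pronunciation.
  raw = """\
  GOLF  G AA1 L F
  JOHNSON  JH AA1 N S AH0 N
  """

  pron = {}
  for line in raw.splitlines():
      word, *phones = line.split()
      # Strip stress digits so 'AA1' becomes 'AA'.
      phones = ["".join(ch for ch in p if not ch.isdigit()) for p in phones]
      pron.setdefault(word.lower(), []).append(phones)

  print(pron["golf"])  # [['G', 'AA', 'L', 'F']]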

- English Sounds to Japanese Sounds

This WFST was built using a dictionary of 8,000 English-Japanese pronunciation pairs. They used EM to train the alignments from each English pronunciation symbol to one or more Japanese pronunciation symbols. They specifically disallowed alignments in which an English symbol mapped to no Japanese symbols, because allowing them significantly increased computation time and introduced more potentially harmful alignments.
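
The alignment training can be illustrated with a stripped-down EM loop. The sketch below uses an IBM-Model-1-style E-step over invented phoneme pairs; it ignores the paper's monotonicity and one-to-many constraints, so it shows the flavor of the estimation rather than the actual algorithm:

  from collections import defaultdict

  # Toy (English phonemes, Japanese sounds) pronunciation pairs.
  pairs = [
      (["L", "IY"],      ["r", "i"]),
      (["L", "AA", "K"], ["r", "o", "k", "k", "u"]),
  ]

  # t[(j, e)] ~ P(japanese symbol j | english symbol e), uniform init.
  t = defaultdict(lambda: 1.0)

  for _ in range(10):                     # EM iterations
      counts = defaultdict(float)
      totals = defaultdict(float)
      for es, js in pairs:                # E-step: expected counts
          for j in js:
              z = sum(t[(j, e)] for e in es)
              for e in es:
                  p = t[(j, e)] / z
                  counts[(j, e)] += p
                  totals[e] += p
      for (j, e), c in counts.items():    # M-step: renormalize
          t[(j, e)] = c / totals[e]

  # Probability of Japanese 'r' given English 'L' after training.
  print(round(t[("r", "L")], 3))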

- Japanese Sounds to Katakana Words

This was a manually constructed WFST, based on both corpus analysis and knowledge from a Japanese textbook.
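
In spirit, this step is a weighted lookup from Japanese sounds to katakana glyphs. Here is a toy, unweighted version with an invented sound inventory; the real WFST carries weights and handles ambiguities such as long vowels and doubled consonants:

  # Hand-written toy mapping from Japanese sounds to katakana.
  sound_to_kana = {
      "go": "ゴ",
      "ru": "ル",
      "hu": "フ",
      "ba": "バ",
      "ggu": "ッグ",
  }

  def to_katakana(sounds):
      """Concatenate the kana for each sound in the sequence."""
      return "".join(sound_to_kana[s] for s in sounds)

  print(to_katakana(["go", "ru", "hu", "ba", "ggu"]))  # ゴルフバッグ ("golf bag")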

- Katakana Text to Handwritten Katakana

This was trained, again using EM, on 19,500 handwritten katakana characters paired with their output from an OCR system.
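
When character alignments are known one-to-one, estimating an OCR confusion model reduces to counting (EM with no hidden structure); the paper's setting is harder because the alignments are hidden. A sketch with invented pairs, using the visually confusable katakana ソ/ン and シ/ツ:

  from collections import Counter, defaultdict

  # (true character, OCR-observed character) pairs -- invented examples.
  aligned = [("ソ", "ン"), ("ン", "ン"), ("シ", "ツ"), ("ツ", "ツ")]

  counts = defaultdict(Counter)
  for true_ch, obs_ch in aligned:
      counts[true_ch][obs_ch] += 1

  # Normalize counts into P(observed | true).
  P_obs_given_true = {
      true_ch: {obs: c / sum(dist.values()) for obs, c in dist.items()}
      for true_ch, dist in counts.items()
  }

  print(P_obs_given_true["ソ"])  # {'ン': 1.0} -- ソ misread as ン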

Experiments

They ran the composed model in two experiments. The first (for which they do not report results) used a set of 222 phrases that were missing from a bilingual dictionary. The second task ignored the OCR stage and used only the personal-name WFSA: they produced the Japanese transliterations of the names of 100 U.S. politicians (the easy English-to-Japanese direction), then tested their system on back-transliterating them into English. They compared it against four native English speakers who were politically aware.

                                      Human   System
  Correct                               27%      64%
  Phonetically correct (misspelled)      7%      12%
  Incorrect                             66%      24%

The system vastly outperformed the humans. In addition, the authors surmised that improving the language model P(w) would fix many of the errors they were seeing.

Related Work

This was one of the earlier applications of WFSTs to machine translation.

Some other work: