Speech Recognition by Composition of Weighted Finite Automata is a paper by Fernando Pereira and Michael Riley, available online.
Citation
Pereira and Riley. Speech Recognition by Composition of Weighted Finite Automata. In Finite-State Language Processing, pages 431-453, 1997.
Summary
They created a speech recognition system that uses four finite-state automata (FSAs) to model the entire pipeline from sound to phonemes to words. Since each stage was fairly well understood, they wanted to use stage-specific knowledge to improve the efficacy of the entire process. Because the stages are transducers, the system has nice composability properties, which let them organize the computation differently depending on how big the system was.
Work
They began by creating a speech recognition system from the composition of three finite-state automata: a transducer A that converts acoustic signals to phone sequences, a transducer D that converts phone sequences to word sequences, and a finite-state acceptor M that encodes a 5-gram language model.
The biggest advantage of organizing the system this way is that general properties of FSAs apply to the resulting computation. The calculation they want to perform is (A o D o M). While the obvious approach is to compose everything into one machine and then evaluate probabilities on it, the computation can be organized differently: if (A o D o M) is too big for the computer's memory, one can instead compute A, then (A o D), then (A o D o M), pruning along the way.
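To make the composition concrete, here is a minimal sketch of weighted transducer composition over the tropical semiring, where weights act like negative log probabilities and add along a path. The class and function names are our own for illustration; the paper's actual implementation also handles epsilon transitions and lazy, on-demand expansion, both omitted here.

```python
from collections import namedtuple

# One transition: input symbol, output symbol, weight, destination state.
Arc = namedtuple("Arc", "isym osym weight dst")

class WFST:
    """Weighted finite-state transducer over the tropical semiring."""
    def __init__(self, start, finals):
        self.start = start            # start state
        self.finals = dict(finals)    # state -> final weight
        self.arcs = {}                # state -> list of outgoing Arcs

    def add_arc(self, src, isym, osym, weight, dst):
        self.arcs.setdefault(src, []).append(Arc(isym, osym, weight, dst))

def compose(t1, t2):
    """Compose two epsilon-free weighted transducers.

    States of the result are pairs (q1, q2); an arc exists when the
    output symbol of a t1 arc matches the input symbol of a t2 arc,
    and the weights add (tropical semiring).
    """
    start = (t1.start, t2.start)
    result = WFST(start, {})
    stack, seen = [start], {start}
    while stack:
        q = stack.pop()
        q1, q2 = q
        # A pair state is final when both component states are final.
        if q1 in t1.finals and q2 in t2.finals:
            result.finals[q] = t1.finals[q1] + t2.finals[q2]
        for a1 in t1.arcs.get(q1, []):
            for a2 in t2.arcs.get(q2, []):
                if a1.osym == a2.isym:  # middle symbol must match
                    dst = (a1.dst, a2.dst)
                    result.add_arc(q, a1.isym, a2.osym,
                                   a1.weight + a2.weight, dst)
                    if dst not in seen:
                        seen.add(dst)
                        stack.append(dst)
    return result

# Staged evaluation as described above: compose pairwise, and in a real
# system prune low-weight states between stages to bound memory.
# full = compose(compose(A, D), M)
```

Note that only states reachable from the paired start state are ever built, which is what makes the staged, pruned organization pay off in practice.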
One of the immediate problems was that their model A was context independent, meaning it assigned an acoustic signal to a phone with the same probability regardless of the surrounding phones. Context turns out to be a very important consideration. Their solution was to make A a triphone model: one that takes an acoustic signal and outputs a phone in the context of its left and right neighbors. A problem then arose with D: D needs plain phones as input, so what if A's output disagreed? For this, they inserted a new transducer, C, which converts the context-dependent output of A into the context-independent phones that D expects. Using a transducer here is a clean solution that fits right into the existing system. It also avoids the pitfalls of other systems that bridged this gap with substitution FSAs, which are much less powerful.
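As a toy illustration of C's role, reusing the WFST class and compose() from the sketch above, the snippet below rewrites hypothetical context-dependent labels such as "ae/k_t" (phone ae with left context k and right context t) into the plain phones that D expects. The labels are made up, and a single state cannot enforce that neighboring contexts actually agree; the paper's C uses its states to check exactly that.

```python
# Hypothetical context-dependency transducer C: maps made-up
# context-dependent labels to context-independent phones. One state
# keeps the sketch short; a real C needs more states to verify that
# adjacent contexts are consistent with each other.
C = WFST(start=0, finals={0: 0.0})
for cd_label, phone in [("k/#_ae", "k"), ("ae/k_t", "ae"), ("t/ae_#", "t")]:
    C.add_arc(0, cd_label, phone, 0.0, 0)

# With C in place the pipeline becomes (A o C o D o M):
# full = compose(compose(compose(A, C), D), M)
```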
Results
The results weren't quantitatively compared to the state of the art at the time. Since their systems were bigger than the contemporary state of the art, it was impressive that they ran at all on the machines they had. This was the main advantage of using finite-state automata: existing problems compose easily, and (depending on the size of the problem) the system can be evaluated in stages or all in one go.
Related Work
Most of the ideas in this paper had previously been explored by Bahl, Jelinek, et al., 1983. The key insight of this paper was the use of transducers to reduce the size of the resulting FSA (the so-called "transduction cascade approach").