Difference between revisions of "Lehnen et al., ICASSP 2011. Incorporating Alignments into Conditional Random Fields for Grapheme to Phoneme Conversion"

Latest revision as of 22:39, 30 September 2011

Citation

Patrick Lehnen, Stefan Hahn, Andreas Guta and Hermann Ney. 2011. Incorporating Alignments into Conditional Random Fields for Grapheme to Phoneme Conversion. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-2011.

Online Version

Incorporating Alignments into Conditional Random Fields for Grapheme to Phoneme Conversion

Summary

The authors present a novel approach to grapheme-to-phoneme (g2p) conversion using conditional random fields (CRFs). They argue that alignments between graphemes and phonemes are crucial in g2p conversion but are usually supplied by external models. They therefore introduce an approach in which the alignment generation step is integrated efficiently into the CRF training process. This is done in two ways: one considers a linear segmentation, and the other incorporates all possible alignments, subject to some constraints, into the CRF model. Beyond the standard CRF training process, the authors also introduce the alignment as a hidden variable in the model.

Method

A conditional random field is modeled as:

$p(t_{1}^{N}|s_{1}^{N})={\frac {\exp H(t_{1}^{N},s_{1}^{N})}{\sum _{{\tilde {t}}_{1}^{N}}\exp H({{\tilde {t}}_{1}^{N}},s_{1}^{N})}}$

${\text{where, }}H(t_{1}^{N},s_{1}^{N})=\left(\sum _{n=1}^{N}\sum _{l=1}^{L}\lambda _{l}h_{l}(t_{n-1},t_{n},s_{1}^{N})\right)$
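The definition above can be made concrete with a brute-force sketch that enumerates every tag sequence to form the denominator. This is only an illustration for toy sizes, not the authors' implementation: the start symbol `<s>`, the two indicator features, their weights, and the tag set are all hypothetical, and the feature functions receive the position `n` explicitly so they can inspect the current grapheme.

```python
import itertools
import math

def score(tags, source, weights, features):
    """H(t, s): sum over positions n and features l of lambda_l * h_l(t_{n-1}, t_n, s, n)."""
    total = 0.0
    prev = "<s>"  # hypothetical start symbol standing in for t_0
    for n, tag in enumerate(tags):
        for lam, h in zip(weights, features):
            total += lam * h(prev, tag, source, n)
        prev = tag
    return total

def crf_probability(tags, source, tagset, weights, features):
    """p(t | s) by explicit enumeration of all tag sequences (exponential cost)."""
    numerator = math.exp(score(tags, source, weights, features))
    denominator = sum(
        math.exp(score(cand, source, weights, features))
        for cand in itertools.product(tagset, repeat=len(source))
    )
    return numerator / denominator

# Hypothetical features: one emission-style and one transition-style indicator.
features = [
    lambda prev, tag, s, n: 1.0 if (s[n], tag) == ("a", "AE") else 0.0,
    lambda prev, tag, s, n: 1.0 if (prev, tag) == ("K", "AE") else 0.0,
]
weights = [1.5, 0.5]
tagset = ["K", "AE", "T"]

# Distribution over all 27 tag sequences for the toy input "cat".
probs = {
    cand: crf_probability(cand, "cat", tagset, weights, features)
    for cand in itertools.product(tagset, repeat=3)
}
```

In practice the denominator is computed with dynamic programming rather than enumeration; the enumeration here only mirrors the formula term by term.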

Alignments

The authors add the alignment to the CRF by modeling it as a hidden variable $a_{1}^{M}$:

$p(T_{1}^{M}|s_{1}^{N})=\sum _{a_{1}^{M}}p(T_{1}^{M},a_{1}^{M}|s_{1}^{N})$

They map the tuple $(T_{1}^{M},a_{1}^{M})$ to a tag sequence $t_{1}^{N}$ by a projection using the BIO labeling scheme, which restricts the alignment to a monotonic 1-to-1 or many-to-one scheme.
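One way to picture the BIO projection is the sketch below. It assumes, as an illustration only, that the alignment is stored as one segment length per phoneme (the number of consecutive graphemes aligned to it), that the first grapheme of a segment gets a `B-` tag and the remaining ones get `I-` tags, and that tags are spelled `B-x` / `I-x`. The page does not spell out the authors' exact encoding, so these choices are assumptions.

```python
def project_to_bio(phonemes, seg_lengths):
    """Map (phoneme sequence, alignment) to one BIO tag per grapheme.

    seg_lengths[m] is the number of consecutive graphemes aligned to
    phonemes[m]; every length must be >= 1 (monotonic, many-to-one).
    """
    if len(phonemes) != len(seg_lengths):
        raise ValueError("need exactly one segment length per phoneme")
    if any(k < 1 for k in seg_lengths):
        raise ValueError("every phoneme must cover at least one grapheme")
    tags = []
    for ph, k in zip(phonemes, seg_lengths):
        tags.append("B-" + ph)              # first grapheme of the segment
        tags.extend(["I-" + ph] * (k - 1))  # remaining graphemes of the segment
    return tags

def recover(tags):
    """Invert the projection: read phonemes and segment lengths off the tags."""
    phonemes, seg_lengths = [], []
    for tag in tags:
        prefix, ph = tag.split("-", 1)
        if prefix == "B":
            phonemes.append(ph)
            seg_lengths.append(1)
        elif seg_lengths:
            seg_lengths[-1] += 1
        else:
            raise ValueError("tag sequence must start with a B- tag")
    return phonemes, seg_lengths

# Toy example: "phone" with ph -> f, o -> o, ne -> n (phoneme symbols are illustrative).
tags = project_to_bio(["f", "o", "n"], [2, 1, 2])
```

Because the projection is invertible, a CRF that predicts one tag per grapheme implicitly predicts both the phoneme sequence and its alignment.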

Training

The CRF model incorporating the alignment as a hidden variable can be trained in two ways:

Maximization approach
Summation approach

Maximization Approach

This approach starts from a linear segmentation and trains the CRF using an Expectation-Maximization-like algorithm. The maximization step trains the CRF on the tag sequence induced by the current alignment:

$p(t_{1}^{N}|s_{1}^{N})|_{t_{1}^{N}=t_{1}^{N}(T_{1}^{M},a_{1}^{M})}={\frac {\exp H(t_{1}^{N},s_{1}^{N})}{\sum _{{\tilde {t}}_{1}^{N}}\exp H({{\tilde {t}}_{1}^{N}},s_{1}^{N})}}$

The expectation step re-estimates the alignment as the most probable one under the current model:

${\hat {a}}_{1}^{M}={\underset {a_{1}^{M}}{\operatorname {argmax} }}\left\{p(t_{1}^{N}(T_{1}^{M},a_{1}^{M})|s_{1}^{N})\right\}$

Training alternates between CRF training and resegmentation until convergence.
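The training/resegmentation loop can be sketched as follows. This is a minimal outline under stated assumptions, not the authors' code: `train_fn` is a hypothetical callback that trains a CRF on the current segmentations and returns a scoring function, alignments are stored as one segment length per phoneme, "linear segmentation" is taken to mean spreading the graphemes as evenly as possible over the phonemes, and the argmax is found by enumeration rather than by a Viterbi-style search.

```python
import itertools

def linear_segmentation(num_graphemes, num_phonemes):
    """Spread the graphemes as evenly as possible over the phonemes."""
    if not 1 <= num_phonemes <= num_graphemes:
        raise ValueError("need 1 <= num_phonemes <= num_graphemes")
    base, extra = divmod(num_graphemes, num_phonemes)
    return [base + 1 if m < extra else base for m in range(num_phonemes)]

def all_segmentations(n, m):
    """All ways to split n graphemes into m non-empty monotonic segments."""
    for cuts in itertools.combinations(range(1, n), m - 1):
        bounds = (0,) + cuts + (n,)
        yield [bounds[i + 1] - bounds[i] for i in range(m)]

def resegment(graphemes, phonemes, score_fn):
    """Pick the best-scoring alignment under the current model."""
    return max(
        all_segmentations(len(graphemes), len(phonemes)),
        key=lambda seg: score_fn(graphemes, phonemes, seg),
    )

def train_maximization(corpus, train_fn, max_iters=10):
    """Alternate CRF training and resegmentation until the alignments stop changing."""
    segs = [linear_segmentation(len(g), len(p)) for g, p in corpus]
    score_fn = None
    for _ in range(max_iters):
        score_fn = train_fn(corpus, segs)  # train on the fixed alignments
        new_segs = [resegment(g, p, score_fn) for g, p in corpus]
        if new_segs == segs:               # converged: nothing was realigned
            break
        segs = new_segs
    return score_fn, segs
```

Stopping when no alignment changes is one possible convergence test; a tolerance on the training objective would be an equally reasonable choice.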

Summation Approach

In this approach, the alignments are summed over directly by modeling the CRF as:

$p(T_{1}^{M}|s_{1}^{N})={\frac {\sum _{a_{1}^{M}}\exp H(T_{1}^{M},a_{1}^{M},s_{1}^{N})}{\sum _{{\tilde {a}}_{1}^{M}}\sum _{{\tilde {T}}_{1}^{M}}\exp H({\tilde {T}}_{1}^{M},{\tilde {a}}_{1}^{M},s_{1}^{N})}}$

$={\frac {\sum _{a_{1}^{M}}\exp H(t_{1}^{N}(T_{1}^{M},a_{1}^{M}),s_{1}^{N})}{\sum _{{\tilde {t}}_{1}^{N}}\exp H({\tilde {t}}_{1}^{N},s_{1}^{N})}}$

The numerator has the same form as the denominator, restricted to the tag sequences that are consistent with $T_{1}^{M}$, and can therefore be computed by the same posterior approach using the Forward-Backward algorithm.
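The restricted sum in the numerator can be sketched with a forward recursion over states (phoneme index, B or I). The code below is an illustration under assumptions rather than the authors' implementation: `psi(prev_tag, tag, n)` is a hypothetical local potential standing in for the exponentiated feature sum at position `n`, tags use the `B-x` / `I-x` spelling, and `<s>` is a made-up start symbol. A brute-force enumeration over all alignments is included only to check the recursion on a toy input.

```python
import itertools
import math

def constrained_sum(n_graphemes, phonemes, psi):
    """Sum of path weights over all BIO tag sequences consistent with `phonemes`."""
    if not phonemes or n_graphemes < 1:
        return 0.0
    last = len(phonemes) - 1
    # alpha maps state (phoneme index, "B" or "I") to the summed weight of prefixes ending there
    alpha = {(0, "B"): psi("<s>", "B-" + phonemes[0], 0)}
    for n in range(1, n_graphemes):
        nxt = {}
        for (m, kind), w in alpha.items():
            prev_tag = kind + "-" + phonemes[m]
            # either stay inside the current phoneme's segment ...
            stay = (m, "I")
            nxt[stay] = nxt.get(stay, 0.0) + w * psi(prev_tag, "I-" + phonemes[m], n)
            # ... or begin the segment of the next phoneme
            if m < last:
                step = (m + 1, "B")
                nxt[step] = nxt.get(step, 0.0) + w * psi(prev_tag, "B-" + phonemes[m + 1], n)
        alpha = nxt
    return sum(w for (m, _), w in alpha.items() if m == last)

def brute_force_sum(n_graphemes, phonemes, psi):
    """Same quantity by enumerating every monotonic many-to-one alignment."""
    if not phonemes or n_graphemes < 1:
        return 0.0
    total = 0.0
    for cuts in itertools.combinations(range(1, n_graphemes), len(phonemes) - 1):
        bounds = (0,) + cuts + (n_graphemes,)
        tags = []
        for m, ph in enumerate(phonemes):
            tags += ["B-" + ph] + ["I-" + ph] * (bounds[m + 1] - bounds[m] - 1)
        weight, prev = 1.0, "<s>"
        for n, tag in enumerate(tags):
            weight *= psi(prev, tag, n)
            prev = tag
        total += weight
    return total

# Arbitrary positive potential, used only to exercise the two functions.
psi = lambda prev, tag, n: math.exp(
    0.1 * n + (0.7 if tag.startswith("B-") else -0.2) + (0.3 if prev[-1] == tag[-1] else 0.0)
)
dp = constrained_sum(6, ["f", "o", "n"], psi)
brute = brute_force_sum(6, ["f", "o", "n"], psi)
```

Only the forward pass is needed for the sum itself; the backward pass would come into play when computing posteriors for the gradient.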

Experiments and Results

Dataset

Experiments are reported on two publicly available English g2p corpora:

The NETtalk corpus, consisting of about 15k grapheme/phoneme word pairs. About 1000 g2p pairs are used as a development set. Gold-standard manual alignments are available for this corpus.
The Celex corpus, containing about 40k g2p word pairs. The test set contains about 15k words.

Evaluation Metric

The authors report error rates in terms of phoneme error rate (PER) and word error rate (WER).
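The two metrics can be sketched as below, using their common definitions: PER as the total Levenshtein distance between predicted and reference phoneme sequences divided by the number of reference phonemes, and WER as the fraction of words with at least one phoneme error. The page does not say how the authors' scoring handles details such as multiple reference pronunciations, so this is an assumption-laden illustration with made-up example data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]

def error_rates(references, hypotheses):
    """Return (PER, WER) for parallel lists of phoneme sequences."""
    phoneme_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_phonemes = sum(len(r) for r in references)
    wrong_words = sum(r != h for r, h in zip(references, hypotheses))
    return phoneme_errors / total_phonemes, wrong_words / len(references)

# Made-up example: one correct word and one word with a deleted phoneme.
refs = [["k", "ae", "t"], ["d", "o", "g"]]
hyps = [["k", "ae", "t"], ["d", "o"]]
per, wer = error_rates(refs, hyps)
```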

Results

The results of the paper are shown in Table 1. The authors compare their approach with other automated alignment generation approaches such as joint n-gram modeling and GIZA++.

Table 1: Effect of various alignments on two g2p tasks.

The maximization approach is empirically shown to perform better than the summation approach. The authors' approach compares favorably against the joint n-gram sequence modeling approach and the word alignment approach using GIZA++. The summation approach performs better than linear segmentation.

Related Papers

[1] Maximilian Bisani and Hermann Ney. 2008. Joint-sequence models for grapheme-to-phoneme conversion. Speech Communication, vol. 50, no. 5, pp. 434–451, May 2008.

[2] Sittichai Jiampojamarn and Grzegorz Kondrak. 2009. Online discriminative training for grapheme-to-phoneme conversion. In Proceedings of ISCA Interspeech, Brighton, U.K., Sept. 2009, pp. 1303–1306.
