# Gunawardana et al, ICSCT 2005: Hidden Conditional Random Fields for Phone Classification

## Citation

A. Gunawardana, M. Mahajan, A. Acero, J. C. Platt. Hidden conditional random fields for phone classification, International Conference on Speech Communication and Technology, pp. 1117-1120, September 2005.

## Summary

This paper addresses the problem of phone classification: given a sequence of acoustic features (observation vectors), predict the most probable phone. Each phone is modeled as a sequence of states, and the states emit observation vectors according to a Gaussian mixture model (GMM). Therefore the problem involves two latent variables: the state and mixture component at each time frame.

This problem is usually solved with Hidden Markov Models (HMMs). Traditionally, HMMs are trained for maximum likelihood (ML); at the time of the paper, discriminative training with objective functions such as maximum mutual information (MMI) and minimum phone error (MPE) had recently shown success. However, discriminative training of generative models like HMMs requires specialized algorithms such as Extended Baum-Welch (EBW). The authors propose a Hidden Conditional Random Field (HCRF) model, which can be trained with general-purpose optimization algorithms such as L-BFGS and Stochastic Gradient Descent (SGD).

### Hidden Conditional Random Fields (HCRF)

For simplicity, the HCRF is first formulated for single Gaussian emission distributions and scalar observations. This eliminates the hidden "mixture component" variable and leaves only the state. The HCRF gives the conditional probability of a phone label ${\displaystyle w}$ given a sequence of observations ${\displaystyle \mathbf {o} =(o_{1},\ldots ,o_{T})}$:

${\displaystyle p(w|\mathbf {o} ;\lambda )={\frac {1}{z(\mathbf {o} ;\lambda )}}\sum _{\mathbf {s} \in w}\exp\{\lambda \cdot f(w,\mathbf {s} ,\mathbf {o} )\}}$

where ${\displaystyle \mathbf {s} }$ is the hidden state sequence, ${\displaystyle f}$ is the feature vector, ${\displaystyle \lambda }$ is the weight vector, and ${\displaystyle \textstyle z(\mathbf {o} ;\lambda )=\sum _{w,\mathbf {s} \in w}\exp\{\lambda \cdot f(w,\mathbf {s} ,\mathbf {o} )\}}$ is the partition function.

The features ${\displaystyle f}$ used in this paper include language model features (whose weights play the role of phone priors), transition counts between pairs of states, occupancy counts of single states, and first- and second-order moments of the observations for each state.
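The formulas for these features are not reproduced in this summary. A sketch consistent with the verbal description, for scalar observations, is given here. In it, ${\displaystyle \delta (\cdot )}$ is an indicator function, ${\displaystyle s}$ and ${\displaystyle s'}$ range over states, ${\displaystyle w'}$ ranges over phone labels, ${\displaystyle s_{t}}$ and ${\displaystyle o_{t}}$ are the state and observation at frame ${\displaystyle t}$, and all sums run over time frames. The superscript names are labels assumed here for illustration, not necessarily the paper's notation.

${\displaystyle f_{w'}^{(\mathrm {LM} )}(w,\mathbf {s} ,\mathbf {o} )=\delta (w=w')}$

${\displaystyle f_{ss'}^{(\mathrm {Tr} )}(w,\mathbf {s} ,\mathbf {o} )=\sum _{t}\delta (s_{t-1}=s)\,\delta (s_{t}=s')}$

${\displaystyle f_{s}^{(\mathrm {Occ} )}(w,\mathbf {s} ,\mathbf {o} )=\sum _{t}\delta (s_{t}=s)}$

${\displaystyle f_{s}^{(\mathrm {M1} )}(w,\mathbf {s} ,\mathbf {o} )=\sum _{t}\delta (s_{t}=s)\,o_{t}}$

${\displaystyle f_{s}^{(\mathrm {M2} )}(w,\mathbf {s} ,\mathbf {o} )=\sum _{t}\delta (s_{t}=s)\,o_{t}^{2}}$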

These features involve at most 2 consecutive states, which implies a Markovian assumption on the state sequence. This makes it possible to perform efficient training and decoding.
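To illustrate why the Markovian structure matters, here is a minimal Python sketch of the phone posterior computed with a forward recursion in the log domain, so the sum over state sequences costs time linear in the number of frames. It is not the authors' implementation. It assumes scalar observations, a separate set of states per phone, and no initial-state feature; the dictionary keys (`lm`, `trans`, `occ`, `m1`, `m2`) and the helper `random_params` are hypothetical names chosen for this example.

```python
import numpy as np

def frame_scores(p, o):
    """(T, S) per-frame state scores: occ_s + m1_s * o_t + m2_s * o_t**2."""
    o = np.asarray(o, dtype=float)
    return p["occ"] + np.outer(o, p["m1"]) + np.outer(o ** 2, p["m2"])

def log_phone_score(p, o):
    """log of sum over state sequences of exp(lambda . f) for one phone."""
    e = frame_scores(p, o)
    alpha = e[0]  # assumption: no initial-state feature in this toy model
    for t in range(1, len(e)):
        # alpha_t(s') = logsum_s [alpha_{t-1}(s) + trans[s, s']] + e_t(s')
        alpha = np.logaddexp.reduce(alpha[:, None] + p["trans"], axis=0) + e[t]
    return p["lm"] + np.logaddexp.reduce(alpha)

def posterior(params, o):
    """p(w | o) for every phone w; the partition function is the normalizer."""
    scores = np.array([log_phone_score(p, o) for p in params])
    return np.exp(scores - np.logaddexp.reduce(scores))

def random_params(num_phones, num_states, seed=0):
    """Hypothetical helper: random weights for a toy model."""
    rng = np.random.default_rng(seed)
    S = num_states
    return [
        dict(
            lm=float(rng.normal()),
            trans=rng.normal(size=(S, S)),
            occ=rng.normal(size=S),
            m1=rng.normal(size=S),
            m2=-np.abs(rng.normal(size=S)),  # negative, like -1/(2*sigma^2)
        )
        for _ in range(num_phones)
    ]

if __name__ == "__main__":
    toy = random_params(num_phones=3, num_states=2)
    print(posterior(toy, [0.5, -1.0, 0.3, 2.0]))
```

The recursion touches each pair of consecutive states once per frame, which is possible only because no feature spans more than two adjacent states.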

### Relationship to HMMs

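It can be shown that an appropriate setting of the weight vector makes the HCRF equivalent to an ML-trained HMM, in the sense that both assign the same conditional probability ${\displaystyle p(w|\mathbf {o} )}$ to every phone.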
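One such weight setting can be read off by expanding the logarithm of the HMM joint probability, as sketched here for the single-Gaussian, scalar-observation case. The notation is assumed for illustration: ${\displaystyle u(w)}$ is the phone prior, ${\displaystyle a_{ss'}}$ the transition probability from state ${\displaystyle s}$ to state ${\displaystyle s'}$, and ${\displaystyle \mu _{s}}$ and ${\displaystyle \sigma _{s}^{2}}$ the mean and variance of state ${\displaystyle s}$. Each weight is labeled by the feature type it multiplies (language model, transition, occupancy, first and second moment).

${\displaystyle \lambda _{w}^{(\mathrm {LM} )}=\log u(w)}$

${\displaystyle \lambda _{ss'}^{(\mathrm {Tr} )}=\log a_{ss'}}$

${\displaystyle \lambda _{s}^{(\mathrm {Occ} )}=-{\frac {1}{2}}\left(\log 2\pi \sigma _{s}^{2}+{\frac {\mu _{s}^{2}}{\sigma _{s}^{2}}}\right)}$

${\displaystyle \lambda _{s}^{(\mathrm {M1} )}={\frac {\mu _{s}}{\sigma _{s}^{2}}}}$

${\displaystyle \lambda _{s}^{(\mathrm {M2} )}=-{\frac {1}{2\sigma _{s}^{2}}}}$

The last three follow from the expansion ${\displaystyle \log {\mathcal {N}}(o;\mu ,\sigma ^{2})=-{\tfrac {1}{2}}\log 2\pi \sigma ^{2}-{\tfrac {\mu ^{2}}{2\sigma ^{2}}}+{\tfrac {\mu }{\sigma ^{2}}}o-{\tfrac {1}{2\sigma ^{2}}}o^{2}}$. With these weights, ${\displaystyle \exp\{\lambda \cdot f(w,\mathbf {s} ,\mathbf {o} )\}}$ equals the HMM joint probability of the phone, state sequence, and observations (assuming any initial-state probabilities are absorbed into the transitions), so normalizing over phones gives the HMM posterior.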

But HMMs can only represent a subset of the conditional probabilities HCRFs can represent, due to their local normalization constraints.

### Training and Decoding

The training objective is to maximize the total conditional log-likelihood of the training data:

${\displaystyle {\mathcal {L}}(\lambda )=\sum _{n=1}^{N}\log p(w^{(n)}|\mathbf {o} ^{(n)};\lambda )}$

This objective function can be maximized with general-purpose optimization algorithms such as L-BFGS and Stochastic Gradient Descent (SGD). In either case, it is necessary to calculate the gradient of the conditional log-likelihood of one training example ${\displaystyle ({\hat {\mathbf {o} }},{\hat {w}})}$ with respect to the weight vector:

${\displaystyle \nabla _{\lambda }\log p({\hat {w}}|{\hat {\mathbf {o} }};\lambda )=\sum _{\mathbf {s} \in {\hat {w}}}f({\hat {w}},\mathbf {s} ,{\hat {\mathbf {o} }})p(\mathbf {s} |{\hat {w}},{\hat {\mathbf {o} }};\lambda )-\sum _{w,\mathbf {s} \in w}f(w,\mathbf {s} ,{\hat {\mathbf {o} }})p(w,\mathbf {s} |{\hat {\mathbf {o} }};\lambda )}$

and the two conditional probabilities involved can both be calculated with the Forward-Backward algorithm.
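The gradient can be sketched in Python as follows. This is an illustrative toy under stated assumptions, not the paper's code: observations are scalar, each phone owns its own states, there is no initial-state feature, and the dictionary keys (`lm`, `trans`, `occ`, `m1`, `m2`) are hypothetical names. Because each phone's features fire only under that phone, the two expectations in the gradient combine into the per-phone expected feature counts scaled by ${\displaystyle \delta (w={\hat {w}})-p(w|{\hat {\mathbf {o} }};\lambda )}$.

```python
import numpy as np

def forward_backward(p, o):
    """Return log-score, state posteriors, and expected transition counts
    for one phone, given scalar observations o."""
    o = np.asarray(o, dtype=float)
    e = p["occ"] + np.outer(o, p["m1"]) + np.outer(o ** 2, p["m2"])  # (T, S)
    T, S = e.shape
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = e[0]
    for t in range(1, T):
        alpha[t] = np.logaddexp.reduce(alpha[t - 1][:, None] + p["trans"], axis=0) + e[t]
    for t in range(T - 2, -1, -1):
        beta[t] = np.logaddexp.reduce(p["trans"] + (e[t + 1] + beta[t + 1])[None, :], axis=1)
    log_z = np.logaddexp.reduce(alpha[-1])
    gamma = np.exp(alpha + beta - log_z)  # p(s_t = s | w, o)
    xi = np.zeros((S, S))                 # expected transition counts
    for t in range(1, T):
        xi += np.exp(alpha[t - 1][:, None] + p["trans"]
                     + (e[t] + beta[t])[None, :] - log_z)
    return log_z, gamma, xi

def gradient(params, o, w_hat):
    """Return log p(w_hat | o) and its gradient, one dict per phone."""
    o = np.asarray(o, dtype=float)
    stats = [forward_backward(p, o) for p in params]
    scores = np.array([p["lm"] + st[0] for p, st in zip(params, stats)])
    log_norm = np.logaddexp.reduce(scores)
    post = np.exp(scores - log_norm)      # p(w | o) for every phone
    grads = []
    for w, (_, gamma, xi) in enumerate(stats):
        c = float(w == w_hat) - post[w]   # clamped minus free expectation
        grads.append(dict(
            lm=c,
            trans=c * xi,
            occ=c * gamma.sum(axis=0),
            m1=c * (gamma.T @ o),
            m2=c * (gamma.T @ o ** 2),
        ))
    return scores[w_hat] - log_norm, grads
```

Since the objective is maximized, an SGD step would add a small multiple of each gradient entry to the corresponding weight, while L-BFGS would consume the same gradients summed over the training set.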

The Viterbi algorithm for decoding in HMMs can also be used for decoding in HCRFs.
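As a rough illustration, a Viterbi decoder for the same kind of toy model might look like the sketch below. It replaces the sum over state sequences with a maximum, so it returns the phone and state path with the highest joint score rather than the phone with the highest summed posterior; whether the paper decodes exactly this way is not stated in this summary. The model layout and the dictionary keys (`lm`, `trans`, `occ`, `m1`, `m2`) are assumptions made for this example.

```python
import numpy as np

def viterbi(p, o):
    """Best state path and its score lambda . f(w, s, o) for one phone."""
    o = np.asarray(o, dtype=float)
    e = p["occ"] + np.outer(o, p["m1"]) + np.outer(o ** 2, p["m2"])  # (T, S)
    T, S = e.shape
    delta = e[0].copy()                   # best score ending in each state
    back = np.zeros((T, S), dtype=int)    # back-pointers
    for t in range(1, T):
        cand = delta[:, None] + p["trans"]  # cand[s, s']: come from s, go to s'
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + e[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return float(p["lm"] + delta.max()), path[::-1]

def decode(params, o):
    """Return (best phone index, best state path, joint score)."""
    results = [viterbi(p, o) for p in params]
    best = max(range(len(params)), key=lambda w: results[w][0])
    return best, results[best][1], results[best][0]
```

The recursion has the same shape as the forward pass used in training, with the log-sum replaced by a max and back-pointers kept to recover the path.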

## Experiments

### Dataset

Experiments are conducted on the TIMIT phone classification task. Results are reported on the MIT development test set and the NIST core test set. The training, development and evaluation sets consist of 142,910, 15,334 and 7,333 phones respectively.

### Criterion

The evaluation criterion is the phone classification error rate.

### Involved Systems

Four systems are compared:

• HMM(ML): An HMM trained for maximum likelihood;
• HMM(MMI): An HMM trained discriminatively for maximum mutual information;
• HCRF(L-BFGS): An HCRF trained with L-BFGS;
• HCRF(SGD): An HCRF trained with stochastic gradient descent.

The last three systems are initialized from the parameters of the HMM(ML) system. Every system is evaluated with GMMs of 10, 20, and 40 mixture components.

### Results

HCRFs significantly outperform HMMs on both the development and the evaluation test sets. HCRF(SGD) achieves the lowest error rate.