Gunawardana et al., ICSCT 2005: Hidden Conditional Random Fields for Phone Classification


Citation

A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt. Hidden conditional random fields for phone classification. International Conference on Speech Communication and Technology, pp. 1117–1120, September 2005.

Online Version

PDF version

Summary

This paper addresses the problem of phone classification: given a sequence of acoustic features (observation vectors), predict the most probable phone. Each phone is modeled as a sequence of states, and each state emits observation vectors according to a Gaussian mixture model (GMM). The problem therefore involves two latent variables at each time frame: the state and the mixture component.

This problem is usually solved with Hidden Markov Models (HMMs). Traditionally, HMMs are trained for maximum likelihood (ML); "recently" there has been success with discriminative training under objective functions such as maximum mutual information (MMI) and minimum phone error (MPE). However, discriminative training of generative models such as HMMs requires specialized algorithms such as Extended Baum-Welch (EBW). The authors propose a Hidden Conditional Random Field (HCRF) model that can be trained with general-purpose optimization algorithms such as L-BFGS and stochastic gradient descent (SGD).

Hidden Conditional Random Fields (HCRF)

For simplicity, the HCRF is first formulated for single-Gaussian emission distributions and scalar observations. This eliminates the hidden "mixture component" variable and leaves only the state. The HCRF gives the conditional probability of a phone label $w$ given a sequence of observations $o = (o_1, \ldots, o_T)$:

$$p(w \mid o; \lambda) = \frac{1}{z(o; \lambda)} \sum_{s \in w} \exp\big(\lambda \cdot f(w, s, o)\big)$$

where $s = (s_1, \ldots, s_T)$ is the hidden state sequence (the sum runs over all state sequences consistent with $w$), $f(w, s, o)$ is the feature vector, $\lambda$ is the weight vector, and $z(o; \lambda) = \sum_{w', s} \exp\big(\lambda \cdot f(w', s, o)\big)$ is the partition function.
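
To make the normalization concrete, here is a minimal brute-force sketch (Python, not from the paper) that computes $p(w \mid o)$ by enumerating every state sequence. The `score` function is a toy stand-in for $\lambda \cdot f(w, s, o)$; the phone labels, state inventory, and parameter values are made up for illustration.

```python
import itertools
import math

def score(w, s, o):
    """Toy stand-in for lambda . f(w, s, o): rewards observations close to a
    per-(label, state) "mean" and adds a small bonus for staying in the same state.
    All names and values here are illustrative only."""
    means = {("aa", 0): -0.5, ("aa", 1): 0.5, ("iy", 0): 0.0, ("iy", 1): 1.0}
    val = 0.0
    for t, (st, ot) in enumerate(zip(s, o)):
        val -= (ot - means[(w, st)]) ** 2
        if t > 0 and s[t - 1] == st:
            val += 0.1
    return val

def posterior(o, labels=("aa", "iy"), states=(0, 1)):
    """Exact p(w | o): sum exp-scores over all state sequences, then normalize."""
    unnorm = {}
    for w in labels:
        unnorm[w] = sum(
            math.exp(score(w, s, o))
            for s in itertools.product(states, repeat=len(o))
        )
    z = sum(unnorm.values())  # partition function z(o)
    return {w: v / z for w, v in unnorm.items()}

print(posterior([0.4, 0.6, 0.5]))  # the two probabilities sum to 1
```

In practice the sum over state sequences is of course computed with the Forward algorithm rather than by enumeration.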

The features used in this paper include language model features (prior probabilities of the phones), transition counts between pairs of states, occupancy features of single states, and first- and second-order moments of the observations for each state:

$$\begin{aligned}
f^{(LM)}_{w'}(w, s, o) &= \delta(w = w') \\
f^{(Tr)}_{s' s''}(w, s, o) &= \sum_{t} \delta(s_{t-1} = s')\,\delta(s_t = s'') \\
f^{(Occ)}_{s'}(w, s, o) &= \sum_{t} \delta(s_t = s') \\
f^{(M1)}_{s'}(w, s, o) &= \sum_{t} \delta(s_t = s')\, o_t \\
f^{(M2)}_{s'}(w, s, o) &= \sum_{t} \delta(s_t = s')\, o_t^2
\end{aligned}$$

These features depend on at most two consecutive states, which amounts to a Markovian assumption on the state sequence and makes efficient training and decoding possible.
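
As a rough illustration of how these features are assembled (a sketch, not the paper's code; the flat index layout and the shared state inventory are assumptions), the feature vector for one hypothesized phone, state sequence, and scalar observation sequence might be computed as follows:

```python
import numpy as np

def hcrf_features(w, s, o, phones, n_states):
    """Feature vector for hypothesized phone w, state sequence s, observations o."""
    f = np.zeros(len(phones) + n_states * n_states + 3 * n_states)
    f[phones.index(w)] = 1.0                         # language-model feature
    off = len(phones)
    for t in range(1, len(s)):
        f[off + s[t - 1] * n_states + s[t]] += 1.0   # transition counts
    off += n_states * n_states
    for st, ot in zip(s, o):
        f[off + st] += 1.0                           # state occupancy
        f[off + n_states + st] += ot                 # first moment
        f[off + 2 * n_states + st] += ot ** 2        # second moment
    return f

# Example: a 4-frame utterance hypothesized as phone "aa", with 2 states
print(hcrf_features("aa", [0, 0, 1, 1], [0.2, 0.3, 0.9, 1.1], ["aa", "iy"], 2))
```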

Relationship to HMMs

It can be proved that the HCRF is equivalent to an ML-trained HMM if the weight vector is set as follows (writing $p(w')$ for the phone prior, $a_{s' s''}$ for the HMM transition probabilities, and $\mu_{s'}$, $\sigma^2_{s'}$ for the mean and variance of the Gaussian emission distribution of state $s'$):

$$\begin{aligned}
\lambda^{(LM)}_{w'} &= \log p(w') \\
\lambda^{(Tr)}_{s' s''} &= \log a_{s' s''} \\
\lambda^{(Occ)}_{s'} &= -\frac{\mu_{s'}^2}{2\sigma_{s'}^2} - \frac{1}{2}\log\big(2\pi\sigma_{s'}^2\big) \\
\lambda^{(M1)}_{s'} &= \frac{\mu_{s'}}{\sigma_{s'}^2} \\
\lambda^{(M2)}_{s'} &= -\frac{1}{2\sigma_{s'}^2}
\end{aligned}$$

But HMMs can only represent a subset of the conditional distributions that HCRFs can represent, because of the local normalization constraints on HMM parameters.
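
A quick sanity check of the state-dependent part of this mapping, under the single-Gaussian, scalar-observation assumption used above: the occupancy, first-moment, and second-moment weights together reproduce the Gaussian log-density. The parameter values below are made up for illustration.

```python
import math

def state_weights(mu, var):
    """Occupancy / first-moment / second-moment weights for one HMM state."""
    w_occ = -mu ** 2 / (2 * var) - 0.5 * math.log(2 * math.pi * var)
    w_m1 = mu / var
    w_m2 = -1.0 / (2 * var)
    return w_occ, w_m1, w_m2

def gaussian_logpdf(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

mu, var, x = 0.3, 0.5, 0.7
w_occ, w_m1, w_m2 = state_weights(mu, var)
# lambda_occ * 1 + lambda_M1 * x + lambda_M2 * x^2  ==  log N(x; mu, var)
print(w_occ + w_m1 * x + w_m2 * x * x, gaussian_logpdf(x, mu, var))
```

The language-model and transition weights contribute the remaining $\log p(w)$ and $\log a_{s' s''}$ terms through the label-indicator and transition-count features, so $\lambda \cdot f$ equals the HMM log joint likelihood of the label, state sequence, and observations.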

Training and Decoding

The training objective is to maximize the total conditional log-likelihood of the training data:

$$L(\lambda) = \sum_{n} \log p\big(w^{(n)} \mid o^{(n)}; \lambda\big)$$

This objective function can be maximized with general-purpose optimization algorithms such as L-BFGS and stochastic gradient descent (SGD). In either case, it is necessary to compute the gradient of the conditional log-likelihood of a training example w.r.t. the weight vector:

$$\nabla_\lambda \log p(w \mid o; \lambda) = \mathbb{E}_{s \mid w, o; \lambda}\big[f(w, s, o)\big] - \mathbb{E}_{w', s \mid o; \lambda}\big[f(w', s, o)\big]$$

i.e. the expected feature vector given the correct label minus the expected feature vector given only the observations; the two posterior distributions involved can both be computed with the Forward-Backward algorithm.
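
The following sketch (hypothetical code, with a deliberately tiny feature set and a shared state inventory as assumptions) computes this gradient by exhaustive enumeration instead of the Forward-Backward algorithm, which makes the "clamped minus free" structure of the two expectations explicit:

```python
import itertools
import numpy as np

def feats(w, s, o, phones, n_states):
    """Label indicator + per-state occupancy and first-moment features."""
    f = np.zeros(len(phones) + 2 * n_states)
    f[phones.index(w)] = 1.0
    for st, ot in zip(s, o):
        f[len(phones) + st] += 1.0            # occupancy
        f[len(phones) + n_states + st] += ot  # first moment
    return f

def grad_loglik(w_true, o, lam, phones, n_states):
    """Expected features given the true label minus expected features given only o."""
    num, den = 0.0, 0.0
    num_f = np.zeros_like(lam)
    den_f = np.zeros_like(lam)
    for w in phones:
        for s in itertools.product(range(n_states), repeat=len(o)):
            f = feats(w, s, o, phones, n_states)
            p = np.exp(lam @ f)
            den += p
            den_f += p * f
            if w == w_true:
                num += p
                num_f += p * f
    return num_f / num - den_f / den

phones, n_states = ["aa", "iy"], 2
o = [0.1, 0.4, 0.3]
lam = np.zeros(len(phones) + 2 * n_states)
print(grad_loglik("aa", o, lam, phones, n_states))
```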

The Viterbi algorithm for decoding in HMMs can also be used for decoding in HCRFs.
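
For completeness, here is a sketch of what such a decoder might look like when the score decomposes into a per-phone language-model score, per-state-pair transition scores, and per-frame state scores as above; the two-phone toy model, parameter values, and function names are all assumptions for illustration.

```python
import numpy as np

def viterbi_score(o, lm, trans, occ, m1, m2):
    """Score of the best state path: lm + sum of frame and transition scores."""
    n_states = len(occ)

    def frame(s, x):
        # per-frame state score: lambda_occ[s] + lambda_M1[s]*x + lambda_M2[s]*x^2
        return occ[s] + m1[s] * x + m2[s] * x * x

    delta = np.array([frame(s, o[0]) for s in range(n_states)])
    for x in o[1:]:
        delta = np.array([
            max(delta[sp] + trans[sp, s] for sp in range(n_states)) + frame(s, x)
            for s in range(n_states)
        ])
    return lm + delta.max()

def decode(o, phones, params):
    """Pick the phone whose best state path scores highest."""
    return max(phones, key=lambda w: viterbi_score(o, *params[w]))

# Toy two-phone model with two states per phone; weights drawn at random
rng = np.random.default_rng(1)
params = {
    w: (rng.normal(),                   # language-model weight
        rng.normal(size=(2, 2)),        # transition weights
        rng.normal(size=2),             # occupancy weights
        rng.normal(size=2),             # first-moment weights
        -np.abs(rng.normal(size=2)))    # second-moment weights (kept negative)
    for w in ["aa", "iy"]
}
print(decode([0.2, 0.5, 0.4], ["aa", "iy"], params))
```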

Experiments