Chelba and Acero, EMNLP 2004: Adaptation of Maximum Entropy Capitalizer: Little Data Can Help A Lot

Citation

C. Chelba and A. Acero. Adaptation of maximum entropy capitalizer: little data can help a lot, Proceedings of EMNLP, 2004.

Online Version

PDF version

Summary

This paper takes a maximum entropy Markov model (MEMM) for capitalization trained with a large amount of data for one domain, and adapts it to another domain with a relatively small amount of data.

Capitalization

The problem addressed by this paper is automatic capitalization of uniformly cased text, which is useful for post-processing speech recognition transcripts. It can be treated as a sequence labeling task: for each word in a sequence, we choose one of the following capitalization labels:

LOC: lower case
CAP: first letter capitalized
AUC: all upper case
MXC: mixed case (e.g. MaxEnt)
PNC: punctuation
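
The label set above can be made concrete with a small helper that maps a correctly cased token to its label, which is how training labels could be derived from cased text. This is only a sketch: the function name `case_label` is hypothetical, and the handling of edge cases (single capital letters counted as CAP, tokens without letters such as numbers counted as LOC) is an assumption, not something stated in the paper.

```python
def case_label(token: str) -> str:
    """Map a correctly cased token to one of LOC, CAP, AUC, MXC, PNC."""
    # Tokens with no letters or digits are treated as punctuation.
    if not any(ch.isalnum() for ch in token):
        return "PNC"
    letters = [ch for ch in token if ch.isalpha()]
    # Assumption: tokens without letters (e.g. "2004") count as lower case.
    if not letters or all(ch.islower() for ch in letters):
        return "LOC"
    if all(ch.isupper() for ch in letters):
        # Assumption: a single capital letter ("I", "A") counts as CAP.
        return "CAP" if len(letters) == 1 else "AUC"
    if token[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "CAP"
    return "MXC"
```

For example, `case_label("MaxEnt")` gives `"MXC"` and `case_label("NASA")` gives `"AUC"`.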

A baseline approach is 1-gram capitalization: for each word, choose the most frequent capitalization seen in the training corpus. As an overriding rule, the first word of a sentence is always capitalized.
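
A minimal sketch of this baseline is shown below. The function names, the data layout (sentences as lists of `(word, label)` pairs), and the fallback label for unseen words are assumptions made for illustration. The sentence-initial rule is implemented here by promoting a lower-case label to CAP, which is one possible reading of the rule.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Return a dict mapping each lower-cased word to its most frequent label."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, label in sentence:
            counts[word.lower()][label] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def capitalize_unigram(model, words, default="LOC"):
    """Label each word of one sentence with its most frequent training label."""
    # Assumption: words never seen in training fall back to `default`.
    labels = [model.get(word.lower(), default) for word in words]
    # Overriding rule: the first word of a sentence is capitalized.
    if labels and labels[0] == "LOC":
        labels[0] = "CAP"
    return labels
```

Because each word is labeled independently of its neighbors, this baseline cannot distinguish, say, a proper-noun use of a word from its common-noun use, which is the gap the sequence model is meant to close.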

MEMMs for capitalization

A popular sequential model for capitalization is the Maximum Entropy Markov Model (MEMM). In an MEMM, the capitalization label $y_{i}$ for the $i$-th word $w_{i}$ depends on several words around $w_{i}$ (including $w_{i}$ itself) and on the capitalization labels of several previous words. In this paper, $y_{i}$ is chosen to depend on $\{w_{i-1},w_{i},w_{i+1},y_{i-2},y_{i-1}\}$, which are collectively denoted by $x_{i}$ (in the paper the $x$ has an underline, which is omitted here).

The conditional probability of the capitalization label $y$ given the observation $x$ is defined as:

$p_{\Lambda }(y|x)=Z^{-1}(x;\Lambda )\cdot \exp \left[\sum _{i=1}^{F}\lambda _{i}f_{i}(x,y)\right]$

$Z(x;\Lambda )=\sum _{y}\exp \left[\sum _{i=1}^{F}\lambda _{i}f_{i}(x,y)\right]$

In these formulas, $f_{i}(x,y)$ are feature functions of the pair $(x,y)$. In this paper, the features are all binary-valued. They encode the identity of the labels and words, as well as sub-word information such as prefixes and suffixes. The $\lambda _{i}$ are the feature weights, which need to be trained.
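
The local model can be sketched in a few lines of Python. With binary features, the inner sum reduces to adding up the weights of the features that fire for $(x,y)$, and the normalizer $Z(x;\Lambda)$ is the sum of the exponentiated scores over all labels. The feature templates and the names `active_features` and `p_label` below are illustrative assumptions, not the paper's actual feature set.

```python
import math

LABELS = ["LOC", "CAP", "AUC", "MXC", "PNC"]

def active_features(x, y):
    """Return the binary features that fire for context x and label y.

    x is (w_prev, w, w_next, y_prev2, y_prev1); templates are hypothetical.
    """
    w_prev, w, w_next, y_prev2, y_prev1 = x
    return [
        ("w", w, y),
        ("w-1", w_prev, y),
        ("w+1", w_next, y),
        ("y-1", y_prev1, y),
        ("y-2,y-1", y_prev2, y_prev1, y),
        ("prefix", w[:3], y),
        ("suffix", w[-3:], y),
    ]

def p_label(x, weights):
    """Return p(y | x) for every label, given a dict of feature weights."""
    scores = {
        y: sum(weights.get(f, 0.0) for f in active_features(x, y))
        for y in LABELS
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp_scores.values())
    return {y: e / z for y, e in exp_scores.items()}
```

With an empty weight dictionary every label gets the same probability; raising the weight of a single feature such as `("w", "paris", "CAP")` shifts probability mass toward CAP in contexts where that feature fires.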

Training an MEMM

The MEMM is usually trained for maximum conditional likelihood of the training data. One can also perform maximum a posteriori (MAP) training by imposing a zero-mean diagonal Gaussian prior $\Lambda \sim {\mathcal {N}}(0,{\text{diag}}(\sigma _{i}^{2}))$ on the feature weights, which results in a quadratic regularization term in the objective function:

$L(\Lambda )=\sum _{x,y}{\tilde {p}}(x,y)\log p_{\Lambda }(y|x)-\sum _{i=1}^{F}{\frac {\lambda _{i}^{2}}{2\sigma _{i}^{2}}}+{\text{const}}(\Lambda )$

where ${\tilde {p}}$ represents the empirical distribution (i.e. that of the training data), and $p_{\Lambda }$ represents the distribution given by the model.
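
For reference, differentiating this objective with respect to a single weight gives the familiar "empirical count minus expected count" form, shifted by the prior term. This is a standard result for Gaussian-regularized maximum entropy models, written here in the notation above rather than quoted from the paper:

$\frac{\partial L}{\partial \lambda _{i}}=\sum _{x,y}{\tilde {p}}(x,y)f_{i}(x,y)-\sum _{x,y}{\tilde {p}}(x)p_{\Lambda }(y|x)f_{i}(x,y)-{\frac {\lambda _{i}}{\sigma _{i}^{2}}}$

Setting this to zero shows that, at the optimum, the empirical count of each feature exceeds its expected count under the model by exactly $\lambda _{i}/\sigma _{i}^{2}$.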

This objective function can be maximized with the Improved Iterative Scaling (IIS) algorithm. In each iteration of this algorithm, $\lambda _{i}$ is increased by $\delta _{i}$ , which is determined by the following equation:

$\sum _{x,y}{\tilde {p}}(x,y)f_{i}(x,y)-{\frac {\lambda _{i}}{\sigma _{i}^{2}}}={\frac {\delta _{i}}{\sigma _{i}^{2}}}+\sum _{x,y}{\tilde {p}}(x)p_{\Lambda }(y|x)f_{i}(x,y)\exp(\delta _{i}f^{\#}(x,y))$

where $f^{\#}(x,y)$ is the total number of features activated by $(x,y)$ .

MAP adaptation of MEMMs

An MEMM trained on one domain usually performs worse on a different domain. Even if only a little data from the new domain is available, it may be beneficial to adapt the MEMM to the new domain. This can be done by training on the new-domain data with a Gaussian prior centered at the parameters of the old model, $\Lambda \sim {\mathcal {N}}(\Lambda ^{0},{\text{diag}}(\sigma _{i}^{2}))$.

The training procedure with this new prior is almost the same as that with a zero-mean prior. The objective function becomes:

$L(\Lambda )=\sum _{x,y}{\tilde {p}}(x,y)\log p_{\Lambda }(y|x)-\sum _{i=1}^{F}{\frac {(\lambda _{i}-\lambda _{i}^{0})^{2}}{2\sigma _{i}^{2}}}+{\text{const}}(\Lambda )$
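
The adapted objective can be sketched as follows, dropping the constant term. The data layout (each example lists the active features for every candidate label, plus the gold label), the single shared variance `sigma2`, and the treatment of features absent from the old model (prior mean 0) are all assumptions made for this illustration. Passing an empty `prior_mean` recovers the zero-mean objective.

```python
import math

def log_prob(features_by_label, gold, weights):
    """log p(gold | x), where features_by_label maps label -> active features."""
    scores = {
        y: sum(weights.get(f, 0.0) for f in feats)
        for y, feats in features_by_label.items()
    }
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return scores[gold] - log_z

def map_objective(data, weights, prior_mean, sigma2):
    """Average log-likelihood minus the Gaussian penalty centered at prior_mean."""
    # The empirical distribution puts mass 1/n on each training example.
    loglik = sum(log_prob(fbl, gold, weights) for fbl, gold in data) / len(data)
    # Assumption: features missing from prior_mean have prior mean 0.
    names = set(weights) | set(prior_mean)
    penalty = sum(
        (weights.get(f, 0.0) - prior_mean.get(f, 0.0)) ** 2 / (2.0 * sigma2)
        for f in names
    )
    return loglik - penalty
```

A small `sigma2` keeps the adapted weights close to the old model, while a large one lets the new-domain data dominate.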

And the increment $\delta _{i}$ in each iteration is determined by:

$\sum _{x,y}{\tilde {p}}(x,y)f_{i}(x,y)-{\frac {\lambda _{i}-\lambda _{i}^{0}}{\sigma _{i}^{2}}}={\frac {\delta _{i}}{\sigma _{i}^{2}}}+\sum _{x,y}{\tilde {p}}(x)p_{\Lambda }(y|x)f_{i}(x,y)\exp(\delta _{i}f^{\#}(x,y))$
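
The update equation has no closed-form solution for $\delta _{i}$, but it is a one-dimensional equation whose two sides can be compared numerically. The sketch below solves it with Newton's method, which is one possible choice and not necessarily what the authors used. The function name and argument layout are hypothetical: `terms` holds one pair `(a, m)` per $(x,y)$ where feature $i$ fires, with `a` equal to ${\tilde {p}}(x)p_{\Lambda }(y|x)f_{i}(x,y)$ and `m` equal to $f^{\#}(x,y)$. Setting `lam0` to 0 gives the zero-mean update.

```python
import math

def solve_delta(emp_count, terms, lam, lam0, sigma2, iters=50, tol=1e-12):
    """Solve the update equation for one feature's increment delta.

    emp_count: empirical expectation of the feature.
    terms:     list of (a, m) pairs as described in the lead-in.
    lam, lam0: current weight and prior mean (the old model's weight).
    sigma2:    prior variance for this feature.
    """
    target = emp_count - (lam - lam0) / sigma2
    delta = 0.0
    for _ in range(iters):
        # g(delta) = right-hand side minus left-hand side of the equation.
        g = delta / sigma2 + sum(a * math.exp(delta * m) for a, m in terms) - target
        # g is strictly increasing, so its derivative is always positive.
        dg = 1.0 / sigma2 + sum(a * m * math.exp(delta * m) for a, m in terms)
        step = g / dg
        delta -= step
        if abs(step) < tol:
            break
    return delta
```

When the current weight already equals the prior mean and the expected count matches the empirical count, the solution is `delta = 0`, so the weight stays where the old model left it.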

Experiments

Dataset

Dataset	Training	Development	Testing
WSJ	20M	-	-
CNN	73k	73k	73k
ABC	25k	8k	8k

Three data sets are used:

The Wall Street Journal is used as in-domain training data.
Two broadcast news datasets (CNN and ABC) are used as out-of-domain data. Each is divided into training, development and test sets. The development data is used for tuning parameters such as the variance of the Gaussian prior.

The sizes of the datasets (in words) are listed in the table above.

Criterion

The evaluation criterion is the labeling error rate.
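
As a sketch, the labeling error rate is the fraction of tokens whose predicted label differs from the reference label. The function name is hypothetical, and counting every token (including punctuation) is an assumption; the summary does not say which tokens are scored.

```python
def labeling_error_rate(predicted, reference):
    """Return the fraction of positions where the two label sequences differ."""
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("cannot compute an error rate on empty input")
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)
```

For instance, one wrong label out of four tokens gives an error rate of 0.25.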

Results

Evaluation is carried out on the CNN and ABC test sets. Four systems are evaluated: the 1-gram capitalizer (baseline), the unadapted MEMM trained on WSJ, and two MEMMs adapted from WSJ to CNN and ABC. The error rates are shown in the table to the right. Two conclusions can be drawn:

The MEMM performs better than the 1-gram baseline.
Adapted MEMMs perform best for their target domain.

Maigo's Comment

To fully justify adaptation, I think it's still necessary to compare an adapted MEMM with an MEMM trained solely on the data of the second domain (even though the amount of data is small). Unfortunately the authors didn't give the results for the second scenario.
