# Chelba and Acero, EMNLP 2004: Adaptation of Maximum Entropy Capitalizer: Little Data Can Help A Lot

## Citation

C. Chelba and A. Acero. Adaptation of maximum entropy capitalizer: little data can help a lot. In Proceedings of EMNLP, 2004.

## Summary

This paper takes a maximum entropy Markov model (MEMM) for capitalization trained with a large amount of data for one domain, and adapts it to another domain with a relatively small amount of data.

### Capitalization

The problem addressed by this paper is the automatic capitalization of uniformly cased text, which is useful, for example, in post-processing speech recognition transcripts. The problem can be cast as sequence labeling: for each word in a sequence, we must choose one of the following capitalization labels:

- LOC: lower case
- CAP: first letter capitalized
- AUC: all upper case
- MXC: mixed case (e.g. MaxEnt)
- PNC: punctuation

A baseline approach is 1-gram capitalization: for each word, choose the most frequent capitalization seen in the training corpus. As an overriding rule, the first word of a sentence is always capitalized.
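This baseline can be sketched in a few lines of Python. The toy corpus and function names below are illustrative, not from the paper:

```python
from collections import Counter, defaultdict

def train_unigram_capitalizer(sentences):
    """sentences: lists of correctly cased tokens.
    For each lowercased word, remember its most frequent surface form."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, tok in enumerate(sent):
            if i == 0:
                # Skip sentence-initial tokens: their capitalization is
                # forced by position, not by the word's identity.
                continue
            counts[tok.lower()][tok] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def capitalize(model, lowercased_sent):
    out = [model.get(w, w) for w in lowercased_sent]
    # Overriding rule: always capitalize the first word of the sentence.
    out[0] = out[0][:1].upper() + out[0][1:]
    return out

# Tiny illustrative training corpus.
train = [["Shares", "of", "IBM", "rose", "in", "New", "York"],
         ["He", "sold", "shares", "of", "IBM"]]
model = train_unigram_capitalizer(train)
restored = capitalize(model, ["shares", "of", "ibm", "rose"])
```

Unseen words are passed through unchanged, i.e. left in lower case.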

### MEMMs for capitalization

A popular sequential model for capitalization is the Maximum Entropy Markov Model (MEMM). In an MEMM, the capitalization label ${\displaystyle y_{i}}$ for the ${\displaystyle i}$-th word ${\displaystyle w_{i}}$ depends on several words around ${\displaystyle w_{i}}$ (including ${\displaystyle w_{i}}$ itself), and the capitalization labels of several previous words. In this paper, ${\displaystyle y_{i}}$ is chosen to depend on ${\displaystyle \{w_{i-1},w_{i},w_{i+1},y_{i-2},y_{i-1}\}}$, which are collectively denoted by ${\displaystyle x_{i}}$ (in the paper the ${\displaystyle x}$ has an underline, which is omitted here).

The conditional probability of the capitalization label ${\displaystyle y}$ given the observation ${\displaystyle x}$ is defined as:

${\displaystyle p_{\Lambda }(y|x)=Z^{-1}(x;\Lambda )\cdot \exp \left[\sum _{i=1}^{F}\lambda _{i}f_{i}(x,y)\right]}$

${\displaystyle Z(x;\Lambda )=\sum _{y}\exp \left[\sum _{i=1}^{F}\lambda _{i}f_{i}(x,y)\right]}$

In these formulas, the ${\displaystyle f_{i}(x,y)}$ are features defined on the pair ${\displaystyle (x,y)}$. In this paper, all features are binary-valued; they encode the identities of labels and words, as well as sub-word information such as prefixes and suffixes. The ${\displaystyle \lambda _{i}}$ are the feature weights, which are learned from data.
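To make the model concrete, here is a toy sketch of ${\displaystyle p_{\Lambda }(y|x)}$ with binary indicator features: each feature fires for one (context, label) combination, and the label distribution is a softmax over the summed weights of the active features. The feature templates and weights are invented for illustration, not the paper's actual feature set:

```python
import math

LABELS = ["LOC", "CAP", "AUC", "MXC", "PNC"]

def features(x, y):
    """x holds the context {w_{i-1}, w_i, w_{i+1}, y_{i-2}, y_{i-1}};
    each returned string names one binary feature that fires for (x, y)."""
    return [f"w_i={x['w_i']}|y={y}",
            f"w_prev={x['w_prev']}|y={y}",
            f"suffix3={x['w_i'][-3:]}|y={y}",
            f"y_prev={x['y_prev']}|y={y}"]

def p_lambda(weights, x):
    # Score each label by summing the weights of its active features,
    # then normalize with Z(x; Lambda).
    scores = {y: sum(weights.get(f, 0.0) for f in features(x, y))
              for y in LABELS}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

# Illustrative weights: "ibm" strongly prefers all-upper-case.
weights = {"w_i=ibm|y=AUC": 2.0, "w_i=ibm|y=LOC": -1.0}
x = {"w_prev": "of", "w_i": "ibm", "w_next": "rose",
     "y_prev": "LOC", "y_prev2": "CAP"}
dist = p_lambda(weights, x)
```

Absent features contribute weight zero, so unseen (context, label) pairs fall back to a uniform-like score.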

### Training an MEMM

The MEMM is usually trained to maximize the conditional likelihood of the training data. One can also perform maximum a posteriori (MAP) training by imposing a zero-mean diagonal Gaussian prior ${\displaystyle \Lambda \sim {\mathcal {N}}(0,{\text{diag}}(\sigma _{i}^{2}))}$ on the feature weights, which adds a quadratic regularization term to the objective function:

${\displaystyle L(\Lambda )=\sum _{x,y}{\tilde {p}}(x,y)\log p_{\Lambda }(y|x)-\sum _{i=1}^{F}{\frac {\lambda _{i}^{2}}{2\sigma _{i}^{2}}}+{\text{const}}(\Lambda )}$

where ${\displaystyle {\tilde {p}}}$ represents the empirical distribution (i.e. that of the training data), and ${\displaystyle p_{\Lambda }}$ represents the distribution given by the model.
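As a concrete sketch, the regularized objective can be written directly in code for a toy two-label model. The data, feature scheme, and shared variance below are illustrative, not the paper's:

```python
import math

def p_cond(weights, x):
    # Toy two-label conditional model: the score of label y for context x
    # is the weight of the single indicator feature (x, y); p(y|x) is a
    # softmax over the two scores.
    scores = {y: weights.get((x, y), 0.0) for y in ("LOC", "CAP")}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

def map_objective(data, weights, sigma2=1.0):
    # Mean conditional log-likelihood over (x, y) pairs, minus the
    # quadratic penalty from the zero-mean Gaussian prior.
    loglik = sum(math.log(p_cond(weights, x)[y]) for x, y in data) / len(data)
    penalty = sum(w * w / (2.0 * sigma2) for w in weights.values())
    return loglik - penalty

data = [("ibm", "CAP")]
obj = map_objective(data, {("ibm", "CAP"): 1.0})
```

The penalty shrinks large weights toward zero; the larger ${\displaystyle \sigma _{i}^{2}}$, the weaker the shrinkage.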

This objective function can be maximized with the Improved Iterative Scaling (IIS) algorithm. In each iteration of this algorithm, ${\displaystyle \lambda _{i}}$ is increased by ${\displaystyle \delta _{i}}$, which is determined by the following equation:

${\displaystyle \sum _{x,y}{\tilde {p}}(x,y)f_{i}(x,y)-{\frac {\lambda _{i}}{\sigma _{i}^{2}}}={\frac {\delta _{i}}{\sigma _{i}^{2}}}+\sum _{x,y}{\tilde {p}}(x)p_{\Lambda }(y|x)f_{i}(x,y)\exp(\delta _{i}f^{\#}(x,y))}$

where ${\displaystyle f^{\#}(x,y)}$ is the total number of features activated by ${\displaystyle (x,y)}$.
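This update equation is scalar in ${\displaystyle \delta _{i}}$ and its right-hand side is increasing in ${\displaystyle \delta _{i}}$, so it can be solved numerically per coordinate. Below is a minimal Newton-method sketch; the quantities fed in (empirical expectation, model-expectation terms) are illustrative placeholders rather than values computed from real data:

```python
import math

def iis_delta(emp, lam, sigma2, terms, iters=50):
    """Solve the IIS update equation for one delta_i by Newton's method.

    emp    : empirical expectation sum_{x,y} p~(x,y) f_i(x,y)
    lam    : current weight lambda_i
    sigma2 : prior variance sigma_i^2
    terms  : list of (c, fsharp) pairs, where c = p~(x) p_Lambda(y|x) f_i(x,y)
             and fsharp = f#(x,y), for each (x, y) where f_i fires

    We find the root of
        g(delta) = delta/sigma2 + sum_j c_j exp(delta * fsharp_j)
                   - (emp - lam/sigma2),
    which is increasing and convex in delta, so Newton iteration converges.
    """
    delta = 0.0
    for _ in range(iters):
        g = delta / sigma2 - (emp - lam / sigma2)
        gprime = 1.0 / sigma2
        for c, fsharp in terms:
            e = c * math.exp(delta * fsharp)
            g += e
            gprime += e * fsharp
        delta -= g / gprime
    return delta

# Illustrative numbers only: a single active (x, y) with f# = 2.
delta = iis_delta(emp=0.5, lam=0.0, sigma2=10.0, terms=[(0.25, 2)])
```

The Gaussian prior keeps the equation strictly monotone even when the data term vanishes, since the ${\displaystyle \delta _{i}/\sigma _{i}^{2}}$ term always contributes positive slope.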

An MEMM trained on one domain usually performs worse on a different domain. Even if only a small amount of data from the new domain is available, it can be beneficial to adapt the MEMM to it. This is realized by training on the new-domain data with a Gaussian prior centered at the parameters of the old model: ${\displaystyle \Lambda \sim {\mathcal {N}}(\Lambda ^{0},{\text{diag}}(\sigma _{i}^{2}))}$.

The training procedure with this new prior is almost the same as that with a zero-mean prior. The objective function becomes:

${\displaystyle L(\Lambda )=\sum _{x,y}{\tilde {p}}(x,y)\log p_{\Lambda }(y|x)-\sum _{i=1}^{F}{\frac {(\lambda _{i}-\lambda _{i}^{0})^{2}}{2\sigma _{i}^{2}}}+{\text{const}}(\Lambda )}$

And the increment ${\displaystyle \delta _{i}}$ in each iteration is determined by:

${\displaystyle \sum _{x,y}{\tilde {p}}(x,y)f_{i}(x,y)-{\frac {\lambda _{i}-\lambda _{i}^{0}}{\sigma _{i}^{2}}}={\frac {\delta _{i}}{\sigma _{i}^{2}}}+\sum _{x,y}{\tilde {p}}(x)p_{\Lambda }(y|x)f_{i}(x,y)\exp(\delta _{i}f^{\#}(x,y))}$
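In code, the only change from zero-mean training is the centering of the penalty. A minimal sketch (the function name and inputs are mine, not the paper's):

```python
def prior_penalty(weights, old_weights, sigma2=1.0):
    """Quadratic penalty of the adaptation prior N(Lambda^0, diag(sigma^2)):
    adapted weights are pulled toward the old in-domain weights rather than
    toward zero. Passing old_weights = {} recovers the zero-mean penalty."""
    return sum((w - old_weights.get(f, 0.0)) ** 2 / (2.0 * sigma2)
               for f, w in weights.items())

penalty_adapted = prior_penalty({"f1": 2.0}, {"f1": 2.0})  # stays at old model
penalty_scratch = prior_penalty({"f1": 2.0}, {})           # zero-mean case
```

Keeping a weight at its old value costs nothing, so features unsupported by the small new-domain dataset simply retain their in-domain weights.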

## Experiments

### Dataset

|     | Training | Develop. | Testing |
|-----|----------|----------|---------|
| WSJ | 20M      |          |         |
| CNN | 73k      | 73k      | 73k     |
| ABC | 25k      | 8k       | 8k      |

Three data sets are used:

- The Wall Street Journal (WSJ) corpus is used as in-domain training data.
- Two broadcast news datasets (CNN and ABC) are used as out-of-domain data. Both are divided into training, development, and test sets. The development data is used for tuning hyperparameters such as the variance of the prior.

The sizes of the datasets (in words) are listed in the table above.

### Criterion

The evaluation criterion is the labeling error rate.

### Results

Evaluation is carried out on the CNN and ABC test sets. Four systems are evaluated: the 1-gram capitalizer (baseline), the unadapted MEMM trained on WSJ, and two MEMMs adapted from WSJ to CNN and ABC, respectively. From the reported error rates, two conclusions can be drawn:

- The MEMMs outperform the 1-gram baseline.
- Each adapted MEMM performs best on its target domain.

## Maigo's Comment

To fully justify adaptation, I think it is still necessary to compare an adapted MEMM against an MEMM trained solely on data from the second domain (even though that amount of data is small). Unfortunately, the authors did not report results for this scenario.