# Chelba and Acero, EMNLP 2004: Adaptation of Maximum Entropy Capitalizer: Little Data Can Help A Lot

## Citation

C. Chelba and A. Acero. **Adaptation of maximum entropy capitalizer: little data can help a lot**, *Proceedings of EMNLP*, 2004.

## Summary

This paper takes a maximum entropy Markov model (MEMM) for capitalization trained with a large amount of data for one domain, and adapts it to another domain with a relatively small amount of data.

### Capitalization

The problem addressed by this paper is automatic capitalization of uniformly cased text, which can be useful for post-processing speech recognition transcripts. It can be treated as a sequence labeling task: for each word in a sequence, we need to choose one of the following capitalization labels:

- LOC: lower case
- CAP: first letter capitalized
- AUC: all upper case
- MXC: mixed case (e.g. MaxEnt)
- PNC: punctuation
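
To build training labels from cased text, each token has to be mapped to one of these five classes. The following is a minimal sketch of such a mapping; the function name `capitalization_label` and the handling of edge cases (single capital letters count as CAP, digit-only tokens count as LOC) are assumptions made for illustration, not rules taken from the paper.

```python
import string


def capitalization_label(token: str) -> str:
    """Map a cased token to one of LOC, CAP, AUC, MXC, PNC.

    Edge-case choices here are assumptions, not the paper's rules.
    """
    # Tokens made up only of punctuation characters.
    if token and all(ch in string.punctuation for ch in token):
        return "PNC"
    # No upper-case characters at all (also covers digit-only tokens).
    if token == token.lower():
        return "LOC"
    # First letter upper case, everything after it lower case.
    if token[0].isupper() and token[1:] == token[1:].lower():
        return "CAP"
    # Every cased character is upper case.
    if token == token.upper():
        return "AUC"
    # Anything else mixes cases, e.g. "MaxEnt".
    return "MXC"
```

Checking CAP before AUC means a one-letter token such as "A" is labeled CAP; swapping the two checks would label it AUC instead.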

A baseline approach is 1-gram capitalization: for each word, choose the most frequent capitalization seen in the training corpus. As an overriding rule, the first word of a sentence is always capitalized.
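
The 1-gram baseline can be sketched in a few lines. This version stores the most frequent surface form of each word rather than a label, which amounts to the same decision; the function names and the toy corpus in the usage note are hypothetical.

```python
from collections import Counter, defaultdict


def train_unigram_capitalizer(cased_sentences):
    """Remember the most frequent cased form of every lower-cased word."""
    counts = defaultdict(Counter)
    for sentence in cased_sentences:
        for token in sentence:
            counts[token.lower()][token] += 1
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def capitalize(model, words):
    """Apply the 1-gram model, then the sentence-initial override."""
    # Unknown words are left as they are (lower case).
    out = [model.get(w, w) for w in words]
    if out:
        # Overriding rule: the first word of a sentence is capitalized.
        out[0] = out[0][:1].upper() + out[0][1:]
    return out
```

For example, after training on a few cased sentences in which "New York" and "IBM" appear, `capitalize(model, ["ibm", "shares", "fell", "in", "new", "york"])` restores the casing of those words while leaving the others in lower case.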

### MEMMs for capitalization

A popular sequential model for capitalization is the Maximum Entropy Markov Model (MEMM). In an MEMM, the capitalization label $y_i$ for the $i$-th word $w_i$ depends on several words around $w_i$ (including $w_i$ itself) and on the capitalization labels of several previous words. The words and previous labels that $y_i$ depends on are collectively denoted by $x_i$ (in the paper the $x$ has an underline, which is omitted here).

The conditional probability of the capitalization label given the observation is defined as:

$$P(y_i \mid x_i) = \frac{1}{Z(x_i)} \exp\left( \sum_j \lambda_j f_j(x_i, y_i) \right), \qquad Z(x_i) = \sum_{y} \exp\left( \sum_j \lambda_j f_j(x_i, y) \right)$$

In these formulas, $f_j(x_i, y_i)$ are features of the union of $x_i$ and $y_i$. In this paper, the features are all binary-valued. They encode the identity of the labels and words, as well as sub-word information such as prefixes and suffixes. The $\lambda_j$ are the weights of the features, and need to be trained.
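
With binary features, the score of a label is just the sum of the weights of the features that fire for it, and the label probabilities are the normalized exponentials of those scores. The sketch below illustrates this; the feature templates in `active_features` (current word, previous label, two-letter suffix) are hypothetical stand-ins and not the paper's actual feature set.

```python
import math

LABELS = ["LOC", "CAP", "AUC", "MXC", "PNC"]


def active_features(context, label):
    """Names of the binary features that fire for (context, label).

    The three templates below are illustrative assumptions.
    """
    word = context["word"]
    return [
        f"word={word}|y={label}",
        f"prev_label={context['prev_label']}|y={label}",
        f"suffix={word[-2:]}|y={label}",
    ]


def label_distribution(weights, context):
    """Distribution over labels: exp(sum of active weights) / Z(context)."""
    scores = {
        y: sum(weights.get(f, 0.0) for f in active_features(context, y))
        for y in LABELS
    }
    # Subtract the maximum score for numerical stability; the
    # normalized result is unchanged.
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}
```

With an empty weight dictionary every label gets the same probability; giving a single feature a positive weight shifts probability mass toward the label that feature is tied to.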

### Training an MEMM

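The MEMM is usually trained to maximize the conditional likelihood of the training data. One can also perform maximum a posteriori (MAP) training by imposing a zero-mean diagonal Gaussian prior on the feature weights, which adds a quadratic regularization term to the objective function (up to an additive constant):

$$L(\Lambda) = \sum_{x,y} \tilde{p}(x,y) \log p_\Lambda(y \mid x) - \sum_j \frac{\lambda_j^2}{2\sigma_j^2}$$

where $\tilde{p}$ represents the empirical distribution (i.e. that of the training data), $p_\Lambda$ represents the distribution given by the model with weights $\Lambda = \{\lambda_j\}$, and $\sigma_j^2$ is the prior variance of $\lambda_j$.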

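This objective function can be maximized with the Improved Iterative Scaling (IIS) algorithm. In each iteration of this algorithm, each weight $\lambda_j$ is increased by $\delta_j$, which is determined by the following equation:

$$\sum_{x,y} \tilde{p}(x,y) f_j(x,y) - \frac{\lambda_j}{\sigma_j^2} = \frac{\delta_j}{\sigma_j^2} + \sum_{x,y} \tilde{p}(x)\, p_\Lambda(y \mid x)\, f_j(x,y) \exp\left( \delta_j f^{\#}(x,y) \right)$$

where $f^{\#}(x,y) = \sum_k f_k(x,y)$ is the total number of features activated by $(x,y)$.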
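
The increment generally has no closed form, because it appears both linearly (through the prior) and inside an exponential, so it has to be found numerically. The sketch below uses Newton's method for one feature; this choice of solver, the function name `iis_delta`, and the input layout are assumptions for illustration and are not taken from the paper.

```python
import math


def iis_delta(emp_expect, model_terms, lam, sigma2, iters=50, tol=1e-12):
    """Solve for the IIS increment of a single feature weight.

    Finds delta such that
        emp_expect - lam / sigma2
            = delta / sigma2 + sum(q * exp(delta * fsharp))
    where each (q, fsharp) in model_terms holds
        q      = p_emp(x) * p_model(y | x) * f(x, y)
        fsharp = number of features active for (x, y).
    """
    target = emp_expect - lam / sigma2
    delta = 0.0
    for _ in range(iters):
        # Residual of the equation and its derivative with respect to delta.
        g = delta / sigma2 + sum(q * math.exp(delta * fs) for q, fs in model_terms) - target
        dg = 1.0 / sigma2 + sum(q * fs * math.exp(delta * fs) for q, fs in model_terms)
        step = g / dg
        delta -= step
        if abs(step) < tol:
            break
    return delta
```

The left-hand side minus the right-hand side is monotone and convex in the increment, which is why a plain Newton iteration started at zero is usually sufficient here. When the prior is very flat and every pair activates the same number of features $C$, the result approaches the familiar closed form $\frac{1}{C}\log$ of the ratio between the empirical and model expectations.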

### MAP adaptation of MEMMs

An MEMM trained for one domain usually performs worse on a different domain. Even if we have only a little data from the new domain, it may be beneficial to adapt the MEMM to it. This can be done by training on the new-domain data with a Gaussian prior centered at the parameters $\lambda_j^0$ of the old model.

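The training procedure with this new prior is almost the same as that with a zero-mean prior. The objective function, with the prior now centered at the old weights $\lambda_j^0$, becomes:

$$L(\Lambda) = \sum_{x,y} \tilde{p}(x,y) \log p_\Lambda(y \mid x) - \sum_j \frac{(\lambda_j - \lambda_j^0)^2}{2\sigma_j^2}$$

And the increment $\delta_j$ in each iteration is determined by:

$$\sum_{x,y} \tilde{p}(x,y) f_j(x,y) - \frac{\lambda_j - \lambda_j^0}{\sigma_j^2} = \frac{\delta_j}{\sigma_j^2} + \sum_{x,y} \tilde{p}(x)\, p_\Lambda(y \mid x)\, f_j(x,y) \exp\left( \delta_j f^{\#}(x,y) \right)$$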
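
The effect of centering the prior at the old weights can be seen on a toy problem. The sketch below maximizes the adapted objective with plain gradient ascent instead of IIS, purely to keep the code short; the function `adapt`, its parameters, and the single shared variance `sigma2` are illustrative assumptions. Features absent from the old model are given a prior mean of zero in this sketch.

```python
import math


def adapt(examples, labels, featurize, old_weights, sigma2=1.0, lr=0.5, steps=500):
    """Gradient ascent on the mean log-likelihood of the new-domain data
    minus sum_j (lam_j - lam0_j)^2 / (2 * sigma2)."""
    weights = dict(old_weights)  # start from the old model
    n = len(examples)
    for _ in range(steps):
        grad = {}
        for x, y_true in examples:
            feats = {y: featurize(x, y) for y in labels}
            scores = {y: sum(weights.get(f, 0.0) for f in feats[y]) for y in labels}
            m = max(scores.values())
            z = sum(math.exp(s - m) for s in scores.values())
            for y in labels:
                p = math.exp(scores[y] - m) / z
                # Empirical count minus model expectation of each feature.
                for f in feats[y]:
                    grad[f] = grad.get(f, 0.0) + ((y == y_true) - p) / n
        for f in set(grad) | set(weights):
            lam0 = old_weights.get(f, 0.0)
            lam = weights.get(f, 0.0)
            # The prior pulls each weight back toward its old value.
            weights[f] = lam + lr * (grad.get(f, 0.0) - (lam - lam0) / sigma2)
    return weights
```

A weight whose feature never fires on the new-domain data receives no likelihood gradient and no pull from the prior, so it keeps its old value exactly. Weights for features that do fire move away from the old values by an amount controlled by `sigma2`: a small variance keeps the adapted model close to the old one, a large variance lets the new data dominate.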

## Experiments

### Dataset

| Dataset | Training | Development | Testing |
|---|---|---|---|
| WSJ | 20M | | |
| CNN | 73k | 73k | 73k |
| ABC | 25k | 8k | 8k |

Three data sets are used:

- The Wall Street Journal (WSJ) corpus is used as training data for the original (old-domain) model.
- Two broadcast news datasets (CNN and ABC) are used as the new-domain data to adapt to. Each is divided into training, development and test sets. The development data is used for tuning parameters such as the variance of the prior.

The sizes of the datasets (in words) are listed in the table above.

### Criterion

The evaluation criterion is the labeling error rate.
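
As a concrete reading of this criterion, the error rate is the fraction of tokens whose predicted label differs from the reference label. The helper below is a minimal sketch under the assumption that every token counts equally; the function name is hypothetical.

```python
def labeling_error_rate(predicted, reference):
    """Fraction of positions where the predicted label is wrong."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    if not reference:
        return 0.0
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)
```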

### Results

Evaluation is carried out on the CNN and ABC test sets. Four systems are evaluated: the 1-gram capitalizer (baseline), the unadapted MEMM trained on WSJ, and two MEMMs adapted from WSJ to CNN and ABC. The error rates are shown in the table to the right. Two conclusions can be drawn:

- The MEMM performs better than the 1-gram baseline.
- Adapted MEMMs perform best for their target domain.

## Maigo's Comment

To fully justify adaptation, I think it's still necessary to compare an adapted MEMM with an MEMM trained solely on the data of the second domain (even though the amount of data is small). Unfortunately the authors didn't give the results for the second scenario.