# Bikel et al MLJ 1999

## Citation

D. M. Bikel, R. L. Schwartz, and R. M. Weischedel. An algorithm that learns what's in a name. Machine Learning Journal, 34: 211-231, 1999.

## Summary

In this paper the authors present IdentiFinder, a Hidden Markov Model approach to the Named Entity Recognition problem. Until this paper, most Named Entity Recognition techniques relied on handcrafted patterns that are completely language dependent and do not adapt well to different kinds of input (speech input, upper-case text, etc.).

This was the first paper to address Named Entity Recognition with HMMs. It recognizes a structure in the identification of named entities and formulates the task as a classification problem in which every word belongs either to one of the name-classes or to the special class "NOT-A-NAME".

The proposed solution can be trained efficiently and achieves results equivalent to the best NE system that existed at the time (approx. 95%). The latter was a handcrafted system that needed experts to maintain its rules and was completely language dependent. The authors also examine how the language-dependent component of the model affects performance and find that, although it helps, it is not nearly as significant as the other parts of the model.

As future work the authors point to the development of a hierarchical model to capture nested names (like "Bank of Boston") and to modifications that would let the model use longer-distance information.

## Brief Description of the Method

Their solution has a model for each name-class and a model for the not-a-name text. Additionally, there are two special states, START-OF-SENTENCE and END-OF-SENTENCE. The figure below provides a graphical representation of the model (the dashed edges complete the graph).

Each of the regions in the graph is modeled with a different statistical bigram language model (the likelihood of words occurring within that region), meaning that each type of name is treated as a different language with its own bigram probabilities. Formally, one is trying to find the most likely sequence of name-classes ${\displaystyle NC}$ given a sequence of words ${\displaystyle W}$. Because ${\displaystyle P(W)}$ is fixed for a given sentence, this is equivalent to maximizing the joint probability ${\displaystyle P(W,NC)}$. Using the Viterbi algorithm, one can then search the entire space of possible name-class assignments for the one that maximizes the following equation:

${\displaystyle \max P(NC|W)=\max {\frac {P(W,NC)}{P(W)}}=\max P(W,NC)}$
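The search described above can be sketched as a small Viterbi decoder. This is only an illustration under stated assumptions: `log_score` is a hypothetical callback returning the log-probability of moving to a name-class and emitting a word, and it folds the paper's separate generation steps into one function. The toy scorer and the example class names are likewise made up for the demonstration.

```python
import math

def viterbi(words, classes, log_score, start="START-OF-SENTENCE"):
    """Return the highest-scoring name-class sequence for `words`.

    `log_score(prev_nc, prev_word, nc, word)` is a hypothetical callback
    giving log P(nc, word | prev_nc, prev_word).
    """
    if not words:
        return []
    # best[nc] = (log-probability of the best path ending in nc, that path)
    best = {nc: (log_score(start, None, nc, words[0]), [nc]) for nc in classes}
    for i in range(1, len(words)):
        new_best = {}
        for nc in classes:
            # Choose the predecessor class that maximizes the path score.
            score, path = max(
                ((best[p][0] + log_score(p, words[i - 1], nc, words[i]), best[p][1])
                 for p in classes),
                key=lambda t: t[0],
            )
            new_best[nc] = (score, path + [nc])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]

def toy_score(prev_nc, prev_word, nc, word):
    # Toy scorer (illustrative only): capitalized words favor PERSON.
    looks_like_name = word[:1].isupper()
    return math.log(0.9 if looks_like_name == (nc == "PERSON") else 0.1)

print(viterbi(["John", "runs"], ["PERSON", "NOT-A-NAME"], toy_score))
# → ['PERSON', 'NOT-A-NAME']
```

Working in log space avoids numerical underflow when many small probabilities are multiplied along a long sentence.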

Additionally, the authors represent words as two-element vectors. ${\displaystyle \left\langle w,f\right\rangle }$ represents a word occurrence, where ${\displaystyle w}$ is the text of the word and ${\displaystyle f}$ is a feature assigned to it. The set of features, as well as the motivation behind them, can be found in the figure below. These features require a language-dependent computation, but one that is simple and deterministic.
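To make the idea of a deterministic word feature concrete, here is a minimal sketch that assigns exactly one feature to each token. The feature names and their precedence order are illustrative assumptions, not a transcription of the paper's feature table, and `word_feature` is a hypothetical helper.

```python
import re

def word_feature(word, is_first_word=False):
    """Assign one deterministic feature to a token (illustrative subset)."""
    has_digit = any(ch.isdigit() for ch in word)
    has_alpha = any(ch.isalpha() for ch in word)
    if re.fullmatch(r"\d{2}", word):
        return "twoDigitNum"            # e.g. a two-digit year
    if re.fullmatch(r"\d{4}", word):
        return "fourDigitNum"           # e.g. a four-digit year
    if has_digit and has_alpha:
        return "containsDigitAndAlpha"  # e.g. a product code
    if has_digit:
        return "otherNum"               # any other token with digits
    if word.isalpha() and word.isupper():
        return "allCaps"                # e.g. an acronym
    if is_first_word:
        return "firstWord"              # capitalization is uninformative here
    if word[:1].isupper():
        return "initCap"                # capitalized word
    if word.islower():
        return "lowerCase"
    return "other"                      # punctuation and everything else

tokens = ["The", "IBM", "report", "of", "1999", ","]
pairs = [(w, word_feature(w, is_first_word=(i == 0))) for i, w in enumerate(tokens)]
print(pairs)
```

Each token receives the first feature whose test succeeds, so the order of the checks matters: a sentence-initial word is tagged `firstWord` before the capitalization check can fire.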

The final model is divided into two parts: the Top Level Model and the Back-off Model & Smoothing.

The Top Level Model, the most accurate and powerful, generates the words and name-classes following three steps:

• (1) Select a name-class, conditioning on the previous class and previous word, i.e., ${\displaystyle P(NC|NC_{-1},w_{-1})={\frac {c(NC,NC_{-1},w_{-1})}{c(NC_{-1},w_{-1})}}}$.
• (2) Generate the first word inside that name-class, conditioning on the current and previous name-classes, i.e., ${\displaystyle P(\left\langle w,f\right\rangle |NC,NC_{-1})={\frac {c(\left\langle w,f\right\rangle ,NC,NC_{-1})}{c(NC,NC_{-1})}}}$.
• (3) Generate all subsequent words inside that name-class, conditioning each on its predecessor and the current name-class, i.e., ${\displaystyle P(\left\langle w,f\right\rangle |\left\langle w,f\right\rangle _{-1},NC)={\frac {c(\left\langle w,f\right\rangle ,\left\langle w,f\right\rangle _{-1},NC)}{c(\left\langle w,f\right\rangle _{-1},NC)}}}$.

where ${\displaystyle c(.)}$ is the count function (the number of times the events occurred in the training data).
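The three count ratios above can be sketched as relative-frequency estimates over annotated sentences. This is a simplified illustration, not the authors' implementation: it assumes a new name-class region starts wherever the class label changes, it counts contexts only at those boundaries, and the names `count_events` and `ratio`, the event tags, and the sample data are all hypothetical.

```python
from collections import Counter

START = "START-OF-SENTENCE"  # special state named in the text

def count_events(sentences):
    """Count the events behind the three top-level estimates.

    `sentences` is a list of sentences, each a list of
    ((word, feature), name_class) pairs.
    Simplifying assumption: a region starts wherever the class label changes.
    """
    c = Counter()
    for sent in sentences:
        prev_nc, prev_wf = START, None
        for wf, nc in sent:
            prev_w = prev_wf[0] if prev_wf else None
            if nc != prev_nc:
                # Step (1): class given previous class and previous word.
                c[("class", nc, prev_nc, prev_w)] += 1
                c[("class_ctx", prev_nc, prev_w)] += 1
                # Step (2): first word given current and previous class.
                c[("first", wf, nc, prev_nc)] += 1
                c[("first_ctx", nc, prev_nc)] += 1
            else:
                # Step (3): subsequent word given predecessor and class.
                c[("next", wf, prev_wf, nc)] += 1
                c[("next_ctx", prev_wf, nc)] += 1
            prev_nc, prev_wf = nc, wf
    return c

def ratio(c, event, context):
    """Relative-frequency estimate c(event) / c(context); 0.0 if unseen."""
    return c[event] / c[context] if c[context] else 0.0

data = [[(("Mr.", "capPeriod"), "NOT-A-NAME"),
         (("Jones", "initCap"), "PERSON"),
         (("eats", "lowerCase"), "NOT-A-NAME")]]
counts = count_events(data)
p_class = ratio(counts,
                ("class", "PERSON", "NOT-A-NAME", "Mr."),
                ("class_ctx", "NOT-A-NAME", "Mr."))
print(p_class)
```

Unseen contexts return 0.0 here, which is exactly the sparsity problem that back-off and smoothing are meant to address.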

The Back-Off Model & Smoothing deals with the fact that there is not enough training data to estimate accurate probabilities for every possible event. This happens either when a word was not seen in training or when an event has too few observations to be estimated reliably. The model adopts a back-off approach, falling back as far as unigrams, together with smoothing techniques.
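As a rough illustration of backing off, the sketch below blends estimates of the same event ordered from most to least specific. The fixed weight `lam`, the function name `backoff_estimate`, and the example chain are assumptions made for brevity; this is not the paper's weighting scheme.

```python
def backoff_estimate(estimates, lam=0.8):
    """Blend estimates ordered from most to least specific.

    `estimates` holds probabilities for the same event, for example
    [P(w | w_prev, NC), P(w | NC), 1 / vocabulary_size].
    `lam` is a hypothetical fixed weight on the more specific estimate.
    """
    result = estimates[-1]  # start from the least specific estimate
    for p in reversed(estimates[:-1]):
        # Mix each more specific estimate with what has been blended so far.
        result = lam * p + (1.0 - lam) * result
    return result

# An unseen bigram (0.0) still receives probability mass from the
# less specific levels of the chain.
print(backoff_estimate([0.0, 0.5, 0.1], lam=0.5))
```

Because the last level of the chain is never zero, an event that was never observed in its most specific context still ends up with a non-zero probability.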

## Experimental Results

The model was tested on the MUC-6 dataset, a collection of 30 Wall Street Journal documents. The authors compared their model with the best NE system at the time, a rule-based system. They also tested their solution on different types of input material (mixed case, upper case and speech form) and on a different language (Spanish). The results are shown in the table below:

In the mixed-case setting, the best rule-based system performed better than the solution proposed in the paper, although the difference is not statistically significant and does not justify the effort of having experts maintain the set of rules. The authors attribute the low score for Spanish to the low quantity and quality (inconsistencies) of the training data for the Spanish model.

Another result of this work is that 100k words of training data seem to suffice to obtain state-of-the-art results on the NE task.