Cucerzan and Yarowsky, SIGDAT 1999
Cucerzan, S. and Yarowsky, D. 1999. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the Joint SIGDAT Conference on EMNLP and VLC (1999), pp. 90-99.
This paper describes a language-independent, EM-style bootstrapping algorithm for building a Named Entity Recognition tool. The algorithm iteratively learns from both word-internal and contextual information about entities, since certain morphological cues and contextual patterns are good indicators of particular named entity classes. The authors capture this morphological and contextual evidence in hierarchically smoothed trie structures.
The authors experimented with five languages: English, Romanian, Greek, Turkish and Hindi. Using minimal information about each language, the system searches the text for two named entity classes (person and place). For each entity class, the authors provide a short list of unambiguous seeds, and they also use some basic language-specific properties such as capitalization, word separators and language-related exceptions.
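To make the idea of a smoothed trie concrete, here is a minimal sketch of a character-level suffix trie in which every node keeps class counts and a lookup interpolates each node's estimate with the estimate inherited from its parent. The names `SmoothedTrie`, `TrieNode`, `CLASSES` and the interpolation weight `lam` are assumptions made for illustration, and the simple linear interpolation used here is a stand-in, not the paper's actual smoothing formula.

```python
from collections import defaultdict

CLASSES = ("person", "place")  # hypothetical label set for this sketch


class TrieNode:
    def __init__(self):
        self.children = {}
        self.counts = defaultdict(float)  # class -> (possibly fractional) count


class SmoothedTrie:
    """Character trie whose nodes hold class counts (illustrative only)."""

    def __init__(self, lam=0.5):
        self.root = TrieNode()
        self.lam = lam  # assumed weight given to the deeper, more specific node

    def add(self, key, label, weight=1.0):
        # Update the counts on every node along the path, root included.
        node = self.root
        node.counts[label] += weight
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
            node.counts[label] += weight

    @staticmethod
    def _local(node, fallback):
        total = sum(node.counts.get(c, 0.0) for c in CLASSES)
        if total == 0:
            return dict(fallback)
        return {c: node.counts.get(c, 0.0) / total for c in CLASSES}

    def distribution(self, key):
        # Walk down the trie, mixing each node's estimate with its parent's.
        uniform = {c: 1.0 / len(CLASSES) for c in CLASSES}
        node = self.root
        dist = self._local(node, uniform)
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                break  # unseen continuation: back off to the last known node
            local = self._local(node, dist)
            dist = {c: self.lam * local[c] + (1 - self.lam) * dist[c]
                    for c in CLASSES}
        return dist


# A suffix trie is obtained by inserting reversed words.
suffix_trie = SmoothedTrie()
for word, label in [("ionescu", "person"), ("popescu", "person"),
                    ("bucuresti", "place"), ("ploiesti", "place")]:
    suffix_trie.add(word[::-1], label)

dist = suffix_trie.distribution("georgescu"[::-1])
best = max(dist, key=dist.get)
```

In this toy run the unseen word shares the suffix "escu" with the two person names, so the lookup backs off to that shared node instead of failing on the unknown word.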
The algorithm used in the paper can be described in several steps:
- Stage 0: Define the classes and fill the initial class seeds for each language.
- Stage 1: Read the text and build the character-based trie structures. A total of 4 tries are built: 2 for context (left and right) and 2 for morphological patterns (prefix and suffix). Each node stores a probability distribution that gives the probability of each name class given the history ending at that node.
- Stage 2: Apply the bootstrapping algorithm, recalculating the probability distribution at each node in every cycle. Within a cycle, as the contextual models become better estimated, named entities are identified with more confidence; these new entities improve the morphological models, which in turn leads to re-estimation of the contextual models.
- Stage 3: There are 4 classifiers available for each token. Combine all of these classifiers to decide whether the token is an entity and, if so, which class it belongs to.
- Stage 4: Save the classified tokens and contexts.
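The cycle in Stages 0 to 3 can be sketched in a heavily simplified form. This is an illustration under stated assumptions, not the authors' implementation: flat dictionaries keyed on the left context word and on a fixed three-character suffix stand in for the four tries, only two classifiers are combined instead of four, and the combination is a plain average. The function names `bootstrap` and `normalize` and the toy data are hypothetical.

```python
from collections import defaultdict

CLASSES = ("person", "place")  # hypothetical label set for this sketch


def normalize(counts):
    """Turn class counts into a distribution; uniform when nothing was seen."""
    total = sum(counts.get(c, 0.0) for c in CLASSES)
    if total == 0:
        return {c: 1.0 / len(CLASSES) for c in CLASSES}
    return {c: counts.get(c, 0.0) / total for c in CLASSES}


def bootstrap(occurrences, seeds, cycles=3):
    """occurrences: list of (left_context_word, token); seeds: token -> class."""
    # Stage 0: seeds are trusted completely and stay fixed in every cycle.
    seed_beliefs = {tok: {c: 1.0 if c == cls else 0.0 for c in CLASSES}
                    for tok, cls in seeds.items()}
    beliefs = dict(seed_beliefs)
    for _ in range(cycles):
        # Re-estimate the contextual and morphological models from beliefs.
        context = defaultdict(lambda: defaultdict(float))
        suffix = defaultdict(lambda: defaultdict(float))
        for left, tok in occurrences:
            if tok in beliefs:
                for c in CLASSES:
                    context[left][c] += beliefs[tok][c]
                    suffix[tok[-3:]][c] += beliefs[tok][c]
        # Re-classify every non-seed token by averaging the two classifiers.
        sums = defaultdict(lambda: defaultdict(float))
        seen = defaultdict(int)
        for left, tok in occurrences:
            if tok in seeds:
                continue
            ctx = normalize(context.get(left, {}))
            suf = normalize(suffix.get(tok[-3:], {}))
            for c in CLASSES:
                sums[tok][c] += 0.5 * ctx[c] + 0.5 * suf[c]
            seen[tok] += 1
        beliefs = dict(seed_beliefs)
        for tok in sums:
            beliefs[tok] = {c: sums[tok][c] / seen[tok] for c in CLASSES}
    return beliefs


occurrences = [("Mr.", "Smith"), ("Mr.", "Jones"), ("Mr.", "Brown"),
               ("in", "Paris"), ("in", "Berlin"), ("in", "London")]
seeds = {"Smith": "person", "Paris": "place"}
result = bootstrap(occurrences, seeds)
labels = {tok: max(dist, key=dist.get) for tok, dist in result.items()}
```

Starting from one seed per class, the contexts "Mr." and "in" become associated with person and place respectively, and the unlabeled tokens that share those contexts are then labeled accordingly.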
For all five languages, using the context and morphology tries together gives better accuracy than using either one alone. Bootstrapping further improves the results for every language. Experiments with training size showed that more training text improves overall accuracy, and lengthening the provided seed list also improves the F-score.
Several years later, the authors applied a similar approach, combining internal and contextual evidence, to Spanish and Dutch in the Cucerzan and Yarowsky, COLING 2002 paper.