Difference between revisions of "Cucerzan and Yarowsky, SIGDAT 1999"
Revision as of 20:08, 26 October 2010
Citation
Cucerzan, S. and Yarowsky, D. 1999. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the Joint SIGDAT Conference on EMNLP and VLC (1999), pp. 90-99.
Online version
Summary
This paper describes a language independent EM-style bootstrapping algorithm for building a named entity recognizer. The algorithm iteratively learns from both word-internal and contextual information about entities, since some morphological features and contextual patterns are good indicators of particular named entity classes. The authors capture this morphological and contextual evidence in hierarchically smoothed trie structures.
The authors experimented with five languages: English, Romanian, Greek, Turkish and Hindi. Using minimal information about each language, the system searches the text for two named entity classes (person and place). For each entity class, the authors provide a short list of unambiguous seeds, and they also use some basic language-specific properties such as capitalization, word separators and language-related exceptions.
The algorithm used in the paper can be described in several steps:
- Stage 0: Defining the classes and filling the initial class seeds for each language.
- Stage 1: Reading the text and building the character-based trie structures. A total of 4 tries are built: 2 for context (left and right) and 2 for morphological patterns (prefix and suffix).
- Stage 2: Introducing the training information into the tries, then applying the bootstrapping algorithm to re-estimate the probability distributions at each node.
- Stage 3: There are 4 classifiers available for each token. All of these classifiers are combined to decide whether the token is an entity and, if so, which class it belongs to.
- Stage 4: The classified tokens and contexts are saved.
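The stages above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it builds only the two morphological tries (prefix and suffix) and omits the context tries, and the interpolation weight `smoothing`, the averaging used to combine classifiers, and all names (`CharTrie`, `build_tries`, `classify`, `bootstrap`) are assumptions made for illustration. The paper's actual smoothing and combination formulas are not reproduced here.

```python
from collections import defaultdict

CLASSES = ("person", "place")


class TrieNode:
    def __init__(self):
        self.children = {}
        # class label -> accumulated (possibly fractional) count
        self.counts = defaultdict(float)


class CharTrie:
    """Character trie whose nodes hold class counts (illustrative only)."""

    def __init__(self, smoothing=0.5):
        self.root = TrieNode()
        self.smoothing = smoothing  # hypothetical weight given to the parent

    def add(self, key, label, weight=1.0):
        node = self.root
        node.counts[label] += weight
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
            node.counts[label] += weight

    @staticmethod
    def _normalize(counts):
        total = sum(counts.get(c, 0.0) for c in CLASSES)
        if total == 0:
            return None
        return {c: counts.get(c, 0.0) / total for c in CLASSES}

    def distribution(self, key):
        """Walk down the trie, interpolating each node with its ancestors."""
        node = self.root
        uniform = {c: 1.0 / len(CLASSES) for c in CLASSES}
        dist = self._normalize(node.counts) or uniform
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                break  # back off to the deepest matching node
            local = self._normalize(node.counts)
            if local:
                lam = self.smoothing
                dist = {c: lam * dist[c] + (1 - lam) * local[c] for c in CLASSES}
        return dist


def build_tries(seeds, soft_labels):
    """Stage 1/2: fill prefix and suffix tries from seeds and soft estimates."""
    prefix, suffix = CharTrie(), CharTrie()
    for word, label in seeds:
        prefix.add(word, label)
        suffix.add(word[::-1], label)  # suffix trie stores reversed words
    for word, dist in soft_labels:
        for label, prob in dist.items():
            prefix.add(word, label, prob)
            suffix.add(word[::-1], label, prob)
    return prefix, suffix


def classify(word, prefix, suffix):
    """Stage 3: combine the classifiers (here by simple averaging)."""
    p = prefix.distribution(word)
    s = suffix.distribution(word[::-1])
    return {c: (p[c] + s[c]) / 2 for c in CLASSES}


def bootstrap(seeds, unlabeled, rounds=3):
    """EM-style loop: estimate soft labels, then rebuild the tries."""
    prefix, suffix = build_tries(seeds, [])
    for _ in range(rounds):
        soft = [(w, classify(w, prefix, suffix)) for w in unlabeled]
        prefix, suffix = build_tries(seeds, soft)
    return prefix, suffix


if __name__ == "__main__":
    seeds = [("Johnson", "person"), ("Peterson", "person"),
             ("Maryland", "place"), ("Finland", "place")]
    unlabeled = ["Anderson", "Iceland", "Jackson", "Poland"]
    prefix, suffix = bootstrap(seeds, unlabeled)
    print(classify("Wilson", prefix, suffix))
    print(classify("Scotland", prefix, suffix))
```

In this toy setup the suffix trie carries most of the signal: the endings "-son" and "-land" are shared with the seeds, so unseen words with those endings lean toward the matching class even when their prefixes are unknown.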
For all five languages, using the context and morphology tries together gives better accuracy than using either kind alone. Furthermore, boosting improves the results for all languages. Experiments with training size showed that increasing the amount of training data improves overall accuracy. Increasing the length of the provided seed list also improved the F-score.