Difference between revisions of "Cucerzan and Yarowsky, SIGDAT 1999"

From Cohen Courses

Revision as of 20:02, 26 October 2010

Citation

Cucerzan, S. and Yarowsky, D. 1999. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the Joint SIGDAT Conference on EMNLP and VLC (1999), pp. 90-99.

Online version

ACL Anthology

Summary

This paper describes a language-independent, EM-style bootstrapping algorithm for building a named entity recognizer. The goal is named entity tagging with minimal information about the language. Because some morphological information and contextual patterns are good indicators of certain named entity classes, the bootstrapping algorithm iteratively learns from both the word-internal and the contextual information of entities.

The authors experimented with five languages: English, Romanian, Greek, Turkish and Hindi. For each entity class, they provide a short list of seeds. They also use some basic particularities of each language, such as capitalization, word separators and language-related exceptions.

The algorithm used in the paper can be described in several stages:

  • Stage 0: Define the classes and fill in the initial class seeds for each language.
  • Stage 1: Read the text and build the character-based trie structures. A total of 4 tries are built: 2 for context (left and right) and 2 for morphological patterns (prefix and suffix).
  • Stage 2: Introduce the training information into the tries, then apply the bootstrapping algorithm and re-estimate the probability distributions at each node.
  • Stage 3: There are 4 classifiers available for each token. All of these classifiers are combined to decide whether the token is an entity and, if so, which class it belongs to.
  • Stage 4: Save the classified tokens and contexts.

[[File:Algorithm.png]]
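To make Stages 0 to 2 more concrete, here is a minimal Python sketch of a character-based trie whose nodes accumulate per-class weights, from which a class distribution can be read off at any node. It is an illustration under stated assumptions, not the authors' implementation: the names `TrieNode`, `CharTrie` and `build_morph_tries`, the seed words, and the simple "deepest matching node" lookup are all hypothetical, and the paper's own smoothing and re-estimation steps are not reproduced.

```python
from collections import defaultdict


class TrieNode:
    """One node of a character trie, holding per-class accumulated weight."""

    def __init__(self):
        self.children = {}
        self.counts = defaultdict(float)  # class label -> accumulated weight


class CharTrie:
    """Character-based trie (hypothetical helper, not from the paper)."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, chars, label, weight=1.0):
        # Add the weight to every node on the path, root included, so that
        # shorter prefixes summarise all the longer strings below them.
        node = self.root
        node.counts[label] += weight
        for ch in chars:
            node = node.children.setdefault(ch, TrieNode())
            node.counts[label] += weight

    def distribution(self, chars):
        # Walk as deep as the trie allows, then normalise the counts at the
        # deepest matching node (an assumed, simplified back-off).
        node = self.root
        for ch in chars:
            nxt = node.children.get(ch)
            if nxt is None:
                break
            node = nxt
        total = sum(node.counts.values())
        if not total:
            return {}
        return {label: c / total for label, c in node.counts.items()}


def build_morph_tries(seeds):
    """Build prefix and suffix tries from {class label: [seed words]}."""
    prefix_trie, suffix_trie = CharTrie(), CharTrie()
    for label, words in seeds.items():
        for word in words:
            prefix_trie.add(word, label)
            suffix_trie.add(reversed(word), label)  # suffixes: read right to left
    return prefix_trie, suffix_trie


# Illustrative seeds only; these words are assumptions, not the paper's lists.
seeds = {"person": ["Ionescu", "Popescu"], "place": ["Bucuresti"]}
prefix_trie, suffix_trie = build_morph_tries(seeds)

# An unseen word that shares the suffix "escu" with the person seeds.
dist = suffix_trie.distribution(reversed("Georgescu"))
print(dist)
```

In a bootstrapping loop, newly classified tokens would be fed back through `add` with fractional weights and the node distributions re-read on the next pass; the same structure could hold left and right context strings for the two context tries.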

For all five languages, using the context and morphology tries together gives better accuracy than using only one of them. Furthermore, boosting improves the results for all languages. Experiments with the training set size showed that increasing it improves the total accuracy, due to more accurate classifications. Increasing the length of the provided seed list also improved the F-score.

Notes: hierarchically smoothed trie structures; person and place.
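The combination of evidence from several tries can be illustrated with a short Python sketch. It assumes that each of the four classifiers yields a class distribution for a token, and it merges them with a plain weighted average followed by a threshold; the functions `combine` and `classify`, the averaging rule, the threshold and the example numbers are hypothetical choices for illustration, not the combination scheme reported in the paper.

```python
def combine(distributions, weights=None):
    """Weighted average of class distributions (assumed combination rule)."""
    weights = weights or [1.0] * len(distributions)
    combined = {}
    for dist, w in zip(distributions, weights):
        for label, p in dist.items():
            combined[label] = combined.get(label, 0.0) + w * p
    total = sum(combined.values())
    if not total:
        return {}
    return {label: v / total for label, v in combined.items()}


def classify(distributions, threshold=0.5, default="other"):
    """Pick the most probable class, or `default` if evidence is too weak."""
    combined = combine(distributions)
    if not combined:
        return default
    best = max(combined, key=combined.get)
    return best if combined[best] >= threshold else default


# Made-up distributions standing in for the four classifiers of one token.
prefix_dist = {"person": 0.6, "place": 0.4}
suffix_dist = {"person": 0.9, "place": 0.1}
left_ctx_dist = {"person": 0.5, "place": 0.5}
right_ctx_dist = {}  # context never seen: contributes no evidence

all_four = [prefix_dist, suffix_dist, left_ctx_dist, right_ctx_dist]
combined = combine(all_four)
label = classify(all_four)
print(combined, label)
```

A classifier with no evidence simply drops out of the average here, which is one simple way for morphological evidence to carry a token whose context has not been seen, and vice versa.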

Related Papers