Difference between revisions of "Benajiba and Rosso, LREC 2008"


Revision as of 13:58, 29 November 2010

Citation

Yassine Benajiba and Paolo Rosso. 2008. Arabic Named Entity Recognition using Conditional Random Fields. In Proc. of Workshop on HLT&NLP within the Arabic World, LREC'08.

Online version

LREC 2008

Summary

This paper describes a Conditional Random Fields (CRF) approach to the Arabic Named Entity Recognition (NER) problem. Arabic is a highly inflectional language in which words can take both prefixes and suffixes. Besides this complex morphology, Arabic has no capital letters, which makes the NER task even harder.

Before this paper, the authors used a Maximum Entropy model (Benajiba et al, CICLing 2007) with binary features based on the word itself, the preceding word, the bigrams around the word, and external resources. To make named entity detection easier, they then adopted a two-step approach in which the first step detects the entities and the second step classifies them (Benajiba and Rosso, IICAI 2007). This method improved their F-measure by almost 18%.

In this paper the authors used Conditional Random Fields. To address the data sparsity problem, they performed word segmentation, which separates the different components of a word with a space character. For the experiments they used ANERcorp. Four types of named entities were targeted: person, location, organization and miscellaneous. As features, they used part-of-speech tags (POS tags) and Base Phrase Chunks (BPC). They also used ANERgazet as an external resource, together with a nationality feature on the preceding word, since person named entities are preceded by a nationality most of the time.
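
The feature set described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tag labels, the tiny gazetteer and nationality lists, and the simplified Latin transliterations are all assumptions made for the example. It only shows how per-token features of this kind (word, POS tag, BPC tag, gazetteer membership, nationality of the preceding word) might be assembled before being passed to a CRF toolkit.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy resources standing in for ANERgazet and a nationality list.
GAZETTEER: Set[str] = {"dimashq", "bayrut"}
NATIONALITIES: Set[str] = {"faransi", "lubnani"}

# One token after segmentation: (word, POS tag, base phrase chunk tag).
Token = Tuple[str, str, str]


def token_features(sent: List[Token], i: int) -> Dict[str, object]:
    """Build the feature dictionary for the i-th token of a sentence."""
    word, pos, bpc = sent[i]
    # A sentence-initial token has no preceding word; use a boundary marker.
    prev_word = sent[i - 1][0] if i > 0 else "<BOS>"
    return {
        "word": word,
        "pos": pos,
        "bpc": bpc,
        "in_gazetteer": word in GAZETTEER,
        "prev_is_nationality": prev_word in NATIONALITIES,
    }


def sentence_features(sent: List[Token]) -> List[Dict[str, object]]:
    """Build one feature dictionary per token."""
    return [token_features(sent, i) for i in range(len(sent))]


# "the French president Nicolas Sarkozy", segmented and transliterated
# (tags are illustrative, not taken from the paper).
sent: List[Token] = [
    ("al", "DET", "B-NP"),
    ("ra'is", "NN", "I-NP"),
    ("al", "DET", "I-NP"),
    ("faransi", "JJ", "I-NP"),
    ("nikula", "NNP", "B-NP"),
    ("sarkuzi", "NNP", "I-NP"),
]
feats = sentence_features(sent)
```

In this toy sentence the token `nikula` receives `prev_is_nationality = True` because it follows `faransi`, which is the kind of cue the nationality feature is meant to capture for person names.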


Tokenization also improved the results, especially recall.
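
Why segmentation helps with data sparsity can be seen in a small sketch. This is a deliberately naive illustration, not the segmenter used in the paper: the prefix inventory, the `min_stem` threshold, and the Latin transliterations are assumptions. It greedily strips a few clitic-like prefixes and joins the pieces with spaces, so that several surface forms share the same stem token.

```python
from typing import List

# Hypothetical prefix inventory in simplified Latin transliteration.
PREFIXES = ("wa", "bi", "li", "al")


def segment(word: str, min_stem: int = 3) -> List[str]:
    """Greedily strip known prefixes, keeping at least min_stem characters."""
    parts: List[str] = []
    stripped = True
    while stripped:
        stripped = False
        for prefix in PREFIXES:
            if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
                parts.append(prefix)
                word = word[len(prefix):]
                stripped = True
                break
    parts.append(word)
    return parts


surface_forms = ["bayt", "albayt", "waalbayt", "kitab", "alkitab", "waalkitab"]

# Segmented text: components of each word separated by a space character.
segmented = [" ".join(segment(w)) for w in surface_forms]

# Distinct token types before and after segmentation.
types_before = set(surface_forms)
types_after = {piece for w in surface_forms for piece in segment(w)}
```

Here six distinct surface forms reduce to four token types (`bayt`, `kitab`, `al`, `wa`), so each stem is observed more often in training. A purely string-based rule like this one will also split words that merely begin with a prefix-like sequence, which is why a real system needs a proper morphological segmenter.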