Benajiba and Rosso, LREC 2008


Revision as of 06:32, 28 November 2010

Citation

Yassine Benajiba and Paolo Rosso. 2008. Arabic Named Entity Recognition using Conditional Random Fields. In Proc. of Workshop on HLT&NLP within the Arabic World, LREC'08.

Online version

LREC 2008

Summary

This paper describes a Conditional Random Fields (CRF) approach to Arabic Named Entity Recognition (NER). Arabic is a highly inflectional language in which words can take both prefixes and suffixes. Beyond this complex morphology, Arabic script has no capital letters, so capitalization, a significant NER feature in languages that mark it, is unavailable.

The authors train a CRF model. To reduce data sparsity, they perform word segmentation, that is, they separate the different components of a word with a space character. The experiments use ANERcorp and target four types of named entities: person, location, organization and miscellaneous. The features include part-of-speech (POS) tags and Base Phrase Chunks (BPC). They also use ANERgazet as an external resource, along with a feature for the nationality of the preceding word, since person-related named entities are preceded by a nationality most of the time.
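The feature set described above can be sketched as a per-token feature dictionary of the kind CRF toolkits commonly accept. This is only an illustration under stated assumptions: the function name `token_features`, the dictionary keys, the tag values, and the English stand-in tokens are all hypothetical and are not taken from the paper.

```python
def token_features(sentence, i, gazetteer, nationalities):
    """Build a feature dict for token i of a sentence.

    `sentence` is a list of dicts with "word", "pos" and "bpc" keys.
    `gazetteer` and `nationalities` are plain sets of strings
    (hypothetical stand-ins for ANERgazet and a nationality list).
    """
    token = sentence[i]
    prev_word = sentence[i - 1]["word"] if i > 0 else None
    return {
        "word": token["word"],                        # the token itself
        "pos": token["pos"],                          # part-of-speech tag
        "bpc": token["bpc"],                          # base phrase chunk tag
        "in_gazetteer": token["word"] in gazetteer,   # external resource lookup
        "prev_is_nationality": prev_word in nationalities,
    }


# English glosses stand in for segmented Arabic tokens (illustrative only).
example_sentence = [
    {"word": "president", "pos": "NN", "bpc": "B-NP"},
    {"word": "french", "pos": "JJ", "bpc": "I-NP"},
    {"word": "sarkozy", "pos": "NNP", "bpc": "I-NP"},
]
example_gazetteer = {"sarkozy"}
example_nationalities = {"french"}

example_features = [
    token_features(example_sentence, i, example_gazetteer, example_nationalities)
    for i in range(len(example_sentence))
]
```

Each sentence then becomes a list of such dictionaries, paired with a list of entity labels for training.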

Before this paper, the authors used a Maximum Entropy (ME) model with binary features based on the word itself, the preceding word, the bigrams around the word, and external resources. To make named entities easier to detect, they also used a 2-step approach in which the first step detects the entities and the second step classifies them. This method improved the recall by almost ... percent.
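The 2-step idea (detect entity boundaries first, classify the detected spans second) can be sketched as a small pipeline. The sketch assumes B/I/O boundary tags and takes the two steps as arbitrary callables; `two_step_ner`, `toy_detect` and `toy_classify` are hypothetical names, and the toy lookups below merely stand in for trained models.

```python
def two_step_ner(tokens, detect, classify):
    """Run boundary detection, then classify each detected span.

    `detect` maps a token list to a list of "B"/"I"/"O" tags.
    `classify` maps a list of span tokens to an entity type.
    Returns a list of (start, end, entity_type) with `end` exclusive.
    """
    tags = detect(tokens)
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i))   # close the previous span
            start = i
        elif tag == "I":
            if start is None:
                start = i                  # tolerate an "I" with no "B"
        else:  # "O"
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return [(s, e, classify(tokens[s:e])) for s, e in spans]


# Toy stand-ins for the two models (illustrative only).
KNOWN_NAME_TOKENS = {"nicolas", "sarkozy", "paris"}
KNOWN_TYPES = {"nicolas sarkozy": "PER", "paris": "LOC"}


def toy_detect(tokens):
    tags = []
    for i, tok in enumerate(tokens):
        if tok not in KNOWN_NAME_TOKENS:
            tags.append("O")
        elif i > 0 and tokens[i - 1] in KNOWN_NAME_TOKENS:
            tags.append("I")
        else:
            tags.append("B")
    return tags


def toy_classify(span_tokens):
    return KNOWN_TYPES.get(" ".join(span_tokens), "MISC")


example_tokens = ["president", "french", "nicolas", "sarkozy", "visited", "paris"]
example_entities = two_step_ner(example_tokens, toy_detect, toy_classify)
```

Separating the steps lets the first model solve a simpler boundary-only problem, leaving type decisions to a second model that sees whole spans.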

Tokenization also improved the results, especially recall.
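To make the segmentation step concrete, here is a deliberately naive sketch that splits a few common proclitics off a word and joins the pieces with spaces. It assumes Buckwalter-style transliteration, a small hand-picked prefix list, and a minimum stem length; none of this comes from the paper, and a greedy prefix stripper like this one will wrongly split ordinary words that merely begin with the same letters. Real systems rely on morphological analysis.

```python
# Hypothetical prefix list in Buckwalter-style transliteration:
# "w" (and), "b" (with/in), "l" (for/to), "Al" (the).
TOY_PREFIXES = ("w", "b", "l", "Al")


def toy_segment(word, prefixes=TOY_PREFIXES, min_stem=3):
    """Greedily strip known prefixes, keeping at least `min_stem` characters."""
    parts = []
    rest = word
    stripped = True
    while stripped:
        stripped = False
        for prefix in prefixes:
            if rest.startswith(prefix) and len(rest) - len(prefix) >= min_stem:
                parts.append(prefix)
                rest = rest[len(prefix):]
                stripped = True
                break
    parts.append(rest)
    return " ".join(parts)


segmented_example = toy_segment("wbAlmdrsp")
```

After segmentation, a frequent stem is counted once regardless of which clitics were attached to it, which is how the step reduces data sparsity.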