Benajiba and Rosso, LREC 2008

Citation

Yassine Benajiba and Paolo Rosso. 2008. Arabic Named Entity Recognition using Conditional Random Fields. In Proc. of Workshop on HLT&NLP within the Arabic World, LREC'08.

Online version

LREC 2008

Summary

This paper describes a Conditional Random Fields (CRF) approach to the Arabic Named Entity Recognition (NER) problem. Arabic is a highly inflectional language in which words can take both prefixes and suffixes. In addition to this complex morphology, Arabic has no capital letters, which makes the NER task even harder.

Prior to this paper, the authors used a Maximum Entropy (ME) model (Benajiba et al., CICLing 2007) with binary features based on the word itself, the preceding word, the bigrams around the word, and external resources. To make named entities easier to detect, they then adopted a 2-step approach in which the first step detects the entities and the second step classifies them (Benajiba and Rosso, IICAI 2007). This method improved their F-measure by almost 18%.

In this paper the authors used Conditional Random Fields instead of the Maximum Entropy model. To reduce the data sparsity problem, they performed word segmentation, that is, they separated the different components of a word with a space character. They ran their experiments on the ANERcorp dataset, in which four types of named entities (person, location, organization and miscellaneous) are tagged.
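To illustrate why separating word components reduces sparsity, here is a minimal sketch. It is not the authors' segmenter: the proclitic list, the minimum stem length, and the function name `segment` are assumptions made for illustration, and the words are toy examples in Buckwalter transliteration.

```python
from typing import List

# Assumed toy proclitic list in Buckwalter transliteration:
# w = "and", b = "with/in", l = "for/to". Not taken from the paper.
PROCLITICS = ("w", "b", "l")

# Assumed guard so that very short words are never split.
MIN_STEM_LEN = 3


def segment(word: str) -> str:
    """Greedily split leading proclitics off a word, joining parts with spaces."""
    parts: List[str] = []
    rest = word
    # Strip one proclitic at a time while enough characters remain for a stem.
    while len(rest) - 1 >= MIN_STEM_LEN and rest[0] in PROCLITICS:
        parts.append(rest[0])
        rest = rest[1:]
    parts.append(rest)
    return " ".join(parts)


# Four surface forms built on the same stem "mSr".
surface_forms = ["mSr", "wmSr", "bmSr", "wbmSr"]
segmented = [segment(w) for w in surface_forms]

# Vocabulary size before and after segmentation.
vocab_before = set(surface_forms)
vocab_after = {tok for s in segmented for tok in s.split()}
print(len(vocab_before), len(vocab_after))
```

In this toy case the four distinct surface forms collapse to three token types, with the stem shared across all of them. The greedy rule also shows the limits of the approach: it wrongly splits a word such as `lbnAn` into `l b nAn`, which is why practical systems rely on a proper morphological segmenter.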

As features, they used part-of-speech (POS) tags and Base Phrase Chunks (BPC). They also used ANERgazet as an external resource, along with a feature marking whether the preceding word is a nationality, since person named entities are preceded by a nationality most of the time.
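The feature set described above could be assembled per token roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the feature names, the function names, and the tiny `GAZETTEER` and `NATIONALITIES` sets are hypothetical stand-ins (the real system uses ANERgazet), and the example words are toy Buckwalter transliterations.

```python
from typing import Dict, List, Set

# Hypothetical stand-ins for ANERgazet and a nationality list.
GAZETTEER: Set[str] = {"mSr", "dm$q"}
NATIONALITIES: Set[str] = {"AlmSry", "Alswry"}


def token_features(words: List[str], pos_tags: List[str],
                   chunk_tags: List[str], i: int) -> Dict[str, object]:
    """Build the feature dictionary for the token at position i."""
    prev_word = words[i - 1] if i > 0 else None
    return {
        "word": words[i],
        "pos": pos_tags[i],                    # part-of-speech tag
        "bpc": chunk_tags[i],                  # base phrase chunk tag
        "in_gazetteer": words[i] in GAZETTEER,
        # True when the preceding word is a known nationality.
        "prev_is_nationality": prev_word in NATIONALITIES,
    }


def sentence_features(words: List[str], pos_tags: List[str],
                      chunk_tags: List[str]) -> List[Dict[str, object]]:
    """Build one feature dictionary per token in the sentence."""
    return [token_features(words, pos_tags, chunk_tags, i)
            for i in range(len(words))]


# Toy sentence; the POS and chunk tags are illustrative, not tagger output.
words = ["Alr}ys", "AlmSry", "mbArk"]
pos_tags = ["NN", "JJ", "NNP"]
chunk_tags = ["B-NP", "I-NP", "I-NP"]

for feats in sentence_features(words, pos_tags, chunk_tags):
    print(feats)
```

A list of dictionaries like this, one per token, is the input shape that many CRF toolkits accept, so the nationality cue on the preceding word becomes available when the model labels a person name.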

The experimental results showed that tokenization improves performance, especially recall. Among the individual features, the POS tag gave the best overall improvement: although it lowers precision, it raises recall substantially. The system achieves its best performance when it uses all features. Overall, using CRF instead of the ME model gave a 2-point increase, tokenization gave a 3-point improvement, and using the features increased performance by 9 points.