Difference between revisions of "ANERcorp"

From Cohen Courses

Latest revision as of 15:39, 30 November 2010

ANERcorp is a manually annotated Arabic corpus created for Arabic named entity recognition (NER) tasks. It consists of two parts: training and testing. It was annotated by a single person to keep the annotation consistent.

The corpus contains more than 150K tokens, 11% of which are named entities. Every token is annotated with one of the following labels: person, location, organization, miscellaneous, or other. The distribution of the named entities is given below:

  • Person : 39%
  • Location : 30.4%
  • Organization : 20.6%
  • Miscellaneous : 10%
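
A distribution like the one above can be computed by counting labels and ignoring tokens marked as other. The following is a minimal sketch, not the procedure used by the corpus authors: the function name `entity_distribution` and the toy label list are hypothetical, and the sketch assumes percentages are taken over labeled tokens, which the page does not state.

```python
from collections import Counter

def entity_distribution(labels, other_label="other"):
    """Return the percentage of each entity label among non-'other' tokens.

    Assumption: percentages are computed per token, not per entity mention.
    """
    counts = Counter(label for label in labels if label != other_label)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}

# Toy label sequence (hypothetical, not taken from ANERcorp).
sample = ["person", "other", "location", "other",
          "person", "organization", "other", "miscellaneous"]
dist = entity_distribution(sample)
for label, pct in sorted(dist.items()):
    print(f"{label}: {pct:.1f}%")
```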

The corpus consists of 316 articles selected from newswire and other types of web sources. Before tagging, the data was preprocessed. In Arabic, one word may be written in several forms. To reduce these differences and decrease data sparseness, the data was normalized by reducing the different forms of a word to a single form.
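
The page does not say which written forms were merged. As an illustration only, the sketch below applies three normalization rules that are common in Arabic text processing: removing short-vowel diacritics, removing the tatweel (elongation) character, and mapping hamza and madda variants of alef to bare alef. These rules and the name `normalize_token` are assumptions, not a description of the ANERcorp preprocessing.

```python
import re

# Assumed rules; the actual ANERcorp normalization is not documented here.
_DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan through sukun
_TATWEEL = "\u0640"                          # elongation character
_ALEF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda above -> bare alef
})

def normalize_token(token):
    """Reduce several written forms of an Arabic word to one form."""
    token = _DIACRITICS.sub("", token)
    token = token.replace(_TATWEEL, "")
    return token.translate(_ALEF_VARIANTS)

# Two spellings of the same name collapse to a single form.
print(normalize_token("\u0623\u062D\u0645\u062F") ==
      normalize_token("\u0627\u062D\u0645\u062F"))
```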

The corpus is publicly available and can be downloaded from http://users.dsic.upv.es/~ybenajiba/ (download URL). It is in standard CoNLL format.
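
A file in this kind of format can be read with a few lines of code. The sketch below assumes one token per line with its label as the last whitespace-separated field, and treats blank lines, if present, as sentence boundaries. The exact column layout and label names of the ANERcorp files are not given on this page, so the function `read_conll` and the sample lines are hypothetical.

```python
def read_conll(lines):
    """Parse CoNLL-style lines into sentences of (token, label) pairs.

    Assumptions: token is the first field, label is the last field,
    and a blank line ends the current sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # skip lines without both a token and a label
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences

# Hypothetical sample; real files would be opened with encoding="utf-8".
sample_lines = ["w1 PER", "w2 O", "", "w3 LOC"]
parsed = read_conll(sample_lines)
print(parsed)
```

For a file on disk, the same function can be passed an open file object, since it only iterates over lines.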

The data set has been used in several papers, such as Benajiba et al. (CICLing 2007), Benajiba and Rosso (IICAI 2007), and Benajiba and Rosso (LREC 2008).