Difference between revisions of "ANERcorp"

From Cohen Courses

Revision as of 15:14, 30 November 2010

ANERcorp is a manually annotated Arabic corpus created for Arabic named entity recognition (NER) tasks. It consists of two parts: training and testing. It was annotated by a single person in order to guarantee the coherence of the annotation.

There are more than 150K tokens in the corpus, and 11% of them are named entities. Every token in the corpus is annotated with one of the following labels: person, location, organization, miscellaneous or other. The distribution of the named entities is given below:

  • Person : 39%
  • Location : 30.4%
  • Organization : 20.6%
  • Miscellaneous : 10%
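
As a rough worked example, the percentages above can be turned into approximate counts. This is only a sketch: it takes 150,000 as the token count (the text says "more than 150K", so the results are lower bounds) and assumes the class shares apply to the named-entity tokens. The function and constant names are illustrative, not part of the corpus.

```python
# Figures taken from the text; 150_000 is a lower bound ("more than 150K tokens").
TOTAL_TOKENS = 150_000
NE_FRACTION = 0.11
DISTRIBUTION = {
    "Person": 0.39,
    "Location": 0.304,
    "Organization": 0.206,
    "Miscellaneous": 0.10,
}


def estimate_entity_counts(total_tokens=TOTAL_TOKENS,
                           ne_fraction=NE_FRACTION,
                           distribution=DISTRIBUTION):
    """Return approximate named-entity token counts per class.

    Assumption: the class shares apply to the named-entity tokens.
    """
    ne_tokens = total_tokens * ne_fraction
    return {label: round(ne_tokens * share)
            for label, share in distribution.items()}


counts = estimate_entity_counts()
for label, count in counts.items():
    print(f"{label}: about {count} tokens")
```

Under these assumptions, roughly 16,500 tokens belong to named entities, with the person class accounting for the largest share.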





The corpus has been used in several papers. The data is in the standard CoNLL format.
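
A minimal sketch of reading CoNLL-style data is given below. It assumes one token per line followed by its tag, separated by whitespace, with blank lines between sentences. The exact column layout and tag names used in the ANERcorp files are not stated here, so `read_conll` and the sample tags (`B-PER`, `O`) are hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group (token, tag) pairs into sentences.

    Assumptions: first column is the token, last column is the tag,
    and a blank line ends a sentence.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample with placeholder tokens and tags.
sample = ["A B-PER", "B O", "", "C O"]
sentences = read_conll(sample)
```

Any iterable of strings works as input, so an open file handle can be passed directly in place of the sample list.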

In Arabic, one word may be written in several forms. To reduce these differences and decrease data sparseness, the data has been normalized by reducing the different forms of a word to a single form.
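
The exact normalization rules applied to the corpus are not listed here. The sketch below only illustrates the general idea with three commonly used rules, which are assumptions rather than the corpus's documented procedure: stripping short-vowel diacritics, removing the tatweel (elongation) character, and mapping alef variants to a bare alef. The name `normalize_arabic` is hypothetical.

```python
import re

# Assumed rules, not the documented ANERcorp procedure.
_DIACRITICS = re.compile("[\u064B-\u0652]")           # fathatan .. sukun
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")   # alef with madda / hamza above / hamza below
_TATWEEL = "\u0640"                                   # elongation character
_BARE_ALEF = "\u0627"


def normalize_arabic(word: str) -> str:
    """Reduce several written forms of a word to one form."""
    word = _DIACRITICS.sub("", word)
    word = word.replace(_TATWEEL, "")
    return _ALEF_VARIANTS.sub(_BARE_ALEF, word)


# Two spellings of the same name collapse to one form.
with_hamza = "\u0623\u062D\u0645\u062F"   # alef with hamza above + ha + meem + dal
without_hamza = "\u0627\u062D\u0645\u062F"
same = normalize_arabic(with_hamza) == normalize_arabic(without_hamza)
```

Collapsing spellings this way means that variants of a word share one entry in the vocabulary, which is how normalization reduces sparseness.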

The corpus consists of 316 articles selected from newswire and other types of web sources.


ANERgazet consists of three different gazetteers, all built manually using web resources:

  • Location Gazetteer : 1,950 names of continents, countries, cities, rivers and mountains found in the Arabic version of Wikipedia.
  • Person Gazetteer : originally a list of 1,920 complete names of people found in Wikipedia and other websites. After splitting the names into first names and last names and omitting repeated names, the final list contains 2,309 names.
  • Organizations Gazetteer : a list of 262 names of companies, football teams and other organizations.
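
The person gazetteer step (splitting complete names and dropping repeats) and a simple gazetteer lookup could be sketched as follows. This is an illustration under stated assumptions: names are split on whitespace, membership is tested by exact string match, and the helper names and sample entries are hypothetical rather than taken from ANERgazet.

```python
from typing import Dict, Iterable, List, Set


def split_and_dedupe(full_names: Iterable[str]) -> List[str]:
    """Split complete names into parts and keep each part once, in order.

    Assumption: name parts are separated by whitespace.
    """
    seen: Set[str] = set()
    parts: List[str] = []
    for name in full_names:
        for part in name.split():
            if part not in seen:
                seen.add(part)
                parts.append(part)
    return parts


def gazetteer_features(token: str,
                       gazetteers: Dict[str, Set[str]]) -> Dict[str, bool]:
    """Report, per gazetteer, whether the token is listed (exact match)."""
    return {name: token in entries for name, entries in gazetteers.items()}


# Hypothetical entries, written in Latin script for readability.
person_parts = split_and_dedupe(["Ali Hassan", "Omar Hassan"])
gazetteers = {
    "person": set(person_parts),
    "location": {"Cairo"},
    "organization": {"UNESCO"},
}
features = gazetteer_features("Hassan", gazetteers)
```

Splitting can yield more entries than the original list, which is consistent with the person gazetteer growing from 1,920 complete names to 2,309 name parts.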


[download URL]