Difference between revisions of "ANERcorp"

From Cohen Courses

Revision as of 16:15, 29 November 2010

ANERcorp is a manually annotated training and testing corpus built from news wire and other web sources. Its annotation classes are person, location, organization, miscellaneous, and other.

The corpus contains more than 150K tokens, of which 11% are named entities (NE).

The corpus has been used in several papers. The data is in standard CoNLL format.
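As a rough illustration of working with CoNLL-style data, the sketch below reads one token and one tag per line and computes the share of each entity type. It assumes whitespace-separated `token tag` lines and `B-`/`I-`/`O` tags; the tag names (`B-PERS`, `B-LOC`) and the sample sentence are hypothetical and not taken from the corpus files, so check the actual files before relying on it.

```python
from collections import Counter

def read_conll(lines):
    """Yield (token, tag) pairs from 'token tag' lines, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace: the tag is the final column.
        token, tag = line.rsplit(None, 1)
        yield token, tag

def entity_type_shares(pairs):
    """Count one mention per B- tag and return each type's percentage share."""
    counts = Counter(tag[2:] for _, tag in pairs if tag.startswith("B-"))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {etype: 100.0 * n / total for etype, n in counts.items()}

# Hypothetical sample, not real corpus data.
sample = [
    "John B-PERS",
    "Smith I-PERS",
    "visited O",
    "Cairo B-LOC",
    ". O",
]
shares = entity_type_shares(read_conll(sample))
```

To run it on a file, pass an open file object instead of the list, for example `read_conll(open(path, encoding="utf-8"))`, where `path` is wherever the downloaded data lives.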

In Arabic, one word may be written in several forms. To reduce these differences and decrease data sparseness, the data has been normalized by reducing the different forms of a word to a single form.
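The page does not say which normalization rules were applied. The following is only a sketch of the general idea, using rules that are common in Arabic text processing and are assumptions here: stripping short-vowel diacritics, removing the tatweel (elongation) character, and mapping alef variants to a bare alef. The function name `normalize` is hypothetical.

```python
import re

# Assumed rules for illustration; not taken from the corpus documentation.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
DIACRITICS = re.compile("[\u064B-\u0652]")          # fathatan through sukun
TATWEEL = "\u0640"                                  # elongation character
BARE_ALEF = "\u0627"

def normalize(word):
    """Map several written forms of an Arabic word to one form."""
    word = DIACRITICS.sub("", word)
    word = word.replace(TATWEEL, "")
    word = ALEF_VARIANTS.sub(BARE_ALEF, word)
    return word

# Two spellings of the same name, with and without hamza on the alef.
with_hamza = "\u0623\u062D\u0645\u062F"
without_hamza = "\u0627\u062D\u0645\u062F"
same_form = normalize(with_hamza) == normalize(without_hamza)
```

After normalization both spellings become the same string, so they count as one vocabulary item rather than two, which is how normalization reduces sparseness.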

The corpus consists of 316 articles selected from news wire and other types of web sources.

  • Person : 39%
  • Location : 30.4%
  • Organization : 20.6%
  • Miscellaneous : 10%

Download URL: http://users.dsic.upv.es/~ybenajiba/