Difference between revisions of "ANERcorp"
From Cohen Courses
[[http://users.dsic.upv.es/~ybenajiba/ download URL]]
Revision as of 15:12, 29 November 2010
ANERcorp is a manually annotated corpus for training and testing, built from newswire and other web sources. Each token is labeled as person, location, organization, miscellaneous, or other.
The corpus contains more than 150K tokens, 11% of which are part of named entities (NE).
The corpus has been used in several papers. The data is in standard CoNLL format.
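As a rough illustration of working with CoNLL-style data, the sketch below reads token/tag lines and computes the share of tokens that belong to named entities. It assumes one token per line with the tag in the last whitespace-separated column, blank lines between sentences, and `O` as the tag for the "other" class. The function names `read_conll` and `ne_fraction`, the sample tags, and these layout details are illustrative assumptions and should be checked against the actual files.

```python
def read_conll(lines):
    """Group CoNLL-style lines into sentences of (token, tag) pairs.

    Assumption: first column is the token, last column is the tag,
    and a blank line ends a sentence.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def ne_fraction(sentences, outside_tag="O"):
    """Return the fraction of tokens whose tag is not the outside tag."""
    tags = [tag for sentence in sentences for _, tag in sentence]
    if not tags:
        return 0.0
    return sum(tag != outside_tag for tag in tags) / len(tags)


# Hypothetical sample in Latin script, only to show the layout.
sample = ["John B-PERS", "lives O", "in O", "Paris B-LOC", "", "He O"]
sentences = read_conll(sample)
fraction = ne_fraction(sentences)
```

On the real corpus, the same function could be applied to the lines of the downloaded file to check the reported proportion of named-entity tokens.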
In Arabic, one word may be written in several forms. To normalize these differences and decrease data sparseness, the different forms of a word have been reduced to a single form.
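The page does not say which normalization rules were applied to ANERcorp. The sketch below only illustrates the general idea with rules commonly used for Arabic text: mapping alef variants to a bare alef and dropping short-vowel diacritics and the tatweel (elongation) character. The function name `normalize_arabic` and the choice of rules are assumptions, not the corpus authors' procedure.

```python
import re

# Alef with madda, hamza above, and hamza below (assumed variant set).
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
# Diacritics U+064B..U+0652 plus tatweel U+0640.
_DIACRITICS_AND_TATWEEL = re.compile("[\u064B-\u0652\u0640]")


def normalize_arabic(word):
    """Reduce some spelling variants of an Arabic word to one form."""
    word = _DIACRITICS_AND_TATWEEL.sub("", word)
    return _ALEF_VARIANTS.sub("\u0627", word)


# Two spellings of the same name collapse to a single form.
with_hamza = "\u0623\u062D\u0645\u062F"
bare_alef = "\u0627\u062D\u0645\u062F"
same_form = normalize_arabic(with_hamza) == normalize_arabic(bare_alef)
```

Collapsing variants this way means a model sees one vocabulary entry instead of several, which is how normalization reduces sparseness.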