Reuters 21578

From Cohen Courses

Revision as of 23:54, 25 September 2011

Citation

Reuters-21578, by D. Lewis, et al., 1987.

The Reuters 21578 dataset is used for text categorization, and consists of documents that appeared on the Reuters newswire in 1987.

The dataset consists of 22 files: the first 21 files contain 1000 documents each, and the 22nd contains 578 documents, for 21,578 documents in total. The data is in SGML format.
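
As a rough illustration of working with this layout, the Python sketch below checks the document count and pulls documents out of SGML text with regular expressions. It assumes each document is wrapped in a <REUTERS ...> ... </REUTERS> element whose opening tag carries attributes such as NEWID; that assumption, the function name parse_documents, and the hand-written sample are not taken from this page and should be checked against the distribution itself.

```python
import re

# Assumption: each document is wrapped in <REUTERS ...> ... </REUTERS> tags.
DOC_RE = re.compile(r"<REUTERS\b([^>]*)>(.*?)</REUTERS>", re.DOTALL)
# Attributes in the opening tag are assumed to look like NAME="value".
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_documents(sgml_text):
    """Return a list of (attributes, body) pairs, one per document."""
    docs = []
    for match in DOC_RE.finditer(sgml_text):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        docs.append((attrs, match.group(2)))
    return docs


# 21 files of 1000 documents plus one file of 578 documents.
total_documents = 21 * 1000 + 578

# Tiny hand-written sample, NOT real data from the collection.
sample = (
    '<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">\n'
    "<TOPICS><D>cocoa</D></TOPICS>\n"
    "<BODY>Sample text.</BODY>\n"
    "</REUTERS>\n"
    '<REUTERS TOPICS="NO" LEWISSPLIT="TEST" NEWID="2">\n'
    "<BODY>Another sample.</BODY>\n"
    "</REUTERS>\n"
)
parsed = parse_documents(sample)
```

To run this on a real file, read the file's contents into a string first; the file names and text encoding depend on the copy of the distribution at hand. A regular expression is only a convenient shortcut here, not a full SGML parser.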

The categories in this dataset come from five classes:

  • Exchanges: financial exchanges, e.g., "nasdaq"
  • Organizations: named entities of organizations, e.g., "GE"
  • People: named entities of people, e.g., "Paul Volcker"
  • Places: named entities of places, e.g., "Australia"
  • Topics: economic subject categories, e.g., "coconut", "gold", "money supply"

Some more information on the distribution:

                    Number of    Number of Categories  Number of Categories
Category Set        Categories   w/ 1+ Occurrences     w/ 20+ Occurrences
----
EXCHANGES           39           32                    7
ORGS                56           32                    9
PEOPLE              267          114                   15
PLACES              175          147                   60
TOPICS              135          120                   57
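
To make the distribution easier to compare across category sets, the following Python sketch computes, for each set, the fraction of its categories that occur in at least 20 documents. The numbers are copied from the table above; the names CATEGORY_COUNTS and share_with_20_plus are illustrative only.

```python
# (category set, total categories, categories with 1+ occurrences,
#  categories with 20+ occurrences), copied from the table above.
CATEGORY_COUNTS = [
    ("EXCHANGES", 39, 32, 7),
    ("ORGS", 56, 32, 9),
    ("PEOPLE", 267, 114, 15),
    ("PLACES", 175, 147, 60),
    ("TOPICS", 135, 120, 57),
]


def share_with_20_plus(counts):
    """Map each category set to the fraction of its categories with 20+ occurrences."""
    return {name: at_least_20 / total for name, total, _, at_least_20 in counts}


shares = share_with_20_plus(CATEGORY_COUNTS)
total_categories = sum(total for _, total, _, _ in CATEGORY_COUNTS)

for name, share in shares.items():
    print(f"{name:<10} {share:.1%}")
```

By this measure TOPICS and PLACES have the largest share of frequently occurring categories, while most PEOPLE categories are rare.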

It is recommended to use one of the predefined training/test splits, i.e., either the "ModLewis" split or the "ModApte" split.
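
The Python sketch below shows one way the two splits might be selected from per-document attributes. It assumes, based on the README shipped with the distribution rather than on this page, that each document carries LEWISSPLIT and TOPICS attributes, that ModApte keeps only documents with TOPICS="YES", and that ModLewis also keeps documents with TOPICS="NO". Verify these rules against the README before relying on them; the function names are illustrative.

```python
# Assumed mapping from the LEWISSPLIT attribute to a split name.
# Any other value (e.g. "NOT-USED") means the document is left out.
_LEWISSPLIT_TO_SPLIT = {"TRAIN": "train", "TEST": "test"}


def modapte_split(attrs):
    """Return "train", "test", or None (unused) under the assumed ModApte rules."""
    if attrs.get("TOPICS") != "YES":
        return None
    return _LEWISSPLIT_TO_SPLIT.get(attrs.get("LEWISSPLIT"))


def modlewis_split(attrs):
    """Return "train", "test", or None (unused) under the assumed ModLewis rules."""
    if attrs.get("TOPICS") not in ("YES", "NO"):
        return None
    return _LEWISSPLIT_TO_SPLIT.get(attrs.get("LEWISSPLIT"))


# Hand-written attribute dictionaries, NOT real documents.
examples = [
    {"LEWISSPLIT": "TRAIN", "TOPICS": "YES"},
    {"LEWISSPLIT": "TEST", "TOPICS": "NO"},
    {"LEWISSPLIT": "NOT-USED", "TOPICS": "YES"},
]
modapte = [modapte_split(a) for a in examples]
modlewis = [modlewis_split(a) for a in examples]
```

Keeping the split logic in small functions like these makes it easy to switch between the two splits without re-reading the data.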