Difference between revisions of "Reuters 21578"

Latest revision as of 23:02, 25 September 2011

Citation

Reuters-21578, by D. Lewis, et al., 1987.

The Reuters 21578 dataset is used for text categorization, and consists of documents that appeared on the Reuters newswire in 1987.

The dataset consists of 22 files: the first 21 files contain 1000 documents each, and the 22nd contains 578 documents, for a total of 21,578. The data is in SGML format.
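As a rough illustration of the file layout described above, the following sketch splits SGML text into individual documents. It assumes that each document is wrapped in a `<REUTERS ...> ... </REUTERS>` element whose opening tag carries attributes such as `NEWID`; the helper names `iter_documents` and `expected_total` and the sample text are made up for this example and are not part of the dataset.

```python
import re

# Assumption: every document is enclosed in <REUTERS ...> ... </REUTERS>.
DOC_RE = re.compile(r"<REUTERS\b([^>]*)>(.*?)</REUTERS>", re.DOTALL)
# Attributes in the opening tag are assumed to look like NAME="value".
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def iter_documents(sgml_text):
    """Yield (attributes, inner_sgml) for each document in one file's text."""
    for match in DOC_RE.finditer(sgml_text):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        yield attrs, match.group(2)


def expected_total(full_files=21, docs_per_file=1000, last_file_docs=578):
    """Document count implied by the file layout described in the text."""
    return full_files * docs_per_file + last_file_docs


# Hypothetical two-document sample, not taken from the real files.
SAMPLE = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
<TOPICS><D>gold</D></TOPICS>
<TEXT><TITLE>Sample title</TITLE><BODY>Sample body.</BODY></TEXT>
</REUTERS>
<REUTERS TOPICS="NO" LEWISSPLIT="TEST" NEWID="2">
<TOPICS></TOPICS>
<TEXT><BODY>Another body.</BODY></TEXT>
</REUTERS>"""

docs = list(iter_documents(SAMPLE))
```

When reading the real files, a permissive encoding such as `latin-1` may be needed, since the text is not guaranteed to be valid UTF-8. A regular expression is used here only because the element structure is flat; a proper SGML parser would be more robust.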

The categories in this dataset come from five classes:

Exchanges: financial exchanges, e.g., "nasdaq"
Organizations: named entities of organizations, e.g., "GE"
People: named entities of people, e.g., "Paul Volcker"
Places: named entities of places, e.g., "Australia"
Topics: economic subject categories, e.g., "coconut", "gold", "money supply"
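The five category sets listed above can be read out of a document with a short sketch like the one below. It assumes that each set is stored in an element named `EXCHANGES`, `ORGS`, `PEOPLE`, `PLACES`, or `TOPICS`, and that each label sits in its own `<D>...</D>` element; the function name `extract_categories` and the sample document are hypothetical.

```python
import re

# Assumed element names for the five category sets.
CATEGORY_TAGS = ("EXCHANGES", "ORGS", "PEOPLE", "PLACES", "TOPICS")
# Assumption: each individual label is wrapped in <D>...</D>.
LABEL_RE = re.compile(r"<D>(.*?)</D>", re.DOTALL)


def extract_categories(doc_sgml):
    """Return a dict mapping each category set to its list of labels."""
    result = {}
    for tag in CATEGORY_TAGS:
        block = re.search(rf"<{tag}>(.*?)</{tag}>", doc_sgml, re.DOTALL)
        # A missing or empty element yields an empty label list.
        result[tag] = LABEL_RE.findall(block.group(1)) if block else []
    return result


# Hypothetical document fragment, not copied from the dataset.
sample_doc = (
    "<TOPICS><D>gold</D><D>money-supply</D></TOPICS>"
    "<PLACES><D>australia</D></PLACES>"
    "<PEOPLE></PEOPLE>"
    "<ORGS></ORGS>"
    "<EXCHANGES><D>nasdaq</D></EXCHANGES>"
)

labels = extract_categories(sample_doc)
```

A document may carry several labels from the same set, or none at all, so each set maps to a list rather than a single value.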

Some more information on the distribution:

               Number of     Number of Categories   Number of Categories
Category Set   Categories    w/ 1+ Occurrences      w/ 20+ Occurrences
EXCHANGES       39            32                      7
ORGS            56            32                      9
PEOPLE         267           114                     15
PLACES         175           147                     60
TOPICS         135           120                     57

It is recommended to use the pre-specified training-test splits, i.e., either "ModLewis" split or "ModApte" split.
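The sketch below shows one way the two recommended splits could be assigned from a document's attributes. The rules are an assumption based on the conventions usually described for the distribution's README: each document carries a `LEWISSPLIT` attribute (`TRAIN`, `TEST`, or `NOT-USED`) and a `TOPICS` attribute (`YES`, `NO`, or `BYPASS`). They should be checked against the README before use. The function names are hypothetical.

```python
def modapte_split(attrs):
    """Assumed ModApte rule: only documents with TOPICS="YES" are used."""
    lewis = attrs.get("LEWISSPLIT")
    if attrs.get("TOPICS") == "YES" and lewis == "TRAIN":
        return "train"
    if attrs.get("TOPICS") == "YES" and lewis == "TEST":
        return "test"
    return "unused"


def modlewis_split(attrs):
    """Assumed ModLewis rule: TOPICS may be "YES" or "NO", never "BYPASS"."""
    lewis = attrs.get("LEWISSPLIT")
    if attrs.get("TOPICS") in ("YES", "NO") and lewis == "TRAIN":
        return "train"
    if attrs.get("TOPICS") in ("YES", "NO") and lewis == "TEST":
        return "test"
    return "unused"


# Hypothetical attribute dictionaries for three documents.
doc_a = {"TOPICS": "YES", "LEWISSPLIT": "TRAIN"}
doc_b = {"TOPICS": "NO", "LEWISSPLIT": "TEST"}
doc_c = {"TOPICS": "BYPASS", "LEWISSPLIT": "TRAIN"}

assignments = [(modapte_split(d), modlewis_split(d)) for d in (doc_a, doc_b, doc_c)]
```

Under these assumed rules the two splits differ only in how documents with `TOPICS="NO"` are treated, which is why `doc_b` is unused in one split and a test document in the other.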
