Difference between revisions of "Penn Treebank"

From Cohen Courses
 
Line 14:
  == Corpora ==
  Annotated corpus include:
- * Wall Street Journal;
+ * Wall Street Journal (WSJ);
  * The Brown Corpus;
  * Switchboard;

Latest revision as of 18:09, 30 September 2011

The Penn Treebank Project produced the first large-scale treebank: a dataset that annotates natural-language text with phrase structure and part-of-speech (POS) tags.

Example

For example, the sentence "John loves Mary." is annotated as follows:

(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary)))
   (. .))
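
The bracketed notation above is a nested list of labelled constituents, so it can be read with a small recursive parser. The following is a minimal sketch in plain Python, not the project's own tooling: the function names `parse_tree` and `tagged_words` are hypothetical, and it assumes a single well-formed tree in which parentheses appear only as brackets. The verb is written with the standard Penn Treebank tag VBZ.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Children are either nested tuples (constituents) or plain strings (words).
    """
    # Pad brackets with spaces so that a simple split() yields the tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1                      # consume "("
        label = tokens[pos]           # constituent label or POS tag
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])   # a word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1                      # consume ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def tagged_words(tree):
    """Return the (word, POS tag) pairs of a parsed tree, left to right."""
    label, children = tree
    # A preterminal node has exactly one child, and that child is a word.
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


EXAMPLE = "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))"
tree = parse_tree(EXAMPLE)
print(tagged_words(tree))
# prints [('John', 'NNP'), ('loves', 'VBZ'), ('Mary', 'NNP'), ('.', '.')]
```

Reading the leaves this way recovers the POS-tagged sentence, while the nested tuples keep the phrase structure (S, NP, VP) for anything that needs the full parse.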

POS tags

Format

Corpora

Annotated corpora include:

  • Wall Street Journal (WSJ);
  • The Brown Corpus;
  • Switchboard;
  • ATIS

Relevant Papers