Some possible datasets

In general, a nice way to find already-made datasets is to read papers in the literature and see what they use and reference. A few ideas to get you started:

The CoNLL Shared Tasks (scroll down the page) have been running since 1999 and provide freely available annotated data for phrase chunking, named entity tagging, dependency parsing, semantic role labeling, and more.
Templated information extraction. CMU Seminar and Acquisitions datasets (Freitag). Also various MUC competition datasets (see e.g. old-school Hobbs 1995, or new-school Chambers and Jurafsky 2011).
Named entity recognition. E.g. the CoNLL 2002 shared task.
Segmentation. E.g. classified ad segmentation data (Grenager), the CoNLL chunking tasks, or discourse-topic segmentation for lecture transcripts (Eisenstein). Others?
Part-of-speech tagging. E.g. CMU ARK Twitter annotated POS dataset, or Brown corpus, any Penn Treebank or other treebank, many (or all?) of the CoNLL datasets, etc.
Semantic role labeling. CoNLL shared task in 2004, 2005.
Parsing.
- Dependency parsing: CoNLL shared tasks in 2006 and 2007.
- Phrase structure parsing: an annotated dataset is called a "treebank". There are many treebanks, but the Penn Treebank is most famous. It is often used for part-of-speech tagging experiments too.
- Noah requests: do not do phrase structure parsing on the English Wall Street Journal Penn Treebank; it's been overdone. (Chinese or anything else is fine, and English dependency parsing is fine.)
Coreference. For within-document coreference, the MUC and ACE datasets are standard. For cross-document coreference, the "John Smith" dataset is linked here.
BioNLP: The Genia corpus has tons of syntactic and relational annotations, and an ontology too.
Logical-form semantic parsing: GeoQuery dataset (Mooney) has been around for a while, and has had some interesting recent work. There's also one called the "jobs dataset" used in Liang 2011 and other papers.
Frame-semantic parsing: related to semantic role labeling, often based on FrameNet or VerbNet. There is a small amount of FrameNet data (see Das et al 2010 for pointers).
Sentiment. Please remember document classification (a common form of sentiment analysis) does not count as structured prediction. However, fine-grained sentence-level or phrase-level annotation does count. E.g. the MPQA opinion corpus, or Congressional floor debate transcripts.
Discourse. The Congressional floor debates dataset (conversational information to predict). Discourse segmentation: lecture transcripts (above). Also, Penn Discourse Treebank.
In general, the LDC has many linguistically annotated datasets. CMU (through the LTI) has a subscription to obtain many of them.
Some more places to check: Noah's book, the Jurafsky & Martin textbook, the Manning & Schuetze textbook, or the big lists on the ACL Wiki, for example the "English Corpora" listing.
Many, many more possibilities!
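
Many of the CoNLL shared task datasets listed above come as plain text with one token per line, whitespace-separated columns, and a blank line between sentences. As a rough starting point, here is a minimal Python sketch of a reader for that kind of file. The function name `read_conll_sentences` and the sample data are made up for illustration, and the exact column layout differs by task and year (some files also contain extra marker or comment lines, which this sketch does not handle), so check the README that ships with whichever dataset you pick.

```python
from typing import Iterable, Iterator, List


def read_conll_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield one sentence at a time as a list of token rows.

    Each token row is the list of whitespace-separated column values
    found on that line. Blank lines are treated as sentence boundaries.
    """
    sentence: List[List[str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: close the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split())
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence


# Made-up sample in a word / POS / chunk-tag layout.
# This is NOT taken from any real corpus file.
SAMPLE = """\
The DT B-NP
cat NN I-NP
sleeps VBZ B-VP

Dogs NNS B-NP
bark VBP B-VP
"""

sentences = list(read_conll_sentences(SAMPLE.splitlines()))
for sent in sentences:
    words = [row[0] for row in sent]
    labels = [row[-1] for row in sent]  # assumes the label is the last column
    print(list(zip(words, labels)))
```

For a real file, you could pass an open file handle instead of `SAMPLE.splitlines()`, since the function only needs an iterable of lines. Keeping each row as a list of strings (rather than a fixed tuple) is a deliberate choice here so the same reader works whether a dataset has three columns or ten.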

Please use the space below to post your ideas and/or find potential project partners.

Project Brainstorming for 10-710 in Fall 2011
