(12 intermediate revisions by 2 users not shown) |
Line 1: |
− | = Some possible datasets =
| + | moved to bottom of homepage [[Structured_Prediction_10-710_in_Fall_2011]] |
− | | |
− | In general, a nice way to find already-made datasets is to read papers in the literature and see what they use and reference. A few ideas to get you started:
| |
− | | |
− | * The [http://ifarm.nl/signll/conll/ CoNLL Shared Tasks (scroll down the page)] have been running since 1999 and provide nice freely available annotated data for phrase chunking, named entity tagging, dependency parsing, semantic role labeling and more.
| |
− | * Templated information extraction. [http://nlp.shef.ac.uk/dot.kom/resources.html CMU Seminar and Acquisitions datasets (Freitag)]. Also various MUC competition datasets (see e.g. old-school [http://www.isi.edu/~hobbs/fastus-schabes-jul95.pdf Hobbs 1995], or new-school [http://www-cs.stanford.edu/people/nc/pubs/acl2011-chambers-templates.pdf Chambers and Jurafsky 2011]).
| |
− | * Named entity recognition. E.g. [http://www.cnts.ua.ac.be/conll2002/ner/ CoNLL shared task 2002].
| |
− | * Segmentation. E.g. [http://nlp.stanford.edu/~grenager/data/unsupie.tgz Classified ad segmentation data] [http://nlp.stanford.edu/~grenager (Grenager)]. Or the CoNLL chunking tasks. Discourse-topic segmentation for [http://people.csail.mit.edu/jacobe/software.html lecture transcripts (Eisenstein)]. Others?
| |
− | * Part-of-speech tagging. E.g. [http://www.ark.cs.cmu.edu/TweetNLP/ CMU ARK Twitter annotated POS dataset], or [http://www.archive.org/details/BrownCorpus Brown corpus], any Penn Treebank or other treebank, many (or all?) of the CoNLL datasets, etc.
| |
− | * Semantic role labeling. CoNLL shared task in [http://www.lsi.upc.edu/~srlconll/ 2004, 2005].
| |
− | * Parsing.
| |
− | ** Dependency parsing: CoNLL shared tasks in [http://ilk.uvt.nl/conll/ 2006] and 2007.
| |
− | ** Phrase structure parsing: an annotated dataset is called a "treebank". There are many treebanks, but the Penn Treebank is the most famous. It is often used for part-of-speech tagging experiments too.
| |
− | ** Noah requests: do not do phrase structure parsing on the English Wall Street Journal Penn Treebank; it's been overdone. (Chinese or anything else is fine, or English dependency parsing is fine.)
| |
− | * Coreference. For within-document coreference, the MUC and ACE datasets are standard. For cross-document coreference, the "John Smith" dataset is [http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html linked here].
| |
− | * BioNLP: The [http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/home/wiki.cgi?page=GENIA+corpus Genia corpus] has tons of syntactic and relational annotations, and an ontology too.
| |
− | * Logical-form semantic parsing: [http://www.cs.utexas.edu/users/ml/nldata/geoquery.html GeoQuery dataset (Mooney)] has been around for a while, and has had some interesting recent work. There's also one called the "jobs dataset" used in Liang 2011 and other papers.
| |
− | * Frame-semantic parsing: related to semantic role labeling, often related to FrameNet or VerbNet. There is a small amount of FrameNet data (see [http://www.cs.cmu.edu/~nschneid/dscs.pdf Das et al. 2010] for pointers).
| |
− | * Sentiment. Please remember that document classification (a common form of sentiment analysis) does not count as structured prediction. However, fine-grained sentence-level or phrase-level annotation does count. E.g. the [http://www.cs.pitt.edu/mpqa/databaserelease/ MPQA opinion corpus], or [http://www.cs.cornell.edu/home/llee/data/convote.html Congressional floor debate transcripts] (which also have conversational information to predict).
| |
− | * In general, the [http://www.ldc.upenn.edu/ LDC] has many linguistically annotated datasets. CMU (through the LTI) has a subscription to obtain many of them.
| |
− | * Some more places to check: Noah's book, the Jurafsky & Martin textbook, the Manning & Schuetze textbook, or the [http://aclweb.org/aclwiki/index.php?title=List_of_resources_by_language big lists on the ACL Wiki], for example the "English Corpora" listing.
| |
− | * Many, many more possibilities!
| |
− | | |
− | | |
− | Please use the space below to post your ideas and/or find potential project partners.
| |
− | | |
− | = Brainstorming ideas =
| |