Difference between revisions of "Project Brainstorming for 10-710 in Fall 2011"

From Cohen Courses
 
(4 intermediate revisions by 2 users not shown)
= Some possible datasets =
+
moved to bottom of homepage [[Structured_Prediction_10-710_in_Fall_2011]]
 
 
In general, a nice way to find already-made datasets is to read papers in the literature and see what they use and reference.  A few ideas to get you started:
 
 
 
* The [http://ifarm.nl/signll/conll/ CoNLL Shared Tasks (scroll down the page)] have been running since 1999 and provide nice freely available annotated data.  Phrase chunking, named entity tagging, dependency parsing, semantic role labeling and more.
 
* Templated information extraction.  [http://nlp.shef.ac.uk/dot.kom/resources.html CMU Seminar and Acquisitions datasets (Freitag)].  Also various MUC competition datasets (see e.g. old-school [http://www.isi.edu/~hobbs/fastus-schabes-jul95.pdf Hobbs 1995], or new-school [http://www-cs.stanford.edu/people/nc/pubs/acl2011-chambers-templates.pdf Chambers and Jurafsky 2011]).
 
* Named entity recognition.  E.g. [http://www.cnts.ua.ac.be/conll2002/ner/ CoNLL shared task 2002].
 
* Segmentation. E.g. [http://nlp.stanford.edu/~grenager/data/unsupie.tgz Classified ad segmentation data] [http://nlp.stanford.edu/~grenager (Grenager)].  Or the CoNLL chunking tasks.  Discourse-topic segmentation for [http://people.csail.mit.edu/jacobe/software.html lecture transcripts (Eisenstein)].  Others?
 
* Part-of-speech tagging.  E.g. [http://www.ark.cs.cmu.edu/TweetNLP/ CMU ARK Twitter annotated POS dataset], or [http://www.archive.org/details/BrownCorpus Brown corpus], any Penn Treebank or other treebank, many (or all?) of the CoNLL datasets, etc.
 
* Semantic role labeling.  CoNLL shared task in [http://www.lsi.upc.edu/~srlconll/ 2004, 2005].
 
* Parsing. 
 
** Dependency parsing: CoNLL shared tasks in [http://ilk.uvt.nl/conll/ 2006] and 2007.
 
** Phrase structure parsing: an annotated dataset is called a "treebank".  There are many treebanks, but the Penn Treebank is the most famous.  It is often used for part-of-speech tagging experiments too.
 
** Noah requests: do not do phrase structure parsing on the English Wall Street Journal Penn Treebank; it has been overdone.  (Chinese or anything else is fine, and English dependency parsing is fine.)
 
* Coreference.  For within-document coreference, the MUC and ACE datasets are standard.  For cross-document coreference, the "John Smith" dataset is [http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html linked here].
 
* BioNLP: The [http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/home/wiki.cgi?page=GENIA+corpus Genia corpus] has tons of syntactic and relational annotations, and an ontology too.
 
* Logical-form semantic parsing: [http://www.cs.utexas.edu/users/ml/nldata/geoquery.html GeoQuery dataset (Mooney)] has been around for a while, and has had some interesting recent work.  There's also one called the "jobs dataset" used in Liang 2011 and other papers.
 
* Frame-semantic parsing: related to semantic role labeling, often related to FrameNet or VerbNet.  There is a small amount of FrameNet data (see [http://www.cs.cmu.edu/~nschneid/dscs.pdf Das et al 2010] for pointers).
 
* Sentiment.  Please remember that document classification (a common form of sentiment analysis) does not count as structured prediction.  However, fine-grained sentence-level or phrase-level annotation does count.  E.g. the [http://www.cs.pitt.edu/mpqa/databaserelease/ MPQA opinion corpus], or [http://www.cs.cornell.edu/home/llee/data/convote.html Congressional floor debate transcripts].
 
* Discourse.  The Congressional floor debates dataset (above) has conversational information to predict.  Discourse segmentation: lecture transcripts (above).  Also, the [http://www.seas.upenn.edu/~pdtb/ Penn Discourse Treebank].
 
* In general, the [http://www.ldc.upenn.edu/ LDC] has many linguistically annotated datasets.  CMU (through the LTI) has a subscription to obtain many of them.
 
* Some more places to check:
 
** On this wiki, besides the syllabus, also the [[Paper]] and [[Dataset]] pages.
 
** Noah's book, or the Jurafsky & Martin textbook, or even the old Manning & Schuetze textbook.  Also there are dataset [http://aclweb.org/aclwiki/index.php?title=List_of_resources_by_language lists on the ACL Wiki], for example the "English Corpora" listing (though not all of these are datasets appropriate for course projects, of course).
 
* Many, many more possibilities!
 
 
 
 
 
Please use the space below to post your ideas and/or find potential project partners.
 
 
 
= Brainstorming ideas =
 
* Extracting Facts from Wikipedia Text to generate a Fact relation hierarchy - [[User:Akoul|Anirudh Koul]]
 

Latest revision as of 14:59, 8 September 2011

moved to bottom of homepage Structured_Prediction_10-710_in_Fall_2011