Difference between revisions of "Structured Prediction 10-710 in Fall 2011"

From Cohen Courses
(Created page with '== Instructor and Venue == * Instructors: [http://www.cs.cmu.edu/~wcohen William Cohen] and [http://www.cs.cmu.edu/~nasmith Noah Smith], Machine Learning Dept and LTI * Course s…')
 
 
(133 intermediate revisions by 33 users not shown)
Line 6: Line 6:
 
* Course Number: ML 10-710 and LTI 11-763
 
 
* Prerequisites: a machine learning course (e.g., 10-701 or 10-601) or consent of the instructor.
 
* TA: Brendan O'Connor
+
* TA: [http://brenocon.com/ Brendan O'Connor]
 
* Syllabus: [[Syllabus for Structured Prediction 10-710 in Fall 2011]]
 
* Office hours: TBA
+
* Office hours:
 +
** Noah, GHC 5723, Thursdays 4:30-5:30 (starting 9/8)
 +
** Brendan, GHC 8005, Tuesdays 4:30-5:30
 +
** William, GHC 8217, Fridays 11:00-12:00 (starting 9/16)
  
 
== Description ==
 
  
to write
+
This course seeks to cover statistical modeling techniques for discrete, structured data such as text.  It brings together content previously covered in Language and Statistics 2 (11-762) and Information Extraction (10-707 and 11-748), and aims to define a canonical set of models and techniques applicable to problems in natural language processing, information extraction, and other application areas.  Upon completion, students will have a broad understanding of machine learning techniques for structured outputs, will be able to develop appropriate algorithms for use in new research, and will be able to critically read related literature.  The course is organized around methods, with example tasks introduced throughout.
 +
 
 +
The prerequisite is Machine Learning (10-601 or 10-701), or permission of the instructors.
  
 
== Syllabus ==
 
Line 19: Line 24:
  
 
Older syllabi:
 
 
+
* [http://www.cs.cmu.edu/~nasmith/LS2/ Course page for Language and Stats 2], one of the "parent" courses of Structured Prediction:
* [[Syllabus for Structured Prediction 10-707 in Fall 2010]].
+
* Older syllabi for Information Extraction, another of the "parent" courses of Structured Prediction:
* [[Syllabus for Information Extraction 10-707 in Fall 2009]]
+
** [[Syllabus for Information Extraction 10-707 in Fall 2010|Fall 2010]], [[Syllabus for Information Extraction 10-707 in Fall 2009|Fall 2009]], and for historical interest, [http://wcohen.com/10-707/index-2007.html 10-707 Spring 2007], [http://wcohen.com/10-707/index-2004.html 10-707 Spring 2004].
* [http://wcohen.com/10-707/index-2007.html Syllabus for Information Extraction 10-707 in Spring 2007] - for historical interest.
 
* [http://wcohen.com/10-707/index-2004.html Syllabus for Information Extraction 10-707 in Spring 2004] - even more historical and less interesting.
 
  
 
== Readings ==
 
Line 33: Line 36:
 
Grades are based on
 
 
* The class project
 
* The paper presentation
+
** Choose teams and a general project topic.  (This can change in the coming weeks/month.)  Create a team wiki page, and add its members and the project topic.  Every team member should then link to it from their own user homepage.
* Contributions to the wiki
+
** Final reports should be in the [http://www.icml-2011.org/format.php ICML 2011 format].  Aim for 6-10 pages including citations.  Please be concise; we do not encourage you to write a report that is longer than necessary.
 +
* [[Wiki writeup assignments for 10-710 in Fall 2011|Wiki writeup assignments]]
 
* Class participation
 
 +
 +
== Attendees ==
 +
 +
People taking this class in Fall 2011 include:
 +
* [[User:Dmovshov|Dana Movshovitz-Attias]]
 +
* [[User:Ysim|Yanchuan Sim (yc)]]
 +
* [[User:Asaluja|Avneesh Saluja]]
 +
* [[User:Junyangn|Junyang Ng]]
 +
* [[User:Lingwang|Wang Ling]]
 +
* [[User:Mg1|Matt Gardner]]
 +
* [[User:Aanavas|Tony Navas]]
 +
* [[User:yunwang|Yun Wang (Maigo)]]
 +
* [[User:amr1|Andrew Rodriguez]]
 +
* [[User:cheuktol|Cheuk To Law (Kelvin)]]
 +
* [[User:manajs|Manaj Srivastava]]
 +
* [[User:Fkeith|Francis Keith]]
 +
* [[User:Dkulkarn|Dhananjay Kulkarni]]
 +
* [[User:Yww|William Yang Wang]]
 +
* [[User:Emayfiel|Elijah Mayfield]]
 +
* [[User:Taruns|Tarun Sharma]]
 +
* [[User:Mridulg|Mridul Gupta]]
 +
* [[User:Xiaoqiy|Xiaoqi Yin (Philip)]]
 +
* [[User:Daegunw|Daegun Won]]
 +
* [[User:Ruipedrocorreia|Rui Correia]]
 +
* [[User:Wpang|Wangshu Pang (Wash)]]
 +
* [[User:Howarth| Dan Howarth]]
 +
* [[User:Dwijaya| Derry Wijaya]]
 +
* [[User:Jmflanig| Jeff Flanigan]]
 +
* [[User:Tkumar| Tarun Kumar]]
 +
 +
* [[User:akgoyal| Anuj Goyal]]
 +
* [[User:Avinava.dubey| Avinava Dubey]]
 +
* [[User:Dyogatam| Dani Yogatama]]
 +
 +
Here are sample pages for [[User:Wcohen|William]], [[User:Nasmith|Noah]], and [[User:Brendan|Brendan]].
 +
 +
== Projects ==
 +
 +
== Final presentation dates ==
 +
 +
Tues 12/6
 +
* 3:05 Word Alignments using an HMM-based model - Wang Ling and Rui Correia
 +
* 3:17 Training SMT Systems with the Latent Structured SVM - Avneesh Saluja and Jeff Flanigan
 +
* 3:29 Semi-supervised Generation of Wikipedia Infoboxes - Wangshu Pang, Yun Wang and Matt Gardner
 +
* 3:41 Relevant Information Extraction from Court-room Hearings To Predict Judgement - Manaj Srivastava, Mridul Gupta
 +
* 3:53 Stylistic Structure Extraction from Early United States Slave-related Legal Opinions - William Y. Wang and Elijah Mayfield
 +
* 4:05 Restaurant Recommendations Based On Review Content (updated!) - Junyang Ng, Yan Chuan Sim, Kelvin Law
 +
 +
Thurs 12/8
 +
* 3:05 Automated Template Extraction - Francis Keith, Andrew Rodriguez
 +
* 3:17 Learning Indian Classical Music Using Sequential Models - Dhananjay Kulkarni, Tarun Kumar
 +
* 3:29 Finding out who you are from where, when, what and with whom you tweet - Derry Wijaya, Tarun Sharma
 +
* 3:41 Wikipedia Infobox Generator Using Cross Lingual Unstructured Text - Daegun Won and Tony Navas
 +
* 3:53 Identifying Abbreviations in Biomedical Text - Dana Movshovitz-Attias
 +
 +
 +
== Project list ==
 +
 +
(should get comments from Brendan:)
 +
* [[Automated Template Extraction]] - [[User:Fkeith|Francis Keith]], [[User:amr1|Andrew Rodriguez]]
 +
* [[Project:Tweet | Finding out who you are from where, when, what and with whom you tweet]] - [[User:Dwijaya|Derry Wijaya]], [[User:taruns|Tarun Sharma]]
 +
* [[Information_Extraction_to_Predict_Judgement|Relevant Information Extraction from Court-room Hearings To Predict Judgement]] - [[User:manajs|Manaj Srivastava]], [[User:mridulg|Mridul Gupta]]
 +
 +
(should get comments from Noah:)
 +
* [[Stylistic Structure in Historic Legal Text|Stylistic Structure Extraction from Early United States Slave-related Legal Opinions]] - [[User:Yww|William Y. Wang]] and [[User:Emayfiel|Elijah Mayfield]]
 +
* [[Word Alignments using an HMM-based model]] - [[User:Lingwang|Wang Ling]] and [[User:Ruipedrocorreia|Rui Correia]]
 +
* [[Training SMT Systems with the Latent Structured SVM]] - [[User:Asaluja|Avneesh Saluja]] and [[User:Jmflanig| Jeff Flanigan]]
 +
* [[Wikipedia Infobox Generator Using Cross Lingual Unstructured Text]] - [[User:Daegunw|Daegun Won]] and [[User:Aanavas|Tony Navas]]
 +
 +
(should get comments from William:)
 +
* [[Semi-supervised Generation of Wikipedia Infoboxes]] - [[User:wpang|Wangshu Pang]], [[User:Yunwang|Yun Wang]] and [[User:Mg1|Matt Gardner]]
 +
* [[Restaurant Recommendations Based On Review Content]]  (updated!) - [[User:Junyangn|Junyang Ng]], [[User:Ysim| Yan Chuan Sim]], [[User:Cheuktol|Kelvin Law]]
 +
* [[Project:Dmovshov_abbreviations | Identifying Abbreviations in Biomedical Text]] - [[User:Dmovshov|Dana Movshovitz-Attias]]
 +
* [[Project:Learning_Indian_Classical_Using_Sequential_Models| Learning Indian Classical Music Using Sequential Models]] - [[User:dkulkarn|Dhananjay Kulkarni]], [[User:tkumar|Tarun Kumar]]
 +
 +
(older ideas:)
 +
* [[Improving SMT word alignment with binary feedback]] - [[User:Asaluja|Avneesh Saluja]] and [[User:Jmflanig| Jeff Flanigan]]
 +
* [[Linearizing Dependency Trees]] - [[User:Jmflanig| Jeff Flanigan]]
 +
* [[Mapping entity names in a document to places on a map]].
 +
* Automatically generating headings for sections (groups of contiguous paragraphs) in unstructured text
 +
 +
In general, a nice way to find already-made datasets is to read papers in the literature and see what they use and reference.  A few data ideas: [[Project Brainstorming for 10-710 in Fall 2011/Some data ideas]]

Latest revision as of 14:11, 30 November 2011

Instructor and Venue

  • Instructors: William Cohen and Noah Smith, Machine Learning Dept and LTI
  • Course secretary: Sharon Cavlovich, sharonw+@cs.cmu.edu, 412-268-5196
  • When/where: Tues-Thursday 3:00-4:20 in Gates-Hillman 4211
  • Course Number: ML 10-710 and LTI 11-763
  • Prerequisites: a machine learning course (e.g., 10-701 or 10-601) or consent of the instructor.
  • TA: Brendan O'Connor
  • Syllabus: Syllabus for Structured Prediction 10-710 in Fall 2011
  • Office hours:
    • Noah, GHC 5723, Thursdays 4:30-5:30 (starting 9/8)
    • Brendan, GHC 8005, Tuesdays 4:30-5:30
    • William, GHC 8217, Fridays 11:00-12:00 (starting 9/16)

Description

This course seeks to cover statistical modeling techniques for discrete, structured data such as text. It brings together content previously covered in Language and Statistics 2 (11-762) and Information Extraction (10-707 and 11-748), and aims to define a canonical set of models and techniques applicable to problems in natural language processing, information extraction, and other application areas. Upon completion, students will have a broad understanding of machine learning techniques for structured outputs, will be able to develop appropriate algorithms for use in new research, and will be able to critically read related literature. The course is organized around methods, with example tasks introduced throughout.

The prerequisite is Machine Learning (10-601 or 10-701), or permission of the instructors.

Syllabus

Older syllabi:

Readings

Unless there's an announcement to the contrary, required readings should be done before the class.

Grading

Grades are based on

  • The class project
    • Choose teams and a general project topic. (This can change in the coming weeks/month.) Create a team wiki page, and add its members and the project topic. Every team member should then link to it from their own user homepage.
    • Final reports should be in the ICML 2011 format. Aim for 6-10 pages including citations. Please be concise; we do not encourage you to write a report that is longer than necessary.
  • Wiki writeup assignments
  • Class participation

Attendees

People taking this class in Fall 2011 include:

Here are sample pages for William, Noah, and Brendan.

Projects

Final presentation dates

Tues 12/6

  • 3:05 Word Alignments using an HMM-based model - Wang Ling and Rui Correia
  • 3:17 Training SMT Systems with the Latent Structured SVM - Avneesh Saluja and Jeff Flanigan
  • 3:29 Semi-supervised Generation of Wikipedia Infoboxes - Wangshu Pang, Yun Wang and Matt Gardner
  • 3:41 Relevant Information Extraction from Court-room Hearings To Predict Judgement - Manaj Srivastava, Mridul Gupta
  • 3:53 Stylistic Structure Extraction from Early United States Slave-related Legal Opinions - William Y. Wang and Elijah Mayfield
  • 4:05 Restaurant Recommendations Based On Review Content (updated!) - Junyang Ng, Yan Chuan Sim, Kelvin Law

Thurs 12/8

  • 3:05 Automated Template Extraction - Francis Keith, Andrew Rodriguez
  • 3:17 Learning Indian Classical Music Using Sequential Models - Dhananjay Kulkarni, Tarun Kumar
  • 3:29 Finding out who you are from where, when, what and with whom you tweet - Derry Wijaya, Tarun Sharma
  • 3:41 Wikipedia Infobox Generator Using Cross Lingual Unstructured Text - Daegun Won and Tony Navas
  • 3:53 Identifying Abbreviations in Biomedical Text - Dana Movshovitz-Attias


Project list

(should get comments from Brendan:)

(should get comments from Noah:)

(should get comments from William:)

(older ideas:)

In general, a nice way to find already-made datasets is to read papers in the literature and see what they use and reference. A few data ideas: Project Brainstorming for 10-710 in Fall 2011/Some data ideas