Information Extraction 10-707 in Fall 2010
Instructor and Venue
- Instructor: William Cohen, Machine Learning Dept and LTI
- Course secretary: Sharon Cavlovich, sharonw+@cs.cmu.edu, 412-268-5196
- When/where: Mon/Wed 1:30-2:50, Gates 4101.
- Course Number: MLD 10-707, cross-listed in LTI as 11-748
- Prerequisites: a machine learning course (e.g., 10-701 or 10-601) or consent of the instructor.
- TA: there is no TA for this course
- Syllabus: Syllabus for Information Extraction 10-707 in Fall 2010
- Office hours: TBA
Description
Information extraction is finding names of entities in unstructured or partially structured text, and determining the relationships that hold between these entities. More succinctly, information extraction is the problem of deriving structured factual information from text.
This course considers the problem of information extraction from a machine-learning perspective. We will survey a variety of learning methods that have been used for information extraction, including rule-learning, boosting, and sequential classification methods such as hidden Markov models, conditional random fields, and structured support vector machines. We will also look at experimental results from a number of specific information extraction domains, such as biomedical text, and discuss semi-supervised "bootstrapping" learning methods for information extraction.
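As a rough illustration of how sequential classification leads to structured output, the sketch below assumes some tagger (for example an HMM or CRF) has already assigned a BIO tag to each token, and shows only the final step of collapsing those tags into typed entity mentions. The BIO tag scheme, the function name `bio_to_entities`, and the example sentence are assumptions made for illustration and are not taken from the course materials.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse per-token BIO tags into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Flush an entity that runs to the end of the sentence.
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical tagger output for a toy sentence.
tokens = ["Marie", "Curie", "worked", "in", "Paris"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(bio_to_entities(tokens, tags))
# → [('Marie Curie', 'PER'), ('Paris', 'LOC')]
```

The learning methods surveyed in the course address the harder part that this sketch takes as given, namely predicting the tag sequence itself.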
Readings will be based on research papers. Grades will be based on class participation, paper presentations, and a project. More specifically, students will be expected to:
- Prepare summaries of the papers discussed in class. Summaries will be posted on this wiki.
- Present and summarize one or more "optional" papers from the syllabus (or some other mutually agreeable paper) to the class.
- Do a course project in a group of 2-3 people. Typical course projects might be: systematically comparing two or more existing extraction or integration methods on an existing dataset; exploring a new extraction or integration application, by collecting a dataset and evaluating an existing method; or rigorous formal analysis of a course-related topic. The end result of the project will be a written report, with format and length appropriate for a conference publication. Here are some sample projects from Spring 2007:
- Combining n-gram based statistics with traditional methods for named entity recognition (Fette)
- Discriminative Online Algorithms for Sequence Labeling - A Comparative Study (Cohen, Gimpel)
- Using Information Extraction in Adaptive Filtering Relevance Feedback (Elsas, Lad and Rao)
- Tree Conditional Random Fields for Japanese Semantic Role Labeling (Shilpa, Lin, Hideki, Mengqiu)
Syllabus
I plan for the Fall 2010 course to spend about a third of the time on techniques for structured learning, a third on semi-supervised/bootstrapping methods, and the remainder on a wider variety of machine-learning methods that have been applied to information extraction.
Older syllabi:
- Syllabus for Information Extraction 10-707 in Fall 2009. The 2010 course will parallel this, with some updates.
- Syllabus for Information Extraction 10-707 in Spring 2007 - for historical interest.
- Syllabus for Information Extraction 10-707 in Spring 2004 - even more historical and less interesting.
Bibliography
Grading
Grades are based on:
- The class project (50% - including the presentation and the writeup).
- The paper presentation (20%).
- The paper summaries submitted throughout the course (20%).
- Class participation (10%).