Project Abstract - Rushin, Kevin, Bo

From Cohen Courses
Revision as of 23:38, 28 September 2010

Information Extraction (11-748/10-707) Project Proposal

Team Members

Bo Lin [bolin@cs.cmu.edu]

Kevin Dela Rosa [kdelaros@cs.cmu.edu]

Rushin Shah [rnshah@cs.cmu.edu]

Summary

We propose to implement attribute and relation extraction modules that take chains produced by a within-document co-reference (WDC) system as input and output attributes for each chain and relationship labels for pairs of chains. We propose to use these modules to augment a cross-document co-reference (CDC) system and a cross-document visualization tool that we have built as part of our prior research.

Background

As part of our research, we have built a cross-document co-reference (CDC) system that has the following high-level architecture:

• We first run a WDC system on a document and extract chains, with each chain corresponding to a real-world entity. Each chain may contain one or more of the following three types of mentions: Named (e.g. Barack Obama), Nominal (e.g. the President), or Pronominal (e.g. he). We also include in a chain all the sentences from which each of its mentions is derived.

• We then define a variety of features over pairs of such chains. These include all word TF-IDF similarity, proper noun TF-IDF similarity, proper noun Soft TF-IDF similarity, Soft TF-IDF similarity between the names (representative named mentions) of each chain, the semantic similarity between the descriptions (representative common nouns or noun phrases) of each chain, etc.

• Using these features, we train an SVM that classifies pairs of chains as being co-referent or not.

• We take the outputs of this classifier and cluster all the chains that we have gathered from all the documents in the corpus.

• We store a persistent database of entities using this clustering, whereby each cluster represents a real-world entity. In other words, an entity is a list of chains in our database.
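To make the first of the pairwise features above concrete, here is a minimal sketch of TF-IDF cosine similarity between the token lists of two chains. The pure-Python representation, tokenization, and the smoothed IDF term are simplifying assumptions for illustration, not our actual implementation:

```python
import math
from collections import Counter

def tfidf_vectors(chains):
    """Build sparse TF-IDF vectors (dicts) for a small list of
    token lists, one per co-reference chain."""
    n = len(chains)
    df = Counter()                      # document frequency per term
    for tokens in chains:
        df.update(set(tokens))
    vecs = []
    for tokens in chains:
        tf = Counter(tokens)
        # Smoothed IDF so terms occurring in every chain keep weight.
        vecs.append({t: tf[t] * math.log(1 + n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A pairwise classifier would compute several such similarities (all words, proper nouns only, names, descriptions) for a chain pair and feed them to the SVM as one feature vector.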

Task

1. For each co-reference chain, extract as many attributes as possible from a predefined list of attributes (Gender, Nationality, Occupation, Phone, Email address).

2. For each pair of co-reference chains, extract a relationship label wherever possible.

3. Use attributes of chains as features in the CDC system’s SVM classifier.

4. Use relationship labels to augment the cross-document visualization tool.
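To make Task 1 concrete, here is a rough sketch of extractors for two of the listed attributes (Email address and Phone) over the sentences attached to one chain, plus a crude pronoun-count heuristic for Gender. The regexes and the heuristic are illustrative assumptions only, not the method we commit to:

```python
import re

# US-style formats only; real extractors would need broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")

def extract_attributes(sentences):
    """Return a dict of attributes found in a chain's sentences."""
    text = " ".join(sentences)
    attrs = {}
    emails = EMAIL_RE.findall(text)
    if emails:
        attrs["Email address"] = emails[0]
    phones = PHONE_RE.findall(text)
    if phones:
        attrs["Phone"] = phones[0]
    # Crude gender heuristic from pronominal mentions in the chain.
    he = len(re.findall(r"\bhe\b", text, re.I))
    she = len(re.findall(r"\bshe\b", text, re.I))
    if he != she:
        attrs["Gender"] = "male" if he > she else "female"
    return attrs
```

Attributes that cannot be matched by surface patterns (e.g. Occupation, Nationality) are where the learned-pattern techniques described under Techniques come in.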

Motivation

• We wish to augment our CDC system to store more information for entities than just a list of chains. It would be helpful to retain a summary of useful attribute information for each entity, such as gender, nationality, occupation, email address, phone number, etc.

• We also believe that by extracting such attributes at the chain level and using them as additional features in our SVM, we may be able to improve the performance of our CDC system.

• Our current cross-document visualization tool can only model relationships using co-occurrence statistics; we wish to have a more descriptive way of representing them.

• On a broader level, we wish to examine the upper limit of recall and precision associated with these problems, i.e. find answers to the questions:

o For how many entities does a given attribute exist in the data?

o For all such attributes, how accurately can we extract them?

Dataset

• To train and test our attribute and relation extraction modules, we plan to use one of the various ACE datasets (probably ACE 2004 or ACE 2005).

• For our CDC system, we are using the John Smith corpus, the WePS corpora, and a set of 400,000 news articles from summer 2010, produced and labeled by a commercial organization.

Techniques

• For attribute extraction, we plan to implement standard algorithms that take seed examples of entities and attributes and learn extraction patterns, as introduced by Ravichandran and Hovy, 2002, “Learning surface text patterns for a question answering system”. [http://portal.acm.org/citation.cfm?id=1073083.1073092]

• For relationship extraction, we plan to implement a method from one of the papers referenced by Sunita Sarawagi in her survey on Information Extraction. [2]

• We may use different methods if we come across better ones while surveying related literature over the course of the semester.
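To sketch the Ravichandran and Hovy style of pattern learning mentioned above: given seed (entity, attribute-value) pairs, the surface text between the two strings is harvested as a candidate pattern, which can then be applied to new entities. The placeholder format and the single-token value assumption below are simplifications for illustration:

```python
import re
from collections import Counter

def learn_patterns(seeds, sentences):
    """Induce surface patterns from sentences containing a seed
    (entity, value) pair: keep the text between the two strings,
    with the pair replaced by placeholders."""
    patterns = Counter()
    for entity, value in seeds:
        for s in sentences:
            if entity in s and value in s:
                i, j = s.index(entity), s.index(value)
                if i < j:
                    middle = s[i + len(entity):j]
                    patterns["<ENTITY>" + middle + "<VALUE>"] += 1
    return patterns

def apply_pattern(pattern, entity, sentence):
    """Use a learned pattern to extract a value for a new entity.
    Assumes the value is a single token immediately after the
    pattern's middle context."""
    middle = pattern[len("<ENTITY>"):-len("<VALUE>")]
    prefix = entity + middle
    if prefix in sentence:
        rest = sentence[sentence.index(prefix) + len(prefix):]
        m = re.match(r"[\w'-]+", rest)
        return m.group(0) if m else None
    return None
```

In the full algorithm, patterns are scored by how precisely they recover the seed values on held-out text, and only high-precision patterns are kept.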

Superpowers

We have none. But in terms of our individual backgrounds, Bo and Rushin have been working with Bob Frederking and Anatole Gershman on entity extraction and co-reference resolution [http://www.cs.cmu.edu/~encore], and Kevin has been working on question answering and computer-assisted language learning.