Project Abstract - Bo, Kevin, Rushin

From Cohen Courses

Social Media Analysis (xx-xxx) Project Proposal

Team Members

Bo Lin [bolin@cs.cmu.edu]

Kevin Dela Rosa [kdelaros@cs.cmu.edu]

Rushin Shah [rnshah@cs.cmu.edu]

Summary

We propose to implement attribute and relation extraction modules that take chains produced by a Within Document Coreference (WDC) system as input and output attributes for each chain and relationship labels for pairs of chains. We propose to use these modules to augment a Cross Document Coreference (CDC) system and a cross-document visualization tool that we have built as part of our prior research.

Background

As part of our research, we have built a cross-document co-reference (CDC) system that has the following high-level architecture:

• We first run a WDC system on a document and extract chains, with each chain corresponding to a real-world entity. Each chain may contain one or more of the following three types of mentions: Named (e.g. Barack Obama), Nominal (e.g. The President) or Pronominal (e.g. he). Each chain also includes all the sentences from which its mentions are drawn.

• We then define a variety of features over pairs of such chains. These include all-word TF-IDF similarity, proper-noun TF-IDF similarity, proper-noun Soft TF-IDF similarity, Soft TF-IDF similarity between the names (representative named mentions) of each chain, the semantic similarity between the descriptions (representative common nouns or noun phrases) of each chain, etc.

• Using these features, we train an SVM (libSVM) that classifies pairs of chains as being co-referent or not.

• We take the outputs of this classifier and cluster all the chains that we have gathered from all the documents in the corpus.

• We store a persistent database of entities using this clustering, whereby each cluster represents a real-world entity. In other words, an entity is a list of chains in our database.
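The pipeline above can be illustrated with a minimal sketch. It is not our actual system: chains are assumed to be plain token lists, only the all-word TF-IDF feature is computed, a fixed similarity threshold stands in for the trained SVM, and clusters are formed by transitive closure over the positive pairs (the clustering method is an assumption here). All function names and the threshold value are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(chains):
    """Turn each chain (a list of tokens) into a sparse TF-IDF vector."""
    n = len(chains)
    doc_freq = Counter(tok for chain in chains for tok in set(chain))
    vectors = []
    for chain in chains:
        term_freq = Counter(chain)
        # Smoothed IDF so that tokens seen in every chain keep a nonzero weight.
        vectors.append({
            tok: count * (math.log((1 + n) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coreferent_pairs(chains, threshold=0.3):
    """Stand-in for the SVM: call a pair co-referent if similarity >= threshold."""
    vectors = tfidf_vectors(chains)
    return [(i, j) for i, j in combinations(range(len(chains)), 2)
            if cosine(vectors[i], vectors[j]) >= threshold]


def cluster_chains(n_chains, pairs):
    """Group chain indices by transitive closure over pairs (union-find)."""
    parent = list(range(n_chains))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(n_chains):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy chains; in the real system these come from the WDC output.
chains = [
    ["barack", "obama", "president"],
    ["obama", "president", "white", "house"],
    ["john", "smith", "carpenter"],
]
clusters = cluster_chains(len(chains), coreferent_pairs(chains))
print(clusters)
```

Each resulting cluster is a list of chain indices, which mirrors the idea that an entity is stored as a list of chains.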

Motivation

• We wish to augment our CDC system to store more information for entities than just a list of chains. It would be helpful to retain a summary of useful attribute information for each entity, such as gender, nationality, occupation, email address, phone number, etc.

• We also believe that by extracting such attributes at the chain level and using them as additional features in our SVM, we may be able to improve the performance of our CDC system.

• Our current cross-document visualization tool is only capable of modeling relationships using co-occurrence statistics, and we wish to have a more descriptive way of representing relationships.

• On a broader level, we wish to examine the upper limit of recall and precision associated with these problems, i.e. find answers to the questions:

o For how many entities does a given attribute exist in the data?

o For all such attributes, how accurately can we extract them?

Dataset

• To train and test our attribute and relation extraction modules, we plan to use one of the various ACE datasets (probably ACE 2004 or ACE 2005).

• For our CDC system, we are using the John Smith corpus, the WePS corpora, and a set of 400,000 news articles from summer 2010, produced and labeled by a commercial organization.

Techniques

• For attribute extraction, we plan to implement standard algorithms that take seed examples of entities and attributes and learn extraction patterns, as introduced by Ravichandran and Hovy (2002), “Learning surface text patterns for a question answering system” [1].

• For relationship extraction, we plan to implement one of the methods from the papers referenced by Sunita Sarawagi in her survey on Information Extraction [2].

• We may use different methods if we come across better ones while surveying related literature over the course of the semester.
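To make the seed-based idea behind attribute extraction concrete, here is a heavily simplified sketch. It only collects the literal text found between a seed entity and its attribute value and reuses frequent infixes as patterns; the published method in [1] is more involved (for example, it scores patterns by precision), so this should be read as an illustration rather than an implementation of that paper. The function names, the `min_count` parameter, and the example sentences are all hypothetical.

```python
import re
from collections import Counter


def learn_patterns(seeds, sentences, min_count=2):
    """Collect the text between each seed entity and its attribute value.

    seeds: list of (entity, value) string pairs known to be correct.
    Returns the infix strings seen at least min_count times, most frequent first.
    """
    counts = Counter()
    for entity, value in seeds:
        for sentence in sentences:
            start = sentence.find(entity)
            end = sentence.find(value)
            # Keep only sentences where the entity precedes the value.
            if start == -1 or end == -1 or start + len(entity) > end:
                continue
            counts[sentence[start + len(entity):end]] += 1
    return [infix for infix, count in counts.most_common() if count >= min_count]


def apply_patterns(patterns, entity, sentences):
    """Use learned infixes to extract attribute values for a new entity."""
    values = []
    for infix in patterns:
        # The value is assumed to be a single word-like token.
        regex = re.compile(re.escape(entity) + re.escape(infix) + r"(\w+)")
        for sentence in sentences:
            match = regex.search(sentence)
            if match:
                values.append(match.group(1))
    return values


# Toy data for a "birth year" attribute.
seeds = [("Mozart", "1756"), ("Gandhi", "1869")]
sentences = [
    "Mozart was born in 1756 in Salzburg.",
    "Gandhi was born in 1869 in Porbandar.",
    "Einstein was born in 1879 in Ulm.",
]
patterns = learn_patterns(seeds, sentences)
extracted = apply_patterns(patterns, "Einstein", sentences)
print(patterns, extracted)
```

In our setting the sentences would be those attached to a chain, and the entity string would be one of the chain's named mentions.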

Superpowers

We have none. But in terms of our individual backgrounds, Bo and Rushin have been working with Bob Frederking and Anatole Gershman on entity extraction and co-reference resolution [3], and Kevin has been working on question answering and computer-assisted language learning.