Shuguang's project abstract

From Cohen Courses
Revision as of 23:02, 29 September 2010 by PastStudents

Information Extraction (10-707) Project Proposal

Team Member

Shuguang Wang [swang@cs.pitt.edu]

Problem

In this project, we will work on an open information extraction (IE) problem. In an open domain, we cannot assume that the relations are predetermined before query time. Several projects and systems have been proposed for the open information extraction task. They use various ways to construct training data and train a model for each type of relation. In this project, we will look at the problem from a different perspective.

Noun phrases are usually seen as potential entities mentioned in the text, and the relations (if any) between them are expressed in their surrounding context. Given a list of noun phrase pairs and their contexts as input, can we identify some interesting relations by building a single classifier?

Plan

Current open IE approaches try to determine whether a string of text expresses a certain type of relation between two entities. In this project, we instead determine whether a string of text from the context expresses an interesting relation between two entities. A natural way to approach this is to treat it as a classification problem.

There are at least a couple of issues in this task that are not yet clear to me. I will need to look at the data to see exactly what we can do. First, we need to explore different possible features. Second, we may need to generate training data for the classifier, as we may not have labels.
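To make the first issue more concrete, here is a minimal sketch of one kind of feature that could be explored: the words and word bigrams that appear between the two noun phrases. The function name `context_features`, the feature naming scheme, and the choice of lexical features are all assumptions made for illustration; the actual feature set would depend on what the data looks like.

```python
import re


def context_features(sentence, np1, np2):
    """Build a sparse feature dict from the text between two noun phrases.

    Hypothetical helper: returns None if np1 is not found or np2 does not
    occur after np1 in the sentence.
    """
    start = sentence.find(np1)
    if start < 0:
        return None
    end = sentence.find(np2, start + len(np1))
    if end < 0:
        return None

    # The "context" here is simply the text between the two phrases.
    between = sentence[start + len(np1):end]
    tokens = re.findall(r"[a-z0-9']+", between.lower())

    # Bucketed length of the context (capped at 5 tokens).
    features = {"between_len=%d" % min(len(tokens), 5): 1.0}
    # Unigram features.
    for tok in tokens:
        features["word=" + tok] = 1.0
    # Bigram features over adjacent context tokens.
    for a, b in zip(tokens, tokens[1:]):
        features["bigram=%s_%s" % (a, b)] = 1.0
    return features


if __name__ == "__main__":
    feats = context_features(
        "Pittsburgh is located in Pennsylvania.", "Pittsburgh", "Pennsylvania"
    )
    for name in sorted(feats):
        print(name)
```

Richer features (part-of-speech tags, parse paths, and so on) could be added to the same dictionary without changing how a classifier consumes it.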

Motivation

This task can be seen as a complement to other open Information Extraction projects such as KnowItAll and ReadTheWeb.

Dataset

We should have access to the ReadTheWeb data for this task.

Techniques

Standard classification methods will be used.
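The proposal does not commit to a particular method, so the following is only a sketch of what a single binary classifier over noun phrase pair contexts could look like. A plain perceptron is used here purely as an example of a standard method; the function names, the label convention (+1 for "interesting relation", -1 otherwise), and the toy feature dictionaries are all assumptions.

```python
from collections import defaultdict


def train_perceptron(examples, epochs=10):
    """Train a binary perceptron.

    examples: list of (feature_dict, label) pairs, with label in {+1, -1}.
    Returns a dict-like mapping from feature name to weight.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            score = sum(weights[f] * v for f, v in feats.items())
            # Update only on mistakes (a score of 0 counts as a mistake).
            if label * score <= 0:
                for f, v in feats.items():
                    weights[f] += label * v
    return weights


def predict(weights, feats):
    """Return +1 if the score is positive, otherwise -1."""
    score = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1 if score > 0 else -1


if __name__ == "__main__":
    # Toy, hand-made examples; real training data would have to be generated.
    toy = [
        ({"word=founded": 1.0}, 1),
        ({"word=and": 1.0}, -1),
        ({"word=acquired": 1.0}, 1),
        ({"word=or": 1.0}, -1),
    ]
    w = train_perceptron(toy)
    print(predict(w, {"word=founded": 1.0}))
    print(predict(w, {"word=and": 1.0}))
```

Any other standard classifier (logistic regression, SVM, naive Bayes) could be substituted, since all of them can consume the same sparse feature dictionaries.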

Evaluation

The task will be evaluated by a human judge (myself) on a set of randomly selected text samples. If time permits, I will also try to use the extracted relations in some IR tasks to see whether they are useful in practice.
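The manual evaluation could be organized roughly as follows: draw a reproducible random sample of extractions, judge each one by hand, and report the fraction judged correct. This is a sketch under assumptions; the function names, the fixed seed, and the use of precision as the reported number are illustrative choices, not something the proposal specifies.

```python
import random


def sample_for_judging(extractions, k, seed=0):
    """Draw up to k extractions uniformly at random, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(extractions, min(k, len(extractions)))


def precision(judgments):
    """Fraction of judged extractions marked correct.

    judgments: list of booleans, True if the judge accepted the extraction.
    Returns 0.0 for an empty list.
    """
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)


if __name__ == "__main__":
    # Placeholder extraction IDs standing in for real classifier output.
    extractions = list(range(100))
    batch = sample_for_judging(extractions, 10)
    print(len(batch))
    # Hand-entered judgments would go here; these values are made up.
    print(precision([True, True, False, True]))
```

Fixing the seed means the same sample can be re-judged later, for example to compare two versions of the classifier on identical text.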

My Expertise

I do not really have any superpowers for this task, but I am familiar with many machine learning frameworks and have been working with text data in NLP and IR tasks for some time.