Shuguang's project abstract
Information Extraction (10-707) Project Proposal
Team Member
Shuguang Wang [swang@cs.pitt.edu]
Problem
In this project, we will work on an open information extraction (IE) problem. In an open domain, we cannot assume that the relations are determined before query time. Several systems have been proposed for the open information extraction task. They use various ways to construct training data and train a model for each type of relation. In this project, we will look at the problem from a different perspective.
Noun phrases are usually seen as potential entities mentioned in the text, and the relations (if any) between them are expressed in their surrounding context. Given a list of noun phrase pairs and their contexts, can we identify interesting relations by building a single classifier?
Plan
Current open IE approaches try to determine whether a string of text expresses a certain type of relation between two entities. In this project, we instead determine whether a string of text from the context expresses an interesting relation between two entities. A natural way to approach this is to treat it as a classification problem.
There are at least a couple of issues in this task that are not yet clear to me. I will need to look at the data to see exactly what we can do. First, we need to explore different possible features. Second, we may need to generate training data for the classifier, since we may not have labels.
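As a rough illustration of the classification framing, the sketch below turns the text between two noun phrases into bag-of-words features and trains a single perceptron to separate "interesting" from "uninteresting" contexts. Everything here is an assumption made for illustration: the function names, the choice of features, the perceptron (standing in for whichever standard classifier is eventually used), and the hand-labelled toy sentences, which are not taken from the ReadTheWeb data.

```python
def context_features(sentence, np1, np2):
    """Bag-of-words features from the text between two noun phrases.

    Returns an empty dict if np1 is missing or np2 does not follow it.
    """
    start = sentence.find(np1)
    if start < 0:
        return {}
    end = sentence.find(np2, start + len(np1))
    if end < 0:
        return {}
    tokens = sentence[start + len(np1):end].lower().split()
    feats = {"bias": 1}
    for tok in tokens:
        key = "w=" + tok
        feats[key] = feats.get(key, 0) + 1
    return feats


def score(weights, feats):
    """Dot product between the weight vector and a sparse feature dict."""
    return sum(weights.get(f, 0) * v for f, v in feats.items())


def train_perceptron(examples, epochs=10):
    """examples: list of (features, label) pairs, with label in {+1, -1}."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            # Update only on mistakes (or on a zero score).
            if label * score(weights, feats) <= 0:
                for f, v in feats.items():
                    weights[f] = weights.get(f, 0) + label * v
    return weights


def predict(weights, feats):
    return 1 if score(weights, feats) > 0 else -1


# Hypothetical toy data: (sentence, first NP, second NP, label).
# +1 means the context expresses an interesting relation, -1 means it does not.
TOY = [
    ("Google acquired YouTube", "Google", "YouTube", 1),
    ("Pittsburgh is located in Pennsylvania", "Pittsburgh", "Pennsylvania", 1),
    ("Pittsburgh and Pennsylvania", "Pittsburgh", "Pennsylvania", -1),
    ("Google , YouTube", "Google", "YouTube", -1),
]

examples = [(context_features(s, a, b), y) for s, a, b, y in TOY]
weights = train_perceptron(examples)
```

On real data, the labels would have to come from the training-data generation step discussed above, and the feature set would likely need to go beyond the words between the two phrases.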
Motivation
This task can be seen as a complement to other open Information Extraction projects such as KnowItAll and ReadTheWeb.
Dataset
We should have access to the ReadTheWeb data for this task.
Techniques
Standard classification methods will be used.
Evaluation
The results will be evaluated manually (by myself) on a randomly selected set of texts. If time permits, I will also try to use the extracted relations in some IR tasks to see whether they are useful in practice.
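The manual evaluation could be organized along the following lines: draw a reproducible random sample of extractions, judge each one by hand, and report the fraction judged correct. This is only a sketch; the function names, the sample size, and the fixed seed are assumptions, not part of any existing tool.

```python
import random


def sample_for_review(extractions, k, seed=0):
    """Draw up to k extractions uniformly at random, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(extractions, min(k, len(extractions)))


def precision(judgments):
    """judgments: list of booleans, True if the extraction was judged correct."""
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)
```

Fixing the seed means the same sample can be re-drawn later, for example if the judgments need to be revisited.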
My Expertise
I do not really have any superpowers for this task, but I am familiar with many machine learning frameworks and have worked with text data for some time in NLP and IR tasks.