Shuguang's project abstract

From Cohen Courses
Latest revision as of 15:07, 7 October 2010

Information Extraction (10-707) Project Proposal

Team Member

Shuguang Wang [swang@cs.pitt.edu]

Problem

In this project, we will work on an open information extraction (IE) problem. In an open domain, we cannot assume that the relations are predetermined before query time. Several systems (TextRunner, WOE, and NELL) have been proposed for the open information extraction task. They construct training data in various ways and train a model for each type of relation. In this project, we will look at the problem from a different perspective.

Noun phrases are usually treated as potential entities mentioned in the text, and the relations (if any) between them are expressed in their surrounding context. Previous open IE systems ask whether the context expresses a particular relation between two entities, and build a binary classifier for each type of relation. Instead of analyzing the different types of relations separately, we will try to build a single classifier with non-lexical features that identifies the "interesting" relations among all candidates.

Plan

Current open IE approaches try to determine whether a string of text expresses a certain type of relation between two entities. In this project, we instead determine whether a string of text from the context represents an interesting relation between two entities. A natural way to approach this is to treat it as a classification problem.

There are a few important issues in this task:

  1. Training data
    • We do not really have "labels" for the data, but we do have the ontology extracted by NELL. We will use its confident patterns as surrogates for the interesting relations.
  2. Features
    • We need non-lexical features that can be used across different types of relations. We will extract them from the relations.
  3. Relations for "types"
    • Although relations usually connect entities, it would be useful to see whether we can identify relations that connect types of entities. This requires some hierarchical information among NPs, which we may be able to learn automatically.
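
To make the "non-lexical features" item more concrete, here is a minimal sketch of what such features could look like for a context string. The specific features (token count, average token length, capitalization ratio, digits, punctuation) and the function name are hypothetical illustrations, not the feature set chosen for the project. What they have in common is that none of them depends on which words appear, so the same features apply to every relation type.

```python
import string


def non_lexical_features(context: str) -> dict:
    """Describe a context string without using the identity of its words.

    The feature names below are hypothetical illustrations only.
    """
    tokens = context.split()
    n = len(tokens)
    return {
        # Length of the context in whitespace-separated tokens.
        "num_tokens": n,
        # Average token length in characters (0.0 for an empty context).
        "avg_token_len": sum(len(t) for t in tokens) / n if n else 0.0,
        # Fraction of tokens that start with an uppercase letter.
        "frac_capitalized": sum(t[0].isupper() for t in tokens) / n if n else 0.0,
        # Whether any digit occurs in the context.
        "has_digit": any(ch.isdigit() for ch in context),
        # Number of ASCII punctuation characters in the context.
        "num_punct": sum(ch in string.punctuation for ch in context),
    }


# Toy context, invented for illustration.
features = non_lexical_features("is the capital of")
```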

Motivation

This task can be seen as a complement to other open Information Extraction projects such as KnowItAll and ReadTheWeb.

Dataset

We have access to the ReadTheWeb data for this task, and two types of data will be used:

  • The ontology extracted by ReadTheWeb.
    • Patterns with very high confidence give us the positive examples of "interesting" relations.
  • The list of relations in the form of NP pairs and their context.
    • We will use this data to extract the non-lexical features. Also, we can get negative examples for the training data here.
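
The way the two data sources combine into a labeled training set can be sketched as follows. This is only an illustration under several assumptions: the ontology is represented as a mapping from pattern to confidence, the 0.9 threshold is arbitrary, and any context that is not a confident pattern is treated as a negative example. The proposal does not fix any of these choices, and the function and variable names are hypothetical.

```python
def build_training_examples(pattern_confidence, contexts, threshold=0.9):
    """Return (context, label) pairs: 1 = "interesting", 0 = negative.

    pattern_confidence: dict mapping a pattern string to a confidence score
        (assumed representation of the ontology).
    contexts: iterable of context strings from the NP-pair data.
    threshold: hypothetical cut-off for "very high confidence".
    """
    # Confident patterns become the positive examples.
    positives = sorted(p for p, c in pattern_confidence.items() if c >= threshold)
    positive_set = set(positives)
    # Assumption: every other observed context is treated as a negative.
    negatives = sorted({c for c in contexts if c not in positive_set})
    return [(p, 1) for p in positives] + [(n, 0) for n in negatives]


# Toy inputs, invented for illustration.
toy_ontology = {"is the capital of": 0.97, "was seen near": 0.40}
toy_contexts = ["is the capital of", "was seen near", "and also"]
examples = build_training_examples(toy_ontology, toy_contexts)
```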

Techniques

Standard classification methods can be used in this project. Because we will explore many possible features, we will use SVM and MaxEnt models.
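
For the binary "interesting or not" decision, a MaxEnt model is equivalent to logistic regression. The sketch below trains one with plain stochastic gradient descent on a single toy feature. It is meant only to show the shape of the model; the learning rate, epoch count, data, and function names are assumptions, and a real experiment would more likely rely on an existing library implementation.

```python
import math


def train_maxent(X, y, lr=0.5, epochs=500):
    """Fit a binary MaxEnt (logistic regression) model with SGD.

    X: list of feature vectors (lists of floats); y: list of 0/1 labels.
    lr and epochs are hypothetical settings chosen for the toy data.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(label = 1)
            g = p - yi                      # gradient of the log loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    """Return 1 if the model assigns probability >= 0.5 to the positive class."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1 if z >= 0 else 0


# Toy, linearly separable data with one feature, invented for illustration.
X = [[0.0], [0.2], [0.8], [1.0]]
y = [0, 0, 1, 1]
w, b = train_maxent(X, y)
predictions = [predict(w, b, x) for x in X]
```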

Evaluation

There are at least a couple of ways to evaluate this task. We will use F1 as the evaluation metric.

  1. Use the ontology of ReadTheWeb
    1. We will first select all confident patterns from the ontology.
    2. Then we will split them into training (80%), development (10%), and test (10%) sets.
    3. All of these sets will be mixed with negative examples extracted from the context data.
    4. Performance will be evaluated on the test set, i.e., by checking whether we can predict these patterns correctly.
  2. Evaluation by a human judge (if time permits)
    1. I will look at a couple of hundred randomly selected relations.
    2. Relations that could potentially be used in many domains will be judged as interesting.
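
The 80/10/10 split and the F1 metric from the first evaluation scheme can be sketched as below. The function names and the fixed random seed are assumptions made for illustration. The sketch splits a single list of items, and mixing negative examples into each set is left out.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle items and split them into training, development, and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed only for reproducibility
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    return (
        items[:n_train],
        items[n_train:n_train + n_dev],
        items[n_train + n_dev:],
    )


def f1_score(gold, predicted):
    """F1 for the positive class (label 1), given parallel lists of 0/1 labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


train, dev, test = split_80_10_10(range(100))
score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])  # precision = recall = 0.5
```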

My Expertise

I do not really have a superpower for this task, but I am familiar with many machine learning frameworks and have been working with text data for some time in NLP and IR tasks.