Shuguang's project abstract
Information Extraction (10-707) Project Proposal
Team Member
Shuguang Wang [swang@cs.pitt.edu]
Problem
In this project, we will work on an open information extraction (IE) problem. In an open domain, we cannot assume that the relations are determined before query time. Several systems (TextRunner, WOE, and NELL) have been proposed for the open information extraction task. They use various ways to construct training data and then train a model for each type of relation. In this project, we will look at the problem from a different perspective.
Noun phrases are usually seen as potential entities mentioned in the text, and the relations (if any) between them are expressed in the context around them. Previous Open IE systems ask whether the context expresses a particular relation between two entities, and build a binary classifier for each type of relation. Instead of analyzing the different types of relations separately, we will try to build a single classifier with non-lexical features that identifies the "interesting" relations among all candidates.
Plan
Current open IE approaches try to determine whether a string of text expresses a certain type of relation between two entities. In this project, we instead determine whether a string of text from the context represents an interesting relation between two entities. A natural way to handle this is to treat it as a binary classification problem.
There are a few important issues in this task:
- Training data
  - We do not really have "labels" for the data, but we do have the ontology extracted by NELL. We will use its confident patterns as surrogates for the interesting relations.
- Features
  - We need non-lexical features that can be used across different types of relations. We will extract them from the relations.
- Relations for "types"
  - Although relations usually connect entities, it is also useful to see whether we can identify relations that connect types of entities. This requires some hierarchical information among NPs, and we may be able to learn it automatically.
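To make "non-lexical features" concrete, here is a minimal sketch of what such a feature extractor could look like. The specific features (token count, average token length, capitalization, digits, punctuation) and the function name `non_lexical_features` are illustrative assumptions, not the feature set the project commits to.

```python
import re

def non_lexical_features(context: str) -> dict:
    """Map a context string to features that do not depend on specific words.

    The feature set here is hypothetical and only meant as an illustration.
    """
    tokens = context.split()
    n = len(tokens)
    return {
        # Length of the context in whitespace-separated tokens.
        "num_tokens": n,
        # Average token length in characters (0.0 for an empty context).
        "avg_token_len": sum(len(t) for t in tokens) / n if n else 0.0,
        # Fraction of tokens that start with an uppercase letter.
        "frac_capitalized": sum(t[0].isupper() for t in tokens) / n if n else 0.0,
        # Whether the context contains any digit.
        "has_digit": any(ch.isdigit() for ch in context),
        # Whether the context contains any punctuation character.
        "has_punct": bool(re.search(r"[^\w\s]", context)),
    }

feats = non_lexical_features("is the capital of")
```

Because none of these features looks at the identity of the words, the same extractor can be applied to contexts from any relation type.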
Motivation
This task can be seen as a complement to other open Information Extraction projects such as KnowItAll and ReadTheWeb.
Dataset
We have access to the ReadTheWeb data for this task, and two types of data will be used:
- The ontology extracted by ReadTheWeb.
  - Patterns with very high confidence give us the positive examples of "interesting" relations.
- The list of relations in the form of NP pairs and their context.
  - We will use this data to extract the non-lexical features. We can also get negative examples for the training data from here.
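As a rough sketch of how the two data sources could be combined into labeled examples: the input layout (a pattern-to-confidence mapping and a list of NP/context/NP triples), the 0.9 threshold, and the rule that every context outside the confident set counts as a negative are all assumptions made for illustration, and the toy values are invented rather than taken from ReadTheWeb.

```python
def build_examples(pattern_conf, triples, threshold=0.9):
    """Label each (np1, context, np2) triple as interesting (1) or not (0).

    pattern_conf: hypothetical mapping from pattern string to confidence.
    triples: hypothetical list of (np1, context, np2) tuples.
    threshold: assumed cutoff for a "confident" pattern.
    """
    confident = {p for p, c in pattern_conf.items() if c >= threshold}
    examples = []
    for np1, context, np2 in triples:
        # Assumption: contexts not matching a confident pattern are negatives.
        label = 1 if context in confident else 0
        examples.append(((np1, context, np2), label))
    return examples

# Toy inputs, invented for this sketch.
pattern_conf = {"is the capital of": 0.97, "and": 0.20}
triples = [
    ("Paris", "is the capital of", "France"),
    ("cats", "and", "dogs"),
    ("he", "went to", "school"),
]
examples = build_examples(pattern_conf, triples)
labels = [label for _, label in examples]
```

Treating every unmatched context as a negative is noisy, since some of those contexts may be interesting relations that the ontology simply does not cover yet.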
Techniques
Standard classification methods can be used in this project. Since we will explore many possible features, we will use SVM and MaxEnt models.
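For a binary decision such as interesting vs. not interesting, a MaxEnt classifier is equivalent to logistic regression. The sketch below trains one with plain stochastic gradient descent so that it has no dependencies; the learning rate, epoch count, feature columns, and toy values are assumptions for illustration, and an actual experiment would more likely rely on an existing library.

```python
import math

def train_logreg(X, y, lr=0.1, epochs=500):
    """Fit binary logistic regression (binary MaxEnt) by per-example SGD."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(label = 1)
            err = p - yi                      # gradient of the log loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Return 1 if the model scores x as more likely positive than negative."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z >= 0 else 0

# Toy feature vectors with two hypothetical, pre-scaled non-lexical features.
X = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
preds = [predict(w, b, x) for x in X]
```

The same feature vectors could be fed unchanged to an SVM, which makes it easy to compare the two model families on the development set.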
Evaluation
There are at least a couple of ways to evaluate this task. We will use F1 as the evaluation metric.
- Use the ontology of ReadTheWeb
  - We will first select all confident patterns from the ontology.
  - Then we will split them into training (80%), development (10%), and test (10%) sets.
  - All of these sets will be mixed with negative examples extracted from the context data.
  - Performance will be evaluated on the test set, i.e., we will see whether we can predict these patterns correctly.
- Human evaluation (if time permits)
  - I will look at a couple of hundred randomly selected relations.
  - Relations that can potentially be used in many domains will be judged as interesting.
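The 80/10/10 split and the F1 computation described above can be sketched as follows. The function names, the fixed random seed, and the convention of returning 0.0 when there are no true positives are assumptions of this sketch.

```python
import random

def split_80_10_10(items, seed=0):
    """Shuffle items and split them into train (80%), dev (10%), test (rest)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]
    return train, dev, test

def f1_score(gold, pred):
    """F1 for the positive class (label 1) from parallel label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # convention chosen here when precision/recall are undefined or zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

train, dev, test = split_80_10_10(range(100))
score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
```

In the toy call, one positive is found, one is missed, and one negative is wrongly flagged, so precision and recall are both 0.5.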
My Expertise
I do not really have a particular superpower for this task, but I am familiar with many machine learning frameworks and have been working with text data for some time in NLP and IR tasks.