Shuguang's Project Report
Information Extraction (10-707) Project Report
Team Member
Shuguang Wang [swang@cs.pitt.edu]
Problem
In this project, we work on an open information extraction (IE) problem. In an open domain, we cannot assume the relations are predetermined before query time. Several projects/systems (TextRunner, WOE, and NELL) have been proposed for the open information extraction task. They used various ways to construct training data and trained a model for each type of relation. In this project, we look at the problem from a different perspective.
Noun phrases are usually seen as potential entities mentioned in the text, and the relations (if any) between them are expressed in their surrounding context. Previous open IE systems answer the question of whether the context expresses a particular relation between two entities, and build a binary classifier for each type of relation. Instead of analyzing the different types of relations separately, we try to build a single classifier with non-lexical features that identifies "interesting" relations among all candidates.
In other words, current open IE approaches determine whether a string of text expresses a certain, predefined type of relation between two entities; in this project we determine whether a string of context represents an interesting relation between two entities. A natural way to deal with this is to treat it as a classification problem.
There are a few important issues in this task:
- Training data
- We do not really have "labels" for the data, but we do have the ontology extracted by NELL. We use its confident patterns as surrogates for interesting relations.
- Features
- We need non-lexical features that can be used across different types of relations. We extract them from the relations themselves.
- Relations for "types"
- Although relations tend to connect entities, it is also useful to see whether we can identify relations that connect types of entities. This requires some hierarchical information among NPs, which we may be able to learn automatically.
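As a sketch of what the non-lexical features might look like (the exact feature set is still to be decided; the features below are illustrative assumptions, not the final design), a context pattern such as "arg1 played better arg2" could be mapped to features like pattern length, argument order, and the gap between arguments:

```python
import re

def non_lexical_features(context):
    """Illustrative non-lexical features for a context pattern such as
    "arg1 played better arg2". This feature set is a sketch, not the
    project's final design."""
    tokens = context.split()
    i1 = tokens.index("arg1") if "arg1" in tokens else -1
    i2 = tokens.index("arg2") if "arg2" in tokens else -1
    return {
        "num_tokens": len(tokens),                      # overall pattern length
        "tokens_between": abs(i2 - i1) - 1 if i1 >= 0 and i2 >= 0 else 0,
        "arg1_first": i1 >= 0 and i2 >= 0 and i1 < i2,  # argument order
        "has_punct": bool(re.search(r"[^\w\s]", context)),
    }
```

Because none of these features mention specific words, the same feature function applies to contexts of any relation type, which is what lets a single classifier cover all candidates.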
Raw Data from NELL and ClueWeb
We have access to the NELL (ReadTheWeb) ontology and the context for all NPs extracted from ClueWeb. We analyzed both data sources to understand their structure.
- The NELL ontology
- We only use the relations in this ontology, i.e., (arg1, relation, arg2).
- An example of a relation in this ontology is (pirates, teamplayssport, baseball).
- Each instance of relation contains some extra information:
- Literal strings of entities: pirates --> "Pirates", "pirates" and baseball --> "Baseball", "baseball", "BASEBALL".
- Confidence scores of relations: between 0 and 1 (probability-like numbers, but not true probabilities)
- List of relevant context for this relation: "arg1 played better arg2", "arg2 operations for arg1", ...
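To make the structure of a relation instance concrete, a minimal record could look like the following (the field names and the confidence value are my own illustration; the actual NELL dump uses its own serialization):

```python
from dataclasses import dataclass

@dataclass
class RelationInstance:
    # Hypothetical container mirroring the fields described above.
    arg1: str
    relation: str
    arg2: str
    literals1: tuple = ()   # literal strings observed for arg1
    literals2: tuple = ()   # literal strings observed for arg2
    confidence: float = 0.0 # probability-like score in [0, 1]
    contexts: tuple = ()    # relevant context patterns

inst = RelationInstance(
    "pirates", "teamplayssport", "baseball",
    literals1=("Pirates", "pirates"),
    literals2=("Baseball", "baseball", "BASEBALL"),
    confidence=0.93,        # made-up score, for illustration only
    contexts=("arg1 played better arg2", "arg2 operations for arg1"),
)
```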
- The ClueWeb data
- The list of context for all NP pairs.
- An example of NP pairs: "Pirates || baseball".
- The list of context for this pair: "arg2 are the Pittsburgh arg1", "arg1 succeed and arg2", "arg2 players than arg1", ...
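Assuming each record stores the NP pair in the "NP1 || NP2" form followed by a tab-separated context list (the exact on-disk format is an assumption here, not documented above), a record could be parsed as:

```python
def parse_pair_record(line):
    """Split a record like
    "Pirates || baseball<TAB>arg2 are the Pittsburgh arg1<TAB>..."
    into the NP pair and its context list. The tab-separated layout
    is assumed for illustration."""
    pair_part, *contexts = line.rstrip("\n").split("\t")
    np1, np2 = (s.strip() for s in pair_part.split("||"))
    return (np1, np2), contexts

(np1, np2), ctx = parse_pair_record(
    "Pirates || baseball\targ2 are the Pittsburgh arg1\targ1 succeed and arg2"
)
```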
Statistics of NELL Ontology
- 10,392 instances of 98 types of relations extracted
- 8,987 unique entities in the extracted data
- ~7.7 relevant contexts per relation instance
Statistics of ClueWeb Context Data
- Over 125 million NP pairs.
- Over 5 million unique NPs.
- ~14.1 contexts per NP pair
Constructing Training Data
Given the raw data from NELL and ClueWeb, we construct the training data. This task can be seen as a complement to other open information extraction projects such as KnowItAll and ReadTheWeb.
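A rough sketch of the construction (the confidence threshold, negative-sampling ratio, and function names are assumptions, since the report does not fix them): contexts attached to confident NELL relations become positive examples, and ClueWeb contexts that do not appear among the positives are sampled as negatives.

```python
import random

def build_training_data(nell_instances, clueweb_contexts,
                        conf_threshold=0.9, neg_ratio=1, seed=0):
    """Label the contexts of confident NELL relations as positive and
    sample other ClueWeb contexts as negative. Threshold and sampling
    ratio are illustrative choices, not the report's final settings.

    nell_instances: iterable of (confidence, [context, ...]) pairs
    clueweb_contexts: iterable of context strings
    """
    positives = [ctx for conf, contexts in nell_instances
                 if conf >= conf_threshold
                 for ctx in contexts]
    pool = [c for c in clueweb_contexts if c not in set(positives)]
    rng = random.Random(seed)
    negatives = rng.sample(pool, min(len(pool), neg_ratio * len(positives)))
    return [(c, 1) for c in positives] + [(c, 0) for c in negatives]
```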
Learning
Standard classification methods can be used for this task. As we will explore many possible features, we use SVM and MaxEnt models.
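For concreteness, a minimal MaxEnt model (binary logistic regression trained by stochastic gradient descent) over the non-lexical feature vectors could look like this; it is a sketch of the model class, not the actual experimental setup, and an SVM could be swapped in by replacing the log-loss gradient with a hinge-loss one.

```python
import math

def train_maxent(X, y, lr=0.5, epochs=200):
    """Binary logistic regression (MaxEnt) via SGD on log-loss.
    X: list of feature vectors; y: list of 0/1 labels."""
    n_feats = len(X[0])
    w = [0.0] * n_feats
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # P(interesting | features)
            g = p - yi                      # gradient of the log-loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Classify a feature vector with the learned weights."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z > 0 else 0
```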
Preliminary Results
There are at least a couple of ways to evaluate this task. We use F1 as the evaluation metric.
- Use the ontology of ReadTheWeb
- We first select all confident patterns from the ontology.
- Then we split them into training (80%), development (10%), and test (10%) sets.
- All of these sets are mixed with negative examples extracted from the context data.
- Performance is evaluated on the test set, i.e., whether we can predict these patterns correctly.
- Another way of evaluating is human judgment (if time permits)
- I would look at a couple of hundred randomly selected relations.
- Relations that can potentially be used in many domains would be judged as interesting.
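Since F1 is the chosen metric, the score on the held-out test set can be computed as follows (the standard definition over the positive "interesting relation" class, shown for completeness):

```python
def f1_score(gold, pred):
    """F1 = harmonic mean of precision and recall, computed over the
    positive (interesting-relation) class of 0/1 label sequences."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```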