Shuguang's Project Report

Information Extraction (10-707) Project Report

Team Member

Shuguang Wang [swang@cs.pitt.edu]

Problem

In this project, we work on an open information extraction (IE) problem. In an open domain, we cannot assume that the relations are predetermined before query time. Several systems (TextRunner, WOE, and NELL) have been proposed for the open information extraction task. They use various ways to construct training data and train a model for each type of relation. In this project, we look at the problem from a different perspective.

Noun phrases (NPs) are usually seen as potential entities mentioned in the text, and the relations (if any) between them are expressed in their surrounding context. Previous open IE systems ask whether the context expresses a particular relation between two entities, and build a binary classifier for each type of relation. Instead of analyzing the different types of relations separately, we try to build a single classifier with non-lexical features that identifies "interesting" relations among all context candidates.

Current open IE approaches try to determine whether a string of text expresses a certain type of relation between two entities. In this project, we instead determine whether a string of text from the context represents an interesting relation between two entities. A natural way to handle this is to treat it as a classification problem.

There are a few important issues in this task:

  1. Training data
    • We do not really have "labels" for the data, but we do have the ontology extracted by NELL. Our first step is to construct training data from the raw data: the NELL ontology and ClueWeb.
  2. Features
    • We need non-lexical features that can be used across different types of relations. We would extract them from the relations.
  3. Relations for "types"
    • Although relations usually connect entities, it would be useful to see whether we can identify relations that connect types of entities. This requires some hierarchical information about NPs, which we may be able to learn automatically.

So far, we have finished constructing the training data and have run a simple bag-of-words baseline.

Raw Data from NELL and ClueWeb

We have access to the NELL (ReadTheWeb) ontology and to the contexts for all NPs extracted from ClueWeb. We have analyzed both data sets.

  • The NELL ontology
    • We only use the relations in this ontology, i.e., (arg1, relation, arg2).
      • An example relation in this ontology is (pirates, teamplayssport, baseball).
      • Each relation instance contains some extra information:
        • Literal strings of entities: pirates --> "Pirates", "pirates", and baseball --> "Baseball", "baseball", "BASEBALL".
        • Confidence scores of relations: 0 to 1 (probability-like numbers, but not true probabilities)
        • List of relevant contexts for this relation: "arg1 played better arg2", "arg2 operations for arg1", ...
  • The ClueWeb data
    • The list of contexts for all NP pairs.
      • An example NP pair: "Pirates || baseball".
      • The list of contexts for this pair: "arg2 are the Pittsburgh arg1", "arg1 succeed and arg2", "arg2 players than arg1", ...

Statistics of NELL Ontology

  • 10,392 instances of 98 types of relations extracted
  • 8,987 unique entities in the extracted data
  • ~7.7 relevant contexts per relation instance

Statistics of ClueWeb Context Data

  • Over 125 million NP pairs.
  • Over 5 million unique NPs.
  • ~14.1 contexts per NP pair.

Constructing Training Data

Given the raw data from NELL and ClueWeb, we construct the training data in two steps.

  1. Match the NP pairs from ClueWeb to the entities of relations in the NELL ontology.
    • If a pair of NPs exactly matches the literal strings of the entities, we know there is a relation between this NP pair.
  2. Identify the relevant contexts in the context list from ClueWeb.
    • In the context list of a matched NP pair,
      • a context is relevant if it matches one of the relevant contexts in the NELL ontology
      • a context is irrelevant otherwise
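
The two steps above can be sketched as follows. This is a minimal illustration, not the code used in the project: the dictionary layout (`arg1_literals`, `arg2_literals`, `contexts`) and the function names are assumptions, and the ClueWeb context list in the example is a toy list assembled from the strings quoted earlier.

```python
from itertools import product

def build_literal_index(relations):
    """Map every (arg1 literal, arg2 literal) combination to its relation instances."""
    index = {}
    for rel in relations:
        for pair in product(rel["arg1_literals"], rel["arg2_literals"]):
            index.setdefault(pair, []).append(rel)
    return index

def label_contexts(np_pair, contexts, index):
    """Step 1: exact match of the NP pair; step 2: mark each context relevant or not.

    Returns a list of (context, is_relevant) tuples, or [] if the pair matches no relation.
    """
    matched = index.get(np_pair)
    if not matched:
        return []
    relevant = set()
    for rel in matched:
        relevant.update(rel["contexts"])
    return [(context, context in relevant) for context in contexts]

# Hypothetical in-memory form of one ontology entry (field names are assumptions).
relations = [{
    "arg1": "pirates", "relation": "teamplayssport", "arg2": "baseball",
    "arg1_literals": ["Pirates", "pirates"],
    "arg2_literals": ["Baseball", "baseball", "BASEBALL"],
    "contexts": ["arg1 played better arg2", "arg2 operations for arg1"],
}]
index = build_literal_index(relations)

# Toy context list for the NP pair "Pirates || baseball".
labeled = label_contexts(
    ("Pirates", "baseball"),
    ["arg2 are the Pittsburgh arg1", "arg1 played better arg2"],
    index,
)
```

Indexing every literal combination up front keeps the per-pair lookup to a single dictionary access, which matters when scanning a very large number of NP pairs.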

Noisy Nature of the Ontology

Using this procedure, we find over 12,000 NP pairs that match relations in the NELL ontology. In this set, we have about 9.4 relevant contexts and about 66.2 irrelevant contexts per NP pair on average. However, they are not always good training data. Here is the histogram of the confidence scores of the matched relations.

[Figure: histogram of the confidence scores of the matched relations]

Most of the relations have confidence scores higher than 90%; however, some relations can still introduce noise even when their confidence exceeds 90%. For example, the NP pair (college || education) has over 7,000 irrelevant contexts. One of them, "arg1 are keen to offer arg2", does not actually seem irrelevant. It is not counted as relevant because the ontology has relatively low coverage of the relevant contexts for this relation, even though the confidence of this relation is 93%.

Therefore, we use very high confidence thresholds to construct the training data. A relevant context is used as a positive sample if the corresponding relation in the ontology has a confidence score greater than 99%, and an irrelevant context is used as a negative sample if the corresponding relation in the ontology has a confidence score greater than 99.9999%. We set an even higher threshold for negative samples to avoid introducing too many false negatives.
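
The thresholding rule can be written down in a few lines. This is a sketch under the assumption that each candidate context carries a relevance flag and the confidence score of its relation; the function and constant names are hypothetical.

```python
POSITIVE_MIN_CONFIDENCE = 0.99       # relevant contexts need confidence > 99%
NEGATIVE_MIN_CONFIDENCE = 0.999999   # irrelevant contexts need confidence > 99.9999%

def select_sample(is_relevant, confidence):
    """Return 1 for a positive sample, 0 for a negative sample, None to discard."""
    if is_relevant:
        return 1 if confidence > POSITIVE_MIN_CONFIDENCE else None
    return 0 if confidence > NEGATIVE_MIN_CONFIDENCE else None
```

With these thresholds, an irrelevant context from a relation with 93% confidence (as in the college/education example) is discarded rather than used as a negative sample.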

After setting the thresholds, we get over 120k positive samples and over 112k negative samples. We use a random 90%/10% split as the training and test sets.
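
A random 90%/10% split can be sketched as below. The fixed seed and the absence of stratification are assumptions made for reproducibility of the example; the report does not state how the split was drawn.

```python
import random

def split_train_test(samples, test_fraction=0.1, seed=0):
    """Shuffle a copy of the samples and hold out a fraction as the test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_train_test(range(100))
```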

Learning

The problem is treated as a classification task, and we use two standard classifiers, SVM and Logistic Regression. The tool we use is LIBLINEAR (Fan et al., JMLR 2008), which contains various SVM and Logistic Regression implementations.
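
As an illustration of how contexts could be fed to LIBLINEAR for the bag-of-words baseline, the sketch below writes samples in the sparse text format that LIBLINEAR reads (`<label> <index>:<value> ...`, with 1-based feature indices in ascending order). Whitespace tokenization, raw term counts, and keeping the `arg1`/`arg2` placeholders as tokens are assumptions, not details taken from the project.

```python
from collections import Counter

def build_vocabulary(contexts):
    """Assign a 1-based feature index to every distinct whitespace token."""
    vocab = {}
    for context in contexts:
        for token in context.split():
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

def to_sparse_line(label, context, vocab):
    """Encode one context as '<label> <index>:<count> ...' with ascending indices."""
    counts = Counter(vocab[token] for token in context.split() if token in vocab)
    features = " ".join(f"{index}:{counts[index]}" for index in sorted(counts))
    return f"{label:+d} {features}".rstrip()

contexts = ["arg1 played better arg2", "arg2 players than arg1"]
vocab = build_vocabulary(contexts)
lines = [to_sparse_line(+1, contexts[0], vocab),
         to_sparse_line(-1, contexts[1], vocab)]
```

Tokens that are not in the vocabulary are dropped, so test contexts can be encoded with the vocabulary built from the training set.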

Preliminary Results

So far, we have run a simple baseline system and obtained some preliminary results.

Baseline

We use SVM and Logistic Regression as the classifiers and the words in each context as the features (a bag-of-words model). The parameters of the classifiers were optimized using 5-fold cross-validation on the training data. The evaluation metrics are accuracy, precision, recall, and F1. The table below summarizes the results.

Evaluation Results on Test Set

            Logistic Regression   SVM
Accuracy    0.848                 0.850
Precision   0.826                 0.831
Recall      0.895                 0.893
F1          0.859                 0.861
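
For reference, the four metrics can be computed from gold labels and predictions as in the sketch below (labels are assumed to be 1 for positive and 0 for negative; the toy labels are made up). The last two lines check that the reported F1 values are consistent with the reported precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }

# Toy labels, only to exercise the function.
metrics = evaluate([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])

# Consistency check against the table: F1 recomputed from precision and recall.
f1_logistic_regression = round(f1_score(0.826, 0.895), 3)
f1_svm = round(f1_score(0.831, 0.893), 3)
```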

Initial Analysis

The results are surprisingly good considering that we only use simple words as features. Assuming we have done everything correctly so far, these results suggest that we have a good chance of predicting whether a context is relevant to a relation.

Below is the histogram of feature weights from L1-regularized Logistic Regression. The plot shows that some terms are important to the classification. Here is the list of the top 10 terms with the highest absolute weights.

  1. Ferrari
  2. Dentistry
  3. talented
  4. Estate
  5. Pierce
  6. leader
  7. travelers
  8. Companies
  9. plane
  10. Hotels

[Figure: histogram of feature weights from L1-regularized Logistic Regression (L1lr weights hist.jpg)]
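
Ranking terms by absolute weight, as in the list above, amounts to a single sort. The sketch assumes the learned weights are available as a term-to-weight dictionary; the terms and weights in the example are placeholders, not values from the trained model.

```python
def top_terms(weights, k=10):
    """Return the k terms with the largest absolute weight, largest first."""
    return sorted(weights, key=lambda term: abs(weights[term]), reverse=True)[:k]

# Placeholder weights for illustration only.
example_weights = {"term_a": 1.5, "term_b": 0.0, "term_c": -2.0, "term_d": 0.1}
ranked = top_terms(example_weights, k=2)
```

Sorting on the absolute value keeps strongly negative terms in the ranking, which is useful because they are as informative for the classifier as strongly positive ones.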

Next Steps

We plan to do the following next.

  • Verify and analyze the results.
    • Manually select a couple of hundred contexts and annotate them, then run the classifier on this new data to see whether we get consistent results.
  • If the results are consistent,
    • we will do some feature engineering, especially on non-lexical features such as POS tags, counts, and lengths.
    • we will try to find contexts that are common across a class of entities rather than specific ones.
    • we may also revisit the raw data to see whether the less confident relations contain some good data.
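
As a rough idea of what count- and length-based non-lexical features could look like, the sketch below derives a few from a context string. The specific features and their names are hypothetical choices for illustration; POS-based features would need a tagger and are not covered here.

```python
def non_lexical_features(context):
    """Derive simple count- and length-based features from a context string."""
    tokens = context.split()
    positions = {t: i for i, t in enumerate(tokens) if t in ("arg1", "arg2")}
    both_args = len(positions) == 2
    return {
        "num_tokens": len(tokens),
        "num_chars": len(context),
        # Number of tokens strictly between the two argument slots (-1 if a slot is missing).
        "tokens_between_args": abs(positions["arg1"] - positions["arg2"]) - 1 if both_args else -1,
        "arg1_first": both_args and positions["arg1"] < positions["arg2"],
        "num_capitalized": sum(1 for t in tokens if t[:1].isupper()),
    }

features = non_lexical_features("arg2 are the Pittsburgh arg1")
```

None of these features depends on the identity of a particular word, so they can be shared across relation types.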