Talukdar and Pereira ACL 2010

From Cohen Courses
Revision as of 13:45, 30 November 2010 by PastStudents

Citation

Partha Pratim Talukdar and Fernando Pereira. 2010. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10). Association for Computational Linguistics, Morristown, NJ, USA, 1473-1481.

Online version

ACL Anthology

Summary

This paper presents an empirical comparison of three graph-based semi-supervised learning methods on the Class-Instance Acquisition task.

Motivation

Traditional NER systems have focused on a small number of classes such as person and location. In practice, these classes are too broad to be useful for applications like word sense disambiguation and textual inference. Training data for supervised learning of fine-grained classes is limited. Seed-based information extraction systems have therefore been developed to extract new instances of a class from unstructured text, starting from a few seed instances of that class.

Methods Compared

The general idea of graph-based semi-supervised learning is as follows: given a connectivity graph that contains both labeled and unlabeled data, the labels of the labeled data are propagated to the unlabeled data through the graph, subject to some constraints.

LP-ZGL (Zhu et al., ICML 2003) is one of the first graph-based semi-supervised learning methods. It propagates the labels of the training data while ensuring that the label assignment is smooth over the graph and that the labels of the training data are preserved. Smoothness (the manifold assumption) means that two strongly connected nodes in the graph should have the same or similar labels.
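
The propagation idea described above can be illustrated with a minimal sketch. It assumes a symmetric weighted graph stored as nested dictionaries, and it repeatedly replaces each unlabeled node's class scores with the weighted average of its neighbors' scores while keeping seed nodes fixed. The function name `propagate_labels`, its parameters, and the toy graph are hypothetical and chosen only for illustration. This is a simplified sketch in the spirit of LP-ZGL, not the implementation evaluated in the paper.

```python
def propagate_labels(graph, seeds, classes, n_iter=200, tol=1e-9):
    """Iterative label propagation with clamped seed labels.

    graph:   {node: {neighbor: weight}}, assumed symmetric and non-negative
    seeds:   {node: class} for the labeled (seed) nodes
    classes: list of class names
    Returns  {node: {class: score}}.
    """
    # Seeds start with score 1.0 for their class; everything else starts at 0.
    scores = {v: {c: 0.0 for c in classes} for v in graph}
    for v, c in seeds.items():
        scores[v][c] = 1.0

    for _ in range(n_iter):
        max_change = 0.0
        new_scores = {}
        for v, nbrs in graph.items():
            total = sum(nbrs.values())
            if v in seeds or total == 0:
                # Preserve training labels; isolated nodes keep their scores.
                new_scores[v] = scores[v]
                continue
            # Smoothness: a node's score is the weighted mean of its neighbors.
            updated = {
                c: sum(w * scores[u][c] for u, w in nbrs.items()) / total
                for c in classes
            }
            change = max(abs(updated[c] - scores[v][c]) for c in classes)
            max_change = max(max_change, change)
            new_scores[v] = updated
        scores = new_scores
        if max_change < tol:
            break
    return scores


# Toy chain a - b - c - d with unit weights; a and d are seeds.
toy_graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
toy_scores = propagate_labels(toy_graph, {"a": "X", "d": "Y"}, ["X", "Y"])
```

On this toy chain, node `b` ends up closer to class `X` and node `c` closer to class `Y`, which is the smoothness behavior described above: nodes take labels similar to those of the nodes they are strongly connected to.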

Adsorption (Baluja et al., WWW 2008) uses an iterative method.

    • The infobox schema for a class was defined by first grouping articles with the same infobox template name and then selecting the most common attributes (those used in more than 15% of the articles).
    • Training data was generated by heuristically selecting, for each attribute, a unique sentence in the document that contains the attribute as the positive sample. The remaining sentences in the document are used as negative samples.
  • Document & Sentence Classification
    • A candidate document is identified heuristically: 1) find list pages that match infobox class keywords, then 2) classify the articles from those list pages based on their category tags.
    • A candidate sentence is identified using a MaxEnt classifier with bagging, with words and their POS tags as features.
  • Attribute Extraction
    • Negative training examples are ignored if the sentence was classified as a candidate sentence in the previous step.
    • Attribute values are identified using CRF, one classifier for each attribute.
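
The schema-selection step in the list above (keep attributes used in more than 15% of the articles that share an infobox template) can be sketched as follows. The function name `select_schema`, its input format, and the toy data are assumptions made for illustration; only the 15% threshold comes from the description above.

```python
from collections import Counter


def select_schema(articles, min_fraction=0.15):
    """Return attributes used in more than `min_fraction` of the articles.

    articles: list of iterables, each holding the infobox attribute names
              of one article (all articles share one template name).
    """
    # Count each attribute at most once per article.
    counts = Counter(attr for attrs in articles for attr in set(attrs))
    threshold = min_fraction * len(articles)
    return sorted(attr for attr, n in counts.items() if n > threshold)


# Toy data: 10 articles for one hypothetical template.
toy_articles = [{"name"} for _ in range(10)]
toy_articles[0] = {"name", "population", "mayor"}
toy_articles[1] = {"name", "population"}
toy_schema = select_schema(toy_articles)
```

Here `population` appears in 2 of 10 articles (20%) and is kept, while `mayor` appears in 1 of 10 (10%) and is dropped.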

Link generation was also done rather heuristically. The evaluation used the Wikipedia data of 2007.02.06.

Related papers

This prototype was later used for the more general task of open-domain information extraction in Wu_and_Weld_ACL_2010.