Talukdar and Pereira ACL 2010

Latest revision as of 15:05, 7 December 2010

Citation

Partha Pratim Talukdar and Fernando Pereira. 2010. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10). Association for Computational Linguistics, Morristown, NJ, USA, 1473-1481.

Online version

ACL Anthology

Summary

This paper conducted an empirical comparison of three graph-based semi-supervised learning (SSL) methods for the Class-Instance Acquisition task.

Motivation

Traditional NER systems have focused on a small number of broad classes such as person and location. In practice, these classes are too coarse to be useful for applications like word sense disambiguation and textual inference. Training data for supervised fine-grained classification is limited, so seed-based information extraction systems have been developed that extract new instances of a class from unstructured text using a few seed instances of that class.

Graph-based semi-supervised methods have been applied successfully to the class-instance acquisition task, but no systematic comparison had been done to show which method performs best.

Methods Compared

Graph-based semi-supervised learning works as follows: given a connectivity graph containing both labeled (seed) and unlabeled nodes, the labels of the labeled nodes are propagated to the unlabeled nodes through the graph, subject to some constraints.

LP-ZGL (Zhu et al., ICML 2003) is the earliest of the three graph-based semi-supervised learning methods compared. It propagates the labels of the training data while ensuring the smoothness of the label assignment and preserving the labels of the training data. Smoothness (the manifold assumption) implies that two strongly connected nodes in the graph should have the same or similar labels.
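The propagation scheme above can be sketched with a minimal iterative update: unlabeled nodes repeatedly take the weighted average of their neighbors' label distributions, while seed labels are clamped. This is a simplified sketch of LP-ZGL-style label propagation, not the paper's exact implementation.

```python
import numpy as np

def label_propagation(W, Y_seed, labeled, n_iter=100):
    """LP-ZGL-style label propagation (simplified sketch).

    W:       (n, n) symmetric nonnegative edge-weight matrix.
    Y_seed:  (n, k) one-hot label matrix; rows for unlabeled nodes are zero.
    labeled: boolean mask of seed nodes whose labels are clamped each step.
    """
    Y = Y_seed.astype(float).copy()
    deg = W.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0              # avoid division by zero on isolated nodes
    for _ in range(n_iter):
        Y = W @ Y / deg              # each node averages its neighbours' labels
        Y[labeled] = Y_seed[labeled] # clamp seed labels (hard constraint)
    return Y
```

On a 4-node path with seeds at the two ends, the two interior nodes converge to mixtures weighted toward the nearer seed, which is exactly the smoothness behavior described above.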

Adsorption (Baluja et al., WWW 2008) uses an iterative update. Its main idea is to limit the amount of information that passes through each node, and it relaxes the constraint of preserving the labels of the training data.

MAD (Talukdar and Crammer, ECML 2009) re-expresses Adsorption as an optimization problem. The labels of the training data are likewise allowed to change.
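The Adsorption-style update can be sketched as each node mixing three sources with per-node probabilities: re-injecting its own seed labels, averaging its neighbors' labels, or abandoning into a dummy "none-of-the-above" label. This is a simplified sketch of the shared Adsorption/MAD update, not the exact MAD objective; the probability vectors are illustrative inputs.

```python
import numpy as np

def adsorption(W, Y_seed, p_inj, p_cont, p_abnd, n_iter=100):
    """Simplified Adsorption-style update (sketch).

    Per-node mixing probabilities (each row sums to 1):
      p_inj  - re-inject the node's own seed labels,
      p_cont - continue the walk and take the neighbour average,
      p_abnd - abandon into a dummy 'none-of-the-above' label,
    which is how Adsorption limits the information flowing through a node.
    """
    n, k = Y_seed.shape
    seed_ext = np.hstack([Y_seed, np.zeros((n, 1))])   # append dummy column
    dummy = np.zeros(k + 1); dummy[k] = 1.0
    deg = W.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    Y = seed_ext.copy()
    for _ in range(n_iter):
        nbr = W @ Y / deg                              # neighbour average
        Y = (p_inj[:, None] * seed_ext
             + p_cont[:, None] * nbr
             + p_abnd[:, None] * dummy[None, :])
    return Y[:, :k]                                    # drop the dummy label
```

Because seed labels are only re-injected with probability p_inj rather than clamped, the training labels can drift, which mirrors the relaxation both Adsorption and MAD make.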

Evaluation

The comparison was done on subsets of Freebase covering 18 domains and on TextRunner (Banko et al., IJCAI 2007) extractions. The evaluation metric is Mean Reciprocal Rank (MRR).
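MRR averages, over queries, the reciprocal of the rank at which the first correct answer appears in each ranked output list. A minimal implementation of the metric:

```python
def mean_reciprocal_rank(ranked_lists, gold_sets):
    """MRR: mean over queries of 1 / (rank of the first correct item).

    ranked_lists: one ranked candidate list per query, best candidate first.
    gold_sets:    one set of correct answers per query.
    A query with no correct candidate contributes 0.
    """
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        rr = 0.0
        for rank, item in enumerate(ranked, start=1):
            if item in gold:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists)
```

For example, if one query's correct instance is ranked 2nd and another's is ranked 3rd, MRR is (1/2 + 1/3) / 2 = 5/12.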

Results

MAD performed significantly better than the other two methods on the Freebase datasets, using classes from (Pantel et al., EMNLP 2009) and WordNet as gold standards. The three methods perform comparably on the TextRunner data using classes from WordNet. Overall, MAD performs best when the average node degree is high.

The authors also investigated the impact of semantic constraints (additional edges between instances and attributes) on these graph-based methods. The constraints come from the YAGO KB (Suchanek et al., WWW 2008). The evaluation shows that these semantic constraints are very beneficial for all three graph-based semi-supervised learning methods.

Related papers

The three graph-based SSL methods compared are described in Zhu et al. (ICML 2003), Baluja et al. (WWW 2008), and Talukdar and Crammer (ECML 2009).