Talukdar and Pereira ACL 2010

Citation

Partha Pratim Talukdar and Fernando Pereira. 2010. Experiments in graph-based semi-supervised learning methods for class-instance acquisition. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10). Association for Computational Linguistics, Morristown, NJ, USA, 1473-1481.

Online version

ACL Anthology

Summary

This paper conducts an empirical comparison of three graph-based semi-supervised learning (SSL) methods on the class-instance acquisition task.

Motivation

Traditional NER systems have focused on a small number of coarse classes such as person and location. These classes are too broad to be useful in practice for applications like word sense disambiguation and textual inference, and labeled training data for supervised, fine-grained classification is scarce. Seed-based information extraction systems have therefore been developed to extract new instances of a class from unstructured text, starting from a few seed instances of that class.

Methods Compared

The general idea of graph-based semi-supervised learning is as follows: given a connectivity graph containing both labeled and unlabeled nodes, the labels of the labeled nodes are propagated to the unlabeled nodes through the graph, subject to some constraints.

LP-ZGL (Zhu et al., ICML 2003) is one of the earliest graph-based semi-supervised learning methods. It propagates the labels of the training data by enforcing smoothness of the label assignment while preserving the training labels. The smoothness (manifold) assumption implies that two strongly connected nodes in the graph should have the same or similar labels.
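
To make the propagation step concrete, the following is a minimal Python sketch of LP-ZGL-style label propagation on a tiny toy graph. The node names, edge weights, and instance/pattern structure are hypothetical and only for illustration; they are not the graphs used in the paper.

```python
# LP-ZGL-style label propagation (a sketch, not the authors' implementation).
# Labeled nodes are clamped to their seed labels; each unlabeled node's scores
# are repeatedly replaced by the weighted average of its neighbors' scores
# (the smoothness/manifold assumption).
from collections import defaultdict

# Undirected weighted toy graph: (node, node) -> weight.  Hypothetical
# instance and extraction-pattern nodes, not taken from the paper.
edges = {
    ("lisbon", "capital-of-X"): 1.0,
    ("tokyo", "capital-of-X"): 1.0,
    ("tokyo", "X-stock-exchange"): 0.5,
    ("nikkei", "X-stock-exchange"): 1.0,
}
graph = defaultdict(dict)
for (u, v), w in edges.items():
    graph[u][v] = w
    graph[v][u] = w

labels = ["City", "Index"]
seeds = {"lisbon": {"City": 1.0}, "nikkei": {"Index": 1.0}}   # labeled (seed) nodes

scores = {n: dict(seeds.get(n, {})) for n in graph}           # initial label scores

for _ in range(30):                                           # iterate to (near) convergence
    new_scores = {}
    for node in graph:
        if node in seeds:                                     # LP-ZGL preserves training labels
            new_scores[node] = dict(seeds[node])
            continue
        total_w = sum(graph[node].values())
        new_scores[node] = {
            lab: sum(w * scores[nb].get(lab, 0.0)
                     for nb, w in graph[node].items()) / total_w
            for lab in labels
        }
    scores = new_scores

print(scores["tokyo"])   # leans toward "City" via the shared "capital-of-X" node
```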

Adsorption (Baluja et al., WWW 2008) instead uses an iterative update. Its main ideas are to limit the amount of information that passes through each node and to relax the constraint that the training labels be preserved exactly.
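
As a rough sketch of the update (following the reformulation of Adsorption given in Talukdar and Crammer, ECML 2009, rather than the exact notation of Baluja et al.), each node v carries three random-walk probabilities that sum to one: p_v^inj (stop and emit the seed label), p_v^cont (continue to a neighbor), and p_v^abnd (abandon and emit a dummy label). The label scores are then updated iteratively roughly as:

```latex
\hat{Y}_v \;\leftarrow\;
  p_v^{\mathrm{inj}} \, Y_v
  \;+\; p_v^{\mathrm{cont}} \, \frac{\sum_{u} W_{uv}\, \hat{Y}_u}{\sum_{u} W_{uv}}
  \;+\; p_v^{\mathrm{abnd}} \, r
```

Here Y_v is the seed label vector of v (zero for unlabeled nodes) and r is the dummy "abandon" label vector. A large p_v^cont lets information flow through v, while a large p_v^abnd limits it, which is how the method controls how much information passes through each node.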

MAD (Modified Adsorption; Talukdar and Crammer, ECML 2009) re-expresses the Adsorption method as an optimization problem with an explicit objective.
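
A hedged reconstruction of that objective (from memory of Talukdar and Crammer, ECML 2009; the exact weighting of the graph-smoothness term differs slightly in the paper) is, summing over labels l:

```latex
\min_{\hat{Y}} \;\sum_{l}
  \Big[\, \mu_1 \,(Y_l - \hat{Y}_l)^{\top} S \,(Y_l - \hat{Y}_l)
  \;+\; \mu_2 \, \hat{Y}_l^{\top} L\, \hat{Y}_l
  \;+\; \mu_3 \,\lVert \hat{Y}_l - R_l \rVert_2^2 \,\Big]
```

The first term penalizes deviating from the seed labels (S is a diagonal matrix of injection probabilities), the second enforces smoothness over the graph through a Laplacian L built from edge weights modulated by the continue probabilities, and the third regularizes the scores toward a prior R encoding the dummy/abandon label, so Adsorption's three behaviors become three explicit, tunable penalties (μ1, μ2, μ3).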

Evaluation

The comparison was done on graphs built from subsets of Freebase spanning 18 domains and from TextRunner (Banko et al., IJCAI 2007) extractions. The evaluation metric used is Mean Reciprocal Rank (MRR).
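
For reference, below is a minimal sketch of how Mean Reciprocal Rank can be computed for this kind of output; the instances, labels, and rankings are made up for illustration.

```python
# Mean Reciprocal Rank (MRR): average over test instances of 1 / rank of the
# gold class in each instance's ranked list of predicted classes.
def mean_reciprocal_rank(predictions, gold):
    """predictions: {instance: [classes ranked best-first]}, gold: {instance: class}."""
    total = 0.0
    for inst, ranked in predictions.items():
        if gold[inst] in ranked:
            total += 1.0 / (ranked.index(gold[inst]) + 1)   # reciprocal of the 1-based rank
        # instances whose gold class never appears in the ranking contribute 0
    return total / len(predictions)

# Toy example (hypothetical data):
preds = {"tokyo": ["City", "Index"], "nikkei": ["City", "Index"]}
golds = {"tokyo": "City", "nikkei": "Index"}
print(mean_reciprocal_rank(preds, golds))   # (1/1 + 1/2) / 2 = 0.75
```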

Results

MAD performed significantly better than the other two methods on the Freebase datasets, using classes from Pantel et al. (EMNLP 2009) and WordNet as the gold standard. The three methods performed comparably on the TextRunner data with classes from WordNet. Overall, the evaluation suggests that MAD performs best when the average degree of the graph nodes is high.

The authors also investigated the impact of semantic constraints, i.e., additional edges between instances and attributes, derived from the YAGO knowledge base (Suchanek et al., WWW 2007). The evaluation shows that these semantic constraints are very beneficial for all of the graph-based semi-supervised learning methods.

Related papers

The three graph-based SSL methods compared in this paper are described in Zhu et al. (ICML 2003), Baluja et al. (WWW 2008), and Talukdar and Crammer (ECML 2009).