Lin and Wu. 2009. Phrase Clustering for Discriminative Learning

{{MyCiteconference | booktitle = Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing| coauthors = X. Wu| date = 2009| first = D.| last = Lin| title = Phrase clustering for discriminative learning| url = http://www.aclweb.org/anthology/P/P09/P09-1116.pdf }}
 
This Paper is available online [http://www.aclweb.org/anthology/P/P09/P09-1116.pdf].
 
== Summary ==
 
This paper makes use of phrase [[UsesMethod::clustering]] to improve on the state of the art for the [[AddressesProblem::Named Entity Recognition]] problem. The authors obtained a 1.0 F-score improvement over state-of-the-art NER systems on the CoNLL benchmark (as of 2009). They used phrases that occur more than 100 times in a 700-billion-token web corpus ([[RelatedPaper::Lin et al., 2008]]).
  
This paper leverages a large amount of unlabeled data to induce phrase [[clustering]], which provides an advantage over the word [[clustering]] features used in current NER systems.
  
 
== Brief description of the method ==
 
Due to the large number of possible phrases, the authors used [[UsesMethod::Bloom filters]] to decide whether a sequence of tokens is considered a phrase.
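
A Bloom filter answers set-membership queries in a small, fixed amount of memory at the cost of occasional false positives, which makes it practical for a phrase list of this size. The sketch below is a minimal illustration of the idea, not the authors' implementation; the bit-array size, number of hash functions, and example phrases are arbitrary assumptions.

<pre>
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m=1 << 20, k=5):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k bit positions from salted MD5 digests of the item.
        for i in range(self.k):
            digest = hashlib.md5(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Hypothetical usage: store known phrases, then test candidate token sequences.
phrases = BloomFilter()
phrases.add("new york city")
print("new york city" in phrases)   # True
print("york city hall" in phrases)  # False (with high probability)
</pre>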
  
 
=== Phrases as feature vectors ===
 
Each phrase is represented as a vector of its context words. The frequency counts of words appearing within a fixed-size window around the phrase are aggregated and converted into [[UsesMethod::pointwise mutual information]] (PMI) values.
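
As a rough illustration of how such context vectors can be built, the toy sketch below counts the words around each phrase occurrence and converts the counts into PMI values; the window size, corpus, and single-token "phrases" are simplifying assumptions, not the authors' setup.

<pre>
import math
from collections import Counter, defaultdict

def pmi_vectors(sentences, phrases, window=3):
    """Build PMI context vectors for phrases from tokenized sentences."""
    cooc = defaultdict(Counter)              # phrase -> context word -> count
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok not in phrases:           # toy: phrases are single tokens here
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    cooc[tok][tokens[j]] += 1

    total = sum(sum(ctx.values()) for ctx in cooc.values())
    phrase_totals = {ph: sum(ctx.values()) for ph, ctx in cooc.items()}
    word_totals = Counter()
    for ctx in cooc.values():
        word_totals.update(ctx)

    # PMI(ph, w) = log( P(ph, w) / (P(ph) * P(w)) ), estimated from the counts.
    return {
        ph: {w: math.log(c * total / (phrase_totals[ph] * word_totals[w]))
             for w, c in ctx.items()}
        for ph, ctx in cooc.items()
    }

# Hypothetical usage with a tiny tokenized corpus.
corpus = [["the", "hotel", "in", "paris", "was", "nice"],
          ["a", "hotel", "in", "rome", "was", "cheap"],
          ["the", "museum", "in", "paris", "was", "crowded"]]
print(pmi_vectors(corpus, phrases={"hotel", "museum"}))
</pre>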
  
 
=== Parallel K-Means using MapReduce ===
 
The phrase vectors are then clustered with the [[UsesMethod::K-means]] [[clustering]] algorithm. The authors chose [[K-means]] over more sophisticated [[clustering]] algorithms because it is fast and easy to parallelize, which is an important consideration given the amount of data involved: running K-means to convergence on their dataset took about 20 minutes on a 1000-machine cluster. The distance between two phrases is simply the Euclidean distance between their feature vectors, so phrases that share more context words are considered more similar.
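
A MapReduce formulation of one K-means iteration is simple: the map step assigns each phrase vector to its nearest centroid, and the reduce step averages the vectors assigned to each centroid. The sketch below simulates the two phases locally in plain Python; it illustrates the scheme rather than the authors' distributed implementation, and the data is random.

<pre>
import numpy as np

def kmeans_iteration(vectors, centroids):
    """One MapReduce-style K-means iteration, simulated in memory."""
    # Map phase: emit (nearest centroid id, vector) for every phrase vector.
    emitted = []
    for v in vectors:
        distances = np.linalg.norm(centroids - v, axis=1)
        emitted.append((int(np.argmin(distances)), v))

    # Reduce phase: group by centroid id and average the grouped vectors.
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        assigned = [v for key, v in emitted if key == k]
        if assigned:
            new_centroids[k] = np.mean(assigned, axis=0)
    return new_centroids

# Hypothetical usage: 1000 random "phrase vectors", 10 clusters, 20 iterations.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
C = X[rng.choice(len(X), size=10, replace=False)]
for _ in range(20):
    C = kmeans_iteration(X, C)
</pre>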
  
=== Soft clustering ===

In addition to hard [[clustering]], where each phrase belongs to exactly one cluster, the authors also performed soft [[clustering]] by assigning a phrase to every cluster centroid within a threshold distance. The motivation is that soft [[clustering]] may better capture the fact that a phrase can have several "senses".
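
The soft assignment step can be sketched as follows: instead of keeping only the nearest centroid, a phrase is assigned to every centroid whose distance falls under a threshold. The threshold value and the fallback to the nearest centroid are illustrative assumptions.

<pre>
import numpy as np

def soft_assign(vector, centroids, threshold):
    """Return the ids of all centroids within `threshold` of the vector."""
    distances = np.linalg.norm(centroids - vector, axis=1)
    near = np.flatnonzero(distances <= threshold)
    # Fall back to the single nearest centroid so every phrase gets a cluster.
    return near.tolist() if near.size else [int(np.argmin(distances))]

# Hypothetical usage: a phrase vector close to two of the three centroids.
centroids = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
print(soft_assign(np.array([0.5, 0.1]), centroids, threshold=0.8))  # [0, 1]
</pre>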
  
 
== Experimental Results ==
 
The effectiveness of phrase [[clustering]] features is first evaluated on [[AddressesProblem::Named Entity Recognition]]. For [[NER]], the authors used a 1-word context window and hard [[clustering]], together with a linear-chain [[UsesMethod::CRF]] that includes the standard features used in current [[NER]] systems; the baseline contains a total of 48 feature templates.
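
Cluster memberships typically enter a linear-chain CRF as extra categorical features on each token, alongside the usual word and context templates. The sketch below uses the sklearn-crfsuite package to show that general pattern; the feature names, cluster lookup, and toy data are illustrative assumptions rather than the authors' 48 feature templates (a real system would also match multi-token phrases against the phrase list).

<pre>
import sklearn_crfsuite  # assumed available; any linear-chain CRF toolkit would do

# Hypothetical token -> cluster id lookup produced by the clustering step.
CLUSTER_ID = {"john": 4, "smith": 4, "paris": 17}

def token_features(tokens, i):
    """Standard-style NER features plus a cluster-membership feature."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "prev.word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }
    cluster = CLUSTER_ID.get(word.lower())
    if cluster is not None:
        feats["cluster"] = str(cluster)   # cluster id as a categorical feature
    return feats

def sentence_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]

# Hypothetical toy training data in BIO format.
X = [sentence_features(["John", "Smith", "visited", "Paris"])]
y = [["B-PER", "I-PER", "O", "B-LOC"]]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
</pre>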
  
 
The results on the [[UsesDataset::CoNLL'03]] test set are as follows:
 
[[Image:conll_results.png]]
 
The authors also evaluated their phrase clustering features on the [[UsesDataset::KDDCup 2005]] [[AddressesProblem::query classification]] competition. The task is to categorize 800,000 internet user search queries into 67 topical categories. They treated the problem as 67 separate binary classification tasks and trained [[UsesMethod::logistic regression]] classifiers with <math>L_2</math> regularization. Their system was on par with the winning KDDCup 2005 system.
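
The one-vs-rest setup described above can be sketched with scikit-learn: one L2-regularized logistic regression per category, all trained on the same feature matrix. The features and labels below are random placeholders rather than the actual query representation, and only a handful of categories are used instead of 67.

<pre>
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: feature vectors for queries and one binary column per category.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))          # query feature vectors
Y = rng.integers(0, 2, size=(200, 5))   # multi-label targets for 5 toy categories

# One independent binary classifier per category, with L2 regularization
# (penalty="l2" is scikit-learn's default).
classifiers = [
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, Y[:, k])
    for k in range(Y.shape[1])
]

# A query is assigned every category whose classifier predicts 1.
query = X[:1]
print([k for k, clf in enumerate(classifiers) if clf.predict(query)[0] == 1])
</pre>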
  
 
[[Image:kddcup_results.png]]
 
 
== Related Papers ==
 
[[RelatedPaper::Lafferty_2001_Conditional_Random_Fields | Lafferty et al. (2001)]] introduced the [[CRF]], the sequence model used in this paper's NER experiments.

[[RelatedPaper::Brown et al, CL 1992]] introduced the [[UsesMethod::Brown clustering]] algorithm, which is frequently used in NER for clustering words. It is based on the idea of minimizing the bigram language model perplexity of a text corpus.

[[RelatedPaper::Miller et al, ACL 2004]] applied a word clustering algorithm and a discriminative training approach to the NER problem.
