''Brown clustering'' is a [[Category::Method|method]] used to create clusters of similar words. It is an instance of a [[Clustering]] algorithm that generates a hierarchical clustering of words.

==Basic Idea==

Every word is assigned a bit-string, with the leftmost bits being the most significant. If a word's most significant bits agree with another word's, the two words belong to a similar category. Picture a binary tree over the vocabulary: the leaves are the words, and each word's bit-string encodes its position in the tree, so words that are close together in the tree have bit-strings that share a long prefix. Another advantage of this method is that the number of clusters can be chosen after the fact: examining only the first <math>k</math> bits of each string yields at most <math>2^k</math> clusters. Using just the first few bits gives very general clusters (roughly part-of-speech granularity), while using all but the last bit gives clusters of at most two words.
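
For instance, here is a minimal Python sketch of how truncating bit-strings to their first <math>k</math> bits controls granularity. The word-to-bit-string table is hypothetical, not induced from a real corpus.

<pre>
# A toy, hypothetical word-to-bit-string table; real tables are induced
# from a corpus by the algorithm described below.
brown_paths = {
    "monday":  "0010",
    "tuesday": "0011",
    "run":     "0110",
    "walk":    "0111",
}

def cluster_id(word, k):
    """Use only the first k bits of a word's path as its cluster label."""
    return brown_paths[word][:k]

# k=2 gives coarse clusters ("00" for the days, "01" for the verbs);
# k=4 (all bits) puts each word in its own cluster.
for k in (2, 4):
    print(k, {w: cluster_id(w, k) for w in brown_paths})
</pre>
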
==Algorithm==

The input is a large corpus of words. The output is a tree of words, each with an assigned bit-string. An example output is shown below in list form.

[[File:sample-bitstrings.png]]

One algorithm proceeds as follows:
* Begin with every word in its own cluster.
* Until only one cluster remains:
** Merge two clusters based on some measure of quality, e.g. the "smallest decrease in likelihood of the text, according to a class-based bigram language model defined on the word clusters" in [[RelatedPaper::Koo et al, ACL 2008]], or average mutual information (see the sketch after this list).
* Assign each word a bit-string based on the clusters it was in and when they were merged: each initial single-word cluster has the empty bit-string, and every time two clusters are merged, the bit-strings in the first cluster become "0" followed by their old bit-string, and those in the second become "1" followed by their old bit-string.
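
Below is a minimal Python sketch of this procedure. The <code>merge_quality</code> function is a toy stand-in; a real implementation scores merges with a class-based bigram language model or average mutual information, which is the computationally expensive part.

<pre>
import itertools

def merge_quality(c1, c2):
    # Hypothetical stand-in: real Brown clustering scores a merge by the
    # change in likelihood of a class-based bigram language model, or by
    # average mutual information.
    return -(len(c1) + len(c2))  # toy heuristic: merge small clusters first

def brown_cluster(words):
    # Begin with every word in its own cluster and an empty bit-string.
    clusters = [[w] for w in words]
    paths = {w: "" for w in words}
    # Until only one cluster remains, merge the best-scoring pair.
    while len(clusters) > 1:
        c1, c2 = max(itertools.combinations(clusters, 2),
                     key=lambda pair: merge_quality(*pair))
        # Prepend "0" to one side and "1" to the other, so bits from
        # later merges (near the root) end up as the most significant.
        for w in c1:
            paths[w] = "0" + paths[w]
        for w in c2:
            paths[w] = "1" + paths[w]
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
    return paths

print(brown_cluster(["monday", "tuesday", "run", "walk"]))
</pre>
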
==Uses==

Brown clusters have been used in language models and as extra features in a wide variety of NLP tasks. Because the number of clusters is variable, the clustering gives fine- or coarse-grained control over how words are grouped, with little work on the part of the user.
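
As a sketch of the feature-extraction side, one common recipe is to emit bit-string prefixes of several lengths as features for each word. The prefix lengths and feature-name scheme below are illustrative assumptions, not a fixed standard.

<pre>
def brown_features(word, brown_paths, prefix_lengths=(4, 6)):
    # Emit one feature per prefix length, plus the full bit-string.
    # The prefix lengths and feature names here are illustrative.
    path = brown_paths.get(word, "")
    feats = [f"brown_{k}={path[:k]}" for k in prefix_lengths if path]
    feats.append(f"brown_full={path}")
    return feats

print(brown_features("monday", {"monday": "001011"}))
# -> ['brown_4=0010', 'brown_6=001011', 'brown_full=001011']
</pre>
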
==Drawbacks==

People have tried to use Brown clustering to group syntactically similar words together. Although the clustering depends heavily on the merging criterion, the resulting clusters usually do not consist of syntactically similar words. [[RelatedPaper::Koo et al, ACL 2008]] found that using both part-of-speech tags and bit-strings made for better features than bit-strings alone, because the bit-strings did not sufficiently differentiate between syntactic functions.

==Resources==

* [http://www.cs.columbia.edu/~mcollins/courses/e6998-3/lectures/lec11.pdf Lecture slides by Collins]
* [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.935&rep=rep1&type=pdf A Word Clustering Approach for Language Model-based Sentence Retrieval in Question Answering Systems], a paper that expands on Brown clustering.