Difference between revisions of "Brown clustering"

Revision as of 03:11, 30 September 2011

Brown clustering is a method used to create clusters of words that are similar. It is an instance of a Clustering algorithm which generates a hierarchical cluster of words.

Basic Idea

Every word is assigned a bit-string. The bits to the left are the most significant. If a word's significant bits agree with another word's, those words will be in a similar category. If you can imagine a tree of words, the leafs are words that are assigned a bit-string, depending on the position in the tree. Words close together will have closer bit-strings. Another advantage of this method is that the number of clusters can be chosen. If you only examine the first $k$ bits of a string, you will generate a cluster of size $2^{k}$ . For example, using the first few bits would give very general clusters (a la part of speech granularity). Using all the bits but one would give clusters of size 2.

Algorithm

The input is a large corpus of words. The output is a tree of words, each with an assigned bit-string. An example output is shown below in list form.

Resources

Lectures slides by Collins
A Word Clustering Approach for Language Model-based Sentence Retrieval in Question Answering Systems is a paper that expands on the Brown clustering.

@@ Line 1: / Line 1: @@
-''Brown clustering'' is a [[Category::Method|method]] used to create clusters of words.
+''Brown clustering'' is a [[Category::Method|method]] used to create clusters of words that are similar. It is an instance of a [[Clustering]] algorithm which generates a hierarchical cluster of words.
+==Basic Idea==
+Every word is assigned a bit-string. The bits to the left are the most significant. If a word's significant bits agree with another word's, those words will be in a similar category. If you can imagine a tree of words, the leafs are words that are assigned a bit-string, depending on the position in the tree. Words close together will have closer bit-strings. Another advantage of this method is that the number of clusters can be chosen. If you only examine the first <math>k</math> bits of a string, you will generate a cluster of size <math>2^k</math>. For example, using the first few bits would give very general clusters (a la part of speech granularity). Using all the bits but one would give clusters of size 2.
+==Algorithm==
+The input is a large corpus of words. The output is a tree of words, each with an assigned bit-string. An example output is shown below in list form.
+[[File:sample-bitstrings.png]]
+==Resources==
+* [http://www.cs.columbia.edu/~mcollins/courses/e6998-3/lectures/lec11.pdf Lectures slides by Collins]
+* [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.935&rep=rep1&type=pdf A Word Clustering Approach for Language Model-based Sentence Retrieval in Question Answering Systems] is a paper that expands on the Brown clustering.

Difference between revisions of "Brown clustering"

Revision as of 03:11, 30 September 2011

Basic Idea

Algorithm

Resources

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools