Difference between revisions of "Co-clustering documents and words using bipartite spectral graph partitioning"

From Cohen Courses
Jump to navigationJump to search
Line 6: Line 6:
  
 
== Summary ==
 
== Summary ==
This is a [[Category::paper]] investigating [[AddressesProblem::the structure of scientific collaboration]]. The author ulitized data from a number of databases in different fields: Biomedical, Physics and Computer Science. Properties of these networks are:
+
Both document clustering and word clustering are well studied problems. Most existing algorithms cluster documents and words separately but not simultaneously. In this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. To solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. The spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the NP-complete graph bipartitioning problem. We present experimental results to verify that the resulting co-clustering algorithm works well in practice.
 
 
* In all cases, scientific communities seem to constitute a ‘‘small world,’’[http://en.wikipedia.org/wiki/Small-world_network] in which the average distance between scientists via a line of intermediate collaborators varies logarithmically with the size of the relevant community.
 
 
 
* Those networks are highly clustered, meaning that two scientists are much more likely to have collaborated if they have a third common collaborator than are two scientists chosen at random from the community.
 
 
 
* Distributions of both the number of collaborators of scientists and the numbers of papers are well fit by power-law forms with an exponential cutoff. This cutoff may be caused by the finite time window (1995-1999) used in the study.
 
 
 
* There are a number of significant statistical differences between different scientific communities. Some of these are obvious.
 

Revision as of 01:52, 28 March 2011

Citation

Inderjit S. Dhillon. 2001. Co-clustering documents and words using bipartite spectral graph partitioning. KDD.

Online Version

http://www.cs.utexas.edu/users/inderjit/public_papers/kdd_bipartite.pdf

Summary

Both document clustering and word clustering are well studied problems. Most existing algorithms cluster documents and words separately but not simultaneously. In this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. To solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. The spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the NP-complete graph bipartitioning problem. We present experimental results to verify that the resulting co-clustering algorithm works well in practice.