Cohn et al, Advances in Neural Information Processing Systems 2001

Citation

Cohn et al. The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity. Advances in Neural Information Processing Systems, 2001.

Online version

NIPS

Summary

This paper presents an approach to jointly modeling document content and link generation. Potential applications include community detection. The basic ideas are:

  • Use a corpus of articles that link to one another, for example web pages with hyperlinks or scientific articles with citations.
  • Build a topic model that jointly models the documents and the citations between them. Both the words and the citations in a document depend on that document's topic proportions.

Brief description of the method

The paper combines document generation and link generation by building on two existing models: the probabilistic version of Latent Semantic Indexing (PLSI, also called PLSA) and the probabilistic version of the HITS algorithm (PHITS). Specifically, both the terms in a document and the links it contains are generated from a document-specific mixing proportion over factors. For practical purposes these factors can be treated as topics, each a multinomial over the entire vocabulary, as in Latent Dirichlet Allocation. The model is fit by writing down the joint likelihood of the corpus and then using Expectation Maximization to estimate the topic-conditional distributions and the mixing proportions of each document.
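
The fitting procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a term-by-document count matrix `N`, a link-by-document matrix `A`, and a weight `alpha` that trades off the content and link parts of the likelihood. The function name, argument names, random initialization, and the use of plain (untempered) EM are all assumptions made for the example.

```python
import numpy as np

def joint_plsa_phits(N, A, n_topics, alpha=0.5, n_iter=50, seed=0):
    """Sketch of EM for a joint content + link factor model.

    N: (V, D) term counts, N[i, j] = count of term i in document j.
    A: (C, D) link counts, A[l, j] = 1 if document j cites target l.
    Returns P(t|z) as (V, K), P(c|z) as (C, K), P(z|d) as (K, D).
    """
    rng = np.random.default_rng(seed)
    eps = 1e-12
    V, D = N.shape
    C = A.shape[0]

    def normalise(M):
        # Make every column sum to one (guarding against empty columns).
        return M / np.maximum(M.sum(axis=0, keepdims=True), eps)

    # Per-document normalisation so long documents do not dominate (assumption).
    Nn, An = normalise(N.astype(float)), normalise(A.astype(float))

    p_t_z = normalise(rng.random((V, n_topics)))
    p_c_z = normalise(rng.random((C, n_topics)))
    p_z_d = normalise(rng.random((n_topics, D)))

    for _ in range(n_iter):
        # E-step, folded in: ratio of observed weight to the model's prediction.
        rt = Nn / np.maximum(p_t_z @ p_z_d, eps)  # (V, D)
        rc = An / np.maximum(p_c_z @ p_z_d, eps)  # (C, D)
        # M-step: expected weights under the factor posteriors, then renormalise.
        new_t = p_t_z * (rt @ p_z_d.T)
        new_c = p_c_z * (rc @ p_z_d.T)
        new_z = p_z_d * (alpha * (p_t_z.T @ rt) + (1 - alpha) * (p_c_z.T @ rc))
        p_t_z, p_c_z, p_z_d = normalise(new_t), normalise(new_c), normalise(new_z)
    return p_t_z, p_c_z, p_z_d
```

With `alpha=1` the sketch ignores the links and reduces to a content-only model, and with `alpha=0` it uses only the links; intermediate values mix the two sources of evidence through the shared mixing proportions P(z|d).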

Experimental Results

The authors used external tasks to evaluate the usefulness of the joint model. The first task was classification of web pages from the WebKB dataset and of abstracts from the Cora network. Classification used a nearest-neighbor method in which proximity was computed with cosine similarity. The joint model showed higher accuracy than either model in isolation; however, no statistical significance testing was carried out. The second task was to predict a quantity called reference flow, which can be used to predict a link between a source and a target document. Compared with a placebo link detector, the joint model performs significantly better.
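
The nearest-neighbor classification step can be illustrated with a short sketch. It assumes each document is represented by a vector (for instance its mixing proportions over factors) and assigns a test document the label of its most cosine-similar training document. The function name, the two-dimensional vectors, and the labels in the usage lines are hypothetical and are not taken from the paper.

```python
import numpy as np

def nearest_neighbor_labels(train_vecs, train_labels, test_vecs):
    """Label each test vector with the label of its most cosine-similar training vector."""
    eps = 1e-12
    train_vecs = np.asarray(train_vecs, dtype=float)
    test_vecs = np.asarray(test_vecs, dtype=float)
    # Scale rows to unit length so a dot product equals cosine similarity.
    tr = train_vecs / np.maximum(np.linalg.norm(train_vecs, axis=1, keepdims=True), eps)
    te = test_vecs / np.maximum(np.linalg.norm(test_vecs, axis=1, keepdims=True), eps)
    sims = te @ tr.T                # (n_test, n_train) cosine similarities
    nearest = sims.argmax(axis=1)   # index of the closest training document
    return [train_labels[i] for i in nearest]

# Hypothetical toy data: two training documents, two test documents.
predicted = nearest_neighbor_labels(
    [[0.9, 0.1], [0.1, 0.9]], ["course", "faculty"],
    [[0.8, 0.2], [0.2, 0.8]],
)
```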

Related papers

An interesting related paper is Cohn, D., ICML 2000, which proposes a latent variable model for citations.