Cohn et al, Advances in Neural Information Processing Systems 2001

Citation

Cohn et al. The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity. Advances in Neural Information Processing Systems, 2001.

Online version

NIPS

Summary

This paper presents an interesting approach for Community Detection. The basic ideas are:

  • Use a corpus of articles that link to one another. Examples include webpages with hyperlinks and scientific articles with citations.
  • Build a topic model that jointly models the documents and the citations between them. Both the words and the citations in a document depend on that document's topic proportions.
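
The second idea can be written as a pair of mixture distributions that share one set of per-document topic proportions. This is only a sketch: the symbols (w_i for a term, c_l for a cited or linked document, z_k for a topic, d_j for a document) are notation assumed here for illustration and may differ from the paper's.

```latex
P(w_i \mid d_j) = \sum_{k} P(w_i \mid z_k)\, P(z_k \mid d_j),
\qquad
P(c_l \mid d_j) = \sum_{k} P(c_l \mid z_k)\, P(z_k \mid d_j)
```

The shared factor P(z_k | d_j) is what ties content and connectivity together: a document's words and its outgoing links are explained by the same topic mixture.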

Brief description of the method

The paper combines document generation and link generation by building on two existing probabilistic models: PLSA, the probabilistic version of LSA, and PHITS, the probabilistic version of the HITS algorithm. More specifically, both the terms in a document and the links in it are generated from a document-specific mixture of factors. For practical purposes, these factors can be treated as topics: each is a multinomial distribution over the entire vocabulary, as in Latent Dirichlet Allocation. Fitting follows the standard approach: write down the joint likelihood of the corpus, then use Expectation Maximization to estimate the topic-conditional distributions and each document's mixing proportions.
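
The EM procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `joint_plsa_em`, the weight `alpha` that balances the term and link parts of the likelihood, the per-document normalisation of counts, and the fixed iteration count are all choices made here for the example.

```python
import numpy as np


def joint_plsa_em(N, A, n_topics, alpha=0.5, n_iter=50, seed=0):
    """Fit a joint term/link mixture model with EM (illustrative sketch).

    N: (n_terms, n_docs) array of term counts per document.
    A: (n_targets, n_docs) array of link counts from each document.
    alpha: assumed weight on the term part (use 0 < alpha < 1).
    Returns P(term | topic), P(target | topic), P(topic | doc).
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = N.shape
    n_targets = A.shape[0]
    eps = 1e-12

    def normalise(M, axis):
        # Scale so that entries sum to 1 along the given axis.
        return M / np.maximum(M.sum(axis=axis, keepdims=True), eps)

    # Weighted, per-document normalised counts (an assumption of this sketch).
    Nw = alpha * normalise(N.astype(float), 0)
    Aw = (1.0 - alpha) * normalise(A.astype(float), 0)

    # Random initialisation of the three multinomial families.
    p_w_z = normalise(rng.random((n_terms, n_topics)), 0)
    p_c_z = normalise(rng.random((n_targets, n_topics)), 0)
    p_z_d = normalise(rng.random((n_topics, n_docs)), 0)

    for _ in range(n_iter):
        # E-step: topic posteriors, shape (items, topics, docs).
        post_w = normalise(p_w_z[:, :, None] * p_z_d[None, :, :], 1)
        post_c = normalise(p_c_z[:, :, None] * p_z_d[None, :, :], 1)

        # Expected (weighted) counts assigned to each topic.
        ew = Nw[:, None, :] * post_w
        ec = Aw[:, None, :] * post_c

        # M-step: re-estimate each distribution from the expected counts.
        p_w_z = normalise(ew.sum(axis=2), 0)
        p_c_z = normalise(ec.sum(axis=2), 0)
        # Terms and links both contribute to the document's topic mixture.
        p_z_d = normalise(ew.sum(axis=0) + ec.sum(axis=0), 0)

    return p_w_z, p_c_z, p_z_d
```

Calling `joint_plsa_em(N, A, n_topics=2)` on a small term-by-document matrix `N` and link-by-document matrix `A` returns three arrays whose columns each sum to one. Because the E-step materialises dense three-dimensional arrays, this form is only practical for toy corpora; a real implementation would iterate over the nonzero counts instead.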

Experimental Result

Human judges evaluated the appropriateness of tags for posts. The system could not outperform the manual tags for the blog posts, which is not surprising. The accuracy of the original tags is also fairly low, which suggests that humans also have trouble tagging blogs; this is understandable, since there is little incentive to tag a blog post accurately for indexing or search. An automated evaluation of 1000 blog posts against the baseline showed that the system beats the baseline on precision but underperforms it on recall. This experiment was carried out on the Technorati dataset.

Related papers

One of the first systems used for tagging was TagIt. An interesting related paper is Mishne, G., WWW 2006, which used collaborative filtering over related blog posts to suggest a set of tags for a target post.