Cohn et al, Advances in Neural Information Processing Systems 2001

Citation

Cohn, D. and Hofmann, T. The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity. Advances in Neural Information Processing Systems (NIPS), 2001.

Online version

NIPS

Summary

This paper presents an interesting approach to jointly modeling document content and link generation. Potential applications include Community Detection. The basic ideas are:

  • Use a corpus of articles that have links between them, such as web pages with hyperlinks or scientific articles with citations.
  • Build a topic model that jointly models the documents along with the citations between them. Both the words and the citations in a document depend on the document's topic proportions (see the generative sketch below).
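
The generative story implied by these ideas can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the names p_z_d (the document's topic proportions), p_w_z (topic-specific word multinomials), and p_c_z (topic-specific citation multinomials) are notational assumptions.

  import numpy as np

  def generate_document(p_z_d, p_w_z, p_c_z, n_words, n_links, rng):
      """Sample one document: every word token and every citation is drawn
      by first picking a topic z from the document's mixing proportions,
      then picking a word (or a cited document) from that topic's
      multinomial distribution."""
      words, links = [], []
      for _ in range(n_words):
          z = rng.choice(len(p_z_d), p=p_z_d)              # topic for this token
          words.append(rng.choice(p_w_z.shape[0], p=p_w_z[:, z]))
      for _ in range(n_links):
          z = rng.choice(len(p_z_d), p=p_z_d)              # topic for this citation
          links.append(rng.choice(p_c_z.shape[0], p=p_c_z[:, z]))
      return words, links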

Brief description of the method

The paper describes a method in which document generation and link generation are combined using the already known probabilistic versions of Latent semantic indexing (PLSA) and the HITS algorithm (PHITS). More specifically, both the terms in a document and the links it contains are generated from a document-specific mixing proportion over factors. For all practical purposes these factors can be regarded as topics, i.e., multinomials over the entire vocabulary, as in Latent Dirichlet Allocation. The standard approach is to write down the joint likelihood of the corpus and then use Expectation Maximization to estimate the topic-conditional distributions and the mixing proportions of each document.
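
A minimal EM sketch of such a joint factor model is given below, assuming the joint log-likelihood weights the word term and the link term by a constant alpha (as in the PLSA/PHITS combination). It is an illustration of the technique under those assumptions, not the authors' implementation; the matrix names and interface are hypothetical.

  import numpy as np

  def em_joint_model(W, C, n_topics, alpha=0.5, n_iter=50, seed=0):
      """EM for a joint content/link factor model (sketch).
      W : (n_docs, n_vocab) term count matrix
      C : (n_docs, n_docs)  citation count matrix, C[d, c] = # times d cites c
      alpha : weight of the word likelihood vs. the link likelihood
      Returns P(w|z), P(c|z), and P(z|d) as column-stochastic arrays."""
      rng = np.random.default_rng(seed)
      n_docs, n_vocab = W.shape
      # random initialization, normalized into valid distributions
      p_w_z = rng.random((n_vocab, n_topics)); p_w_z /= p_w_z.sum(0)
      p_c_z = rng.random((n_docs, n_topics));  p_c_z /= p_c_z.sum(0)
      p_z_d = rng.random((n_topics, n_docs));  p_z_d /= p_z_d.sum(0)
      for _ in range(n_iter):
          new_w = np.zeros_like(p_w_z)
          new_c = np.zeros_like(p_c_z)
          new_z = np.zeros_like(p_z_d)
          for d in range(n_docs):
              # E-step: responsibilities P(z|d,w) and P(z|d,c)
              r_w = p_w_z * p_z_d[:, d]                  # (n_vocab, n_topics)
              r_w /= r_w.sum(1, keepdims=True) + 1e-12
              r_c = p_c_z * p_z_d[:, d]                  # (n_docs, n_topics)
              r_c /= r_c.sum(1, keepdims=True) + 1e-12
              ew = r_w * W[d][:, None]                   # expected word-topic counts
              ec = r_c * C[d][:, None]                   # expected link-topic counts
              new_w += ew
              new_c += ec
              # topic proportions mix both evidence sources via alpha
              new_z[:, d] = alpha * ew.sum(0) + (1 - alpha) * ec.sum(0)
          # M-step: renormalize expected counts into distributions
          p_w_z = new_w / (new_w.sum(0) + 1e-12)
          p_c_z = new_c / (new_c.sum(0) + 1e-12)
          p_z_d = new_z / (new_z.sum(0) + 1e-12)
      return p_w_z, p_c_z, p_z_d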

Experimental Result

The authors used external tasks to verify the usability of the joint model. The first evaluation task was the classification of web pages from the Web KB dataset and of abstracts from the Cora network. The classification was done with a nearest-neighbor method in which proximity was computed using Cosine Similarity. The joint model shows higher accuracy than either model in isolation; however, no statistical significance testing was carried out. The second evaluation task was to predict a quantity called reference flow, which could be used to predict a link between a source and a target document. In comparison to a placebo link detector, the joint model performs significantly better.
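
The nearest-neighbor evaluation can be illustrated with a short sketch: each document is represented by a vector (for example, its estimated topic proportions), and a test document receives the label of the training document with the highest cosine similarity. This is a plausible reconstruction of the setup, not the authors' evaluation code.

  import numpy as np

  def nn_classify(train_vecs, train_labels, test_vecs):
      """1-nearest-neighbor classification with cosine proximity.
      Rows are document vectors, e.g. topic mixing proportions."""
      # normalize rows so that a dot product equals cosine similarity
      tr = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
      te = test_vecs / np.linalg.norm(test_vecs, axis=1, keepdims=True)
      sims = te @ tr.T                       # (n_test, n_train) similarity matrix
      return np.asarray(train_labels)[np.argmax(sims, axis=1)]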

Related papers

An interesting related paper is Cohn, D. ICML 2000, which proposes a latent variable model for citations.