The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity


Citation

   author = {David Cohn and Thomas Hofmann},
   title = {The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity},
   year = {2001}


Online version

http://www.cs.cmu.edu/~cohn/papers/nips00.pdf


Applications

  • Identifying topics and common subjects covered by documents.
  • Identifying authoritative documents on a given topic.
  • Predictive navigation.
  • Web authoring support.

Summary

This paper presents a joint probabilistic model of documents that accounts for both their terms and their citations. The model combines PLSA (for modeling document terms) and PHITS (for modeling the inlinks/citations to a document). The authors' hypothesis is that using information from both a document's contents and its citations yields a better model than either source alone. They verify this with the following experiments:

Classification

They classified documents using the dominant factors assigned to them by the joint model and evaluated the results against human-derived class labels. Classification used a factored nearest-neighbor approach, in which the similarity between two documents is measured by the cosine of their projections in factor space.
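A minimal sketch of such a nearest-neighbor classifier in factor space is given below, assuming the rows of P(z|d) from a trained model are available as matrices; the function and variable names are illustrative, not from the paper.

 import numpy as np

 def classify_nn(train_factors, train_labels, test_factors):
     """Nearest-neighbor classification by cosine similarity in factor space.

     train_factors: (n_train, n_factors) rows of P(z|d) for labeled documents
     test_factors:  (n_test,  n_factors) rows of P(z|d) for unlabeled documents
     Illustrative sketch; names and details are not taken from the paper.
     """
     # L2-normalize the factor vectors so dot products become cosine similarities
     tr = train_factors / np.linalg.norm(train_factors, axis=1, keepdims=True)
     te = test_factors / np.linalg.norm(test_factors, axis=1, keepdims=True)
     sims = te @ tr.T                  # (n_test, n_train) cosine-similarity matrix
     nearest = sims.argmax(axis=1)     # index of the most similar training document
     return [train_labels[i] for i in nearest]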

Results: [figure Resul.png]

Reference Flow

They generate an interesting graph of topics connected by links, which they call a reference flow. It is a graph with topics (induced from the terms) as nodes and hyperlinks/citations as edges, giving a useful visualization of how topics are connected. Because they use the joint model, the reference flow can also contain links between topics that appear in documents which are not themselves directly connected.

They use these reference flows for "intelligent web crawling", with reasonable results. By intelligent web crawling, they mean predicting which links are likely to lead to a document on the desired topics.
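One way such a reference-flow matrix could be accumulated from the model's P(z|d) values and the observed links is sketched below. This is an illustrative soft-count version under those assumptions, not necessarily the paper's exact estimator.

 import numpy as np

 def reference_flow(p_z_given_d, citations):
     """Accumulate soft counts of links from topic z to topic z'.

     p_z_given_d: (n_docs, n_factors) matrix whose rows are P(z|d)
     citations:   iterable of (citing_doc, cited_doc) index pairs
     Returns an (n_factors, n_factors) matrix whose (i, j) entry sums
     P(z_i | citing doc) * P(z_j | cited doc) over all observed links.
     Illustrative sketch; not the paper's exact formulation.
     """
     k = p_z_given_d.shape[1]
     flow = np.zeros((k, k))
     for src, dst in citations:
         flow += np.outer(p_z_given_d[src], p_z_given_d[dst])
     return flow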


Background

They use a variation of LSA called PLSA. Unlike LSA, which performs an SVD and uses the left/right principal eigenvectors as factors, PLSA uses a probabilistic decomposition.
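Up to notation, the two decompositions of the term-document data can be contrasted as follows; this is the standard formulation of each method rather than notation taken from this paper.

 \text{LSA:} \quad N \approx U_k \Sigma_k V_k^\top \qquad \text{(truncated SVD of the term-document matrix)}

 \text{PLSA:} \quad P(w, d) = \sum_z P(z)\, P(w \mid z)\, P(d \mid z) \qquad \text{(non-negative probabilistic factors, fit by EM)}

Because the PLSA factors are proper probability distributions, the resulting representation supports the probabilistic interpretation used throughout the paper.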

It is important to note that their model is different from mixture models. The difference lies in the basic assumption of the two models: mixture models assume that every object comes from one latent source out of a set of sources, whereas the factored model they use assumes that each object comes from a mixture of sources.
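Schematically, and up to notation, the contrast can be written as:

 \text{Mixture model:} \quad P(d) = \sum_z P(z) \prod_{w \in d} P(w \mid z) \qquad \text{(one latent source per document)}

 \text{Factored model:} \quad P(w \mid d) = \sum_z P(w \mid z)\, P(z \mid d) \qquad \text{(each observation has its own latent source)}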

For clustering the hyperlinks, they use PHITS, which is mathematically identical to PLSA with one variation: instead of modeling the citations contained within a document, PHITS models "inlinks," the citations to a document.
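Up to notation, the combined objective of the joint model can be sketched as below, where n(d,w) is the count of term w in document d, A(d,c) records a citation linking d and c, and a weight alpha in [0,1] trades off the two sources of evidence; the exact weighting used in the paper may differ in detail.

 P(w \mid d) = \sum_z P(w \mid z)\, P(z \mid d), \qquad P(c \mid d) = \sum_z P(c \mid z)\, P(z \mid d)

 \mathcal{L} = \sum_d \Big[\, \alpha \sum_w n(d,w) \log P(w \mid d) \;+\; (1-\alpha) \sum_c A(d,c) \log P(c \mid d) \,\Big]

Both components share the same document-specific mixing proportions P(z|d), which is what lets evidence from terms and from citations inform a single set of factors.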

Datasets

  • WebKB - approximately 6000 web pages from computer science departments, classified by school and category (student, course, faculty, etc.).
  • Cora - abstracts and references of approximately 34,000 computer science research papers, of which they used 2000 papers categorized into one of the subfields of machine learning.

Study Plan

  • LSA - S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407, 1990.
  • PLSA - T. Hofmann. Probabilistic latent semantic analysis. In Proceedings of the 15th Conference on Uncertainty in AI, pages 289–296, 1999.
  • PHITS - D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In Proceedings of the 17th International Conference on Machine Learning, 2000.

Related Work

  • K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in hyperlinked environments. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998.
  • L. Getoor, N. Friedman, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In S. Dzeroski and N. Lavrac, editors, Relational Data Mining. Springer-Verlag, 2001.
  • J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
  • Cora - A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval Journal, 3:127–163, 2000.