Yeh et al., WikiWalk: Random Walks on Wikipedia for Semantic Relatedness


This is a Paper summarized for the course Analysis of Social Media 10-802 in Fall 2012.

NOTE: WORK IN PROGRESS!

Citation

E. Yeh, D. Ramage, C. D. Manning, E. Agirre, and A. Soroa, “Wikiwalk: random walks on wikipedia for semantic relatedness,” in Proceedings of TextGraphs-4 (2009 Workshop on Graph-based Methods for Natural Language Processing), Singapore, 2009, pp. 41–49.

Online version

Wikiwalk: Random Walks on Wikipedia for Semantic Relatedness

Summary

In this paper, the authors evaluate a method for computing semantic relatedness between two short texts or terms using Wikipedia as a knowledge base. Building on previous work, the paper aims to exploit both the content of Wikipedia articles and the structure of the links between articles to compute the semantic similarity between the two input texts. The crux of the method is to construct a customized teleport vector for each input text and use it for Personalized PageRank over the Wikipedia link graph. The resulting stationary distribution for each text is then compared using cosine similarity.
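
A minimal sketch of this pipeline (Python), assuming hypothetical helpers build_teleport_vector and personalized_pagerank; these names are placeholders for illustration, not the authors' implementation:

  import numpy as np

  def cosine(u, v):
      # Cosine similarity between two dense distributions over articles.
      denom = np.linalg.norm(u) * np.linalg.norm(v)
      return float(u @ v) / denom if denom > 0 else 0.0

  def relatedness(text_a, text_b, graph, build_teleport_vector, personalized_pagerank):
      # 1. Map each input text to a teleport (personalization) vector over
      #    Wikipedia articles, e.g. by dictionary lookup or ESA-style weights.
      v_a = build_teleport_vector(text_a, graph)
      v_b = build_teleport_vector(text_b, graph)
      # 2. Run Personalized PageRank seeded with each teleport vector.
      pi_a = personalized_pagerank(graph, v_a)
      pi_b = personalized_pagerank(graph, v_b)
      # 3. Compare the resulting stationary distributions.
      return cosine(pi_a, pi_b)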

Evaluation

Related Work

The paper borrows ideas from several earlier papers in developing its measure of semantic relatedness. A few notable ideas used in the paper are:

  1. Personalized PageRank: One way of viewing PageRank (see the original paper) is as the probability that a random surfer lands on a given page after repeatedly following links. However, the graph can contain rank sinks: for example, two nodes that point only to each other would trap the random surfer, who would simply oscillate between them once he reaches either node, so the plain computation does not yield a useful ranking. Hence, in the original PageRank paper, the authors also defined a term that models a jump to a random page while performing the walk on the graph. This jump distribution was taken to be uniform over all pages in the original PageRank algorithm. In the later work on Topic-Sensitive PageRank, however, the random-jump vector is not uniform but is personalized based on the topic or context, which makes it possible to measure the importance of pages with respect to different topics. (A minimal power-iteration sketch of Personalized PageRank follows this list.)
  2. Explicit Semantic Analysis: In the paper on Explicit Semantic Analysis, the authors provide a way of 'wikifying' a given piece of text. Wikifying essentially means mapping the words in the text to Wikipedia article titles (also called Wikipedia concepts). ESA maps the input text to a weighted vector of Wikipedia articles, where the weights between words and concepts are computed using a TF-IDF approach. (A rough sketch of this weighting also follows the list.)
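
To make item 1 concrete, here is a minimal power-iteration sketch of Personalized PageRank (Python); it assumes the link graph is given as an out-link adjacency list and is only an illustration of the idea, not the paper's implementation:

  import numpy as np

  def personalized_pagerank(out_links, teleport, alpha=0.85, iters=100, tol=1e-10):
      # out_links: dict node id -> list of distinct successor node ids (0..n-1),
      #            with an entry for every node (empty list for dangling nodes).
      # teleport:  length-n non-negative vector; uniform gives plain PageRank,
      #            a text-specific vector gives the personalized variant.
      t = np.asarray(teleport, dtype=float)
      t = t / t.sum()                      # normalize the personalization vector
      pi = t.copy()                        # start the walk at the teleport distribution
      n = len(t)
      for _ in range(iters):
          nxt = np.zeros(n)
          for u, succs in out_links.items():
              if succs:                    # spread rank along out-links
                  nxt[np.asarray(succs)] += pi[u] / len(succs)
              else:                        # dangling node: rank returns to the teleport vector
                  nxt += pi[u] * t
          nxt = alpha * nxt + (1 - alpha) * t   # random jump to the teleport vector
          delta = np.abs(nxt - pi).sum()
          pi = nxt
          if delta < tol:
              break
      return pi

With a uniform teleport vector this reduces to the original PageRank; concentrating the teleport mass on the Wikipedia articles associated with the input text gives the topic-sensitive / personalized variant used in the paper.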
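For item 2, a rough sketch of ESA-style concept weighting (Python), assuming we have the plain text of each Wikipedia article; the weights below are plain TF-IDF, which only approximates the original ESA setup:

  import math
  from collections import Counter, defaultdict

  def build_concept_index(articles):
      # articles: dict article_title -> list of tokens.
      # Returns an inverted index word -> {article_title: tf-idf weight}.
      df = Counter()
      per_article_tf = {}
      for title, tokens in articles.items():
          tf = Counter(tokens)
          per_article_tf[title] = tf
          df.update(tf.keys())
      n_docs = len(articles)
      index = defaultdict(dict)
      for title, tf in per_article_tf.items():
          for word, count in tf.items():
              index[word][title] = count * math.log(n_docs / df[word])
      return index

  def text_to_concepts(tokens, index):
      # Map a short text to a weighted vector of Wikipedia concepts by summing
      # the concept vectors of its words.
      vec = defaultdict(float)
      for word in tokens:
          for title, weight in index.get(word, {}).items():
              vec[title] += weight
      return dict(vec)

In the WikiWalk setting, such a weighted concept vector is a natural candidate for the teleport vector fed into Personalized PageRank.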


Main Ideas

This paper poses two important social problems related to bipartite social graphs and explains how those problems can be solved efficiently using random walks.

They also claim that the neighborhoods over nodes can represent personalized clusters, depending on the perspective.

During the presentation, an audience member asked whether the anomaly detection in this paper is similar to the betweenness of edges defined in Kleinberg's textbook, as discussed in the Class Meeting for 10-802 01/26/2010. I think they are similar. The textbook proposes detecting edges with high betweenness and removing them to partition the graph. In this paper, neighborhood partitions are first created based on random-walk probabilities, which as a by-product yields the nodes and edges with high betweenness. (A toy example of betweenness-based partitioning follows.)
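
As a side note, a small illustration of the betweenness-based partitioning mentioned above (Python, using networkx's implementation of the Girvan-Newman method); this is only a toy example, not code from the paper:

  import networkx as nx
  from networkx.algorithms.community import girvan_newman

  G = nx.karate_club_graph()               # standard toy social graph
  first_split = next(girvan_newman(G))     # remove highest-betweenness edges until the graph splits
  print([sorted(c) for c in first_split])  # two communities of node ids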

Related papers

There has been a lot of work on anomaly detection in graphs.

  • The paper by Moonesinghe and Tan (ICTAI 2006) finds clusters of outlier objects by performing random walks on a weighted graph.
  • The paper by Aggarwal (SIGMOD 2001) proposes techniques for projecting high-dimensional data onto lower-dimensional subspaces to detect outliers.

Study plan

  • Article:Bipartite graph:[1]
  • Article:Anomaly detection:[2]
  • Paper:Topic-Sensitive PageRank:[3]
    • Paper:The PageRank Citation Ranking: Bringing Order to the Web:[4]
  • Paper:Multilevel k-way Partitioning Scheme for Irregular Graphs:[5]