Yeh et al., WikiWalk: Random Walks on Wikipedia for Semantic Relatedness


This is a Paper summarized for the course Analysis of Social Media 10-802 in Fall 2012.

NOTE: WORK IN PROGRESS!

Citation

E. Yeh, D. Ramage, C. D. Manning, E. Agirre, and A. Soroa, “Wikiwalk: random walks on wikipedia for semantic relatedness,” in Proceedings of TextGraphs-4 (2009 Workshop on Graph-based Methods for Natural Language Processing), Singapore, 2009, pp. 41–49.

Online version

Wikiwalk: Random Walks on Wikipedia for Semantic Relatedness

Summary

In this paper, the authors evaluate a method for computing the semantic relatedness between two short texts or terms using Wikipedia as a knowledge base. Building on previous work, the paper aims to exploit both the content of Wikipedia articles and the structure of the links between them. The crux of the method is to create a customized teleport vector for each input text and use it for Personalized PageRank. The resulting stationary distributions for the two texts are then compared using cosine similarity.
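
As a concrete illustration of the comparison step, here is a minimal Python sketch of the cosine similarity between two stationary distributions, each represented as a sparse mapping from articles to probability mass. The example distributions and the function name are illustrative, not taken from the paper's implementation.

    import math

    def cosine(p, q):
        """Cosine similarity between two sparse distributions,
        each a {article: probability} dict."""
        dot = sum(mass * q[a] for a, mass in p.items() if a in q)
        norm_p = math.sqrt(sum(v * v for v in p.values()))
        norm_q = math.sqrt(sum(v * v for v in q.values()))
        if norm_p == 0.0 or norm_q == 0.0:
            return 0.0
        return dot / (norm_p * norm_q)

    # Hypothetical stationary distributions for two input texts.
    walk_a = {"Dog": 0.5, "Pet": 0.3, "Wolf": 0.2}
    walk_b = {"Cat": 0.6, "Pet": 0.4}
    print(cosine(walk_a, walk_b))  # relatedness score in [0, 1]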

Evaluation

Related Work

The paper borrows ideas from many different papers in developing a way of measuring semantic relatedness. A few notable ideas used in the paper are:

  1. Personalized PageRank: One way of viewing PageRank (paper) is as the probability that a random surfer lands on a page after following links from page to page. However, the graph can contain sinks. For example, two nodes pointing only to each other would leave the surfer oscillating between them once he lands on either one, and the walk would never mix over the rest of the graph. Hence, in the original PageRank paper the authors also define a term that models a jump to a random page during the walk; in the original algorithm this jump distribution is uniform over all pages. In the later work on Topic-Sensitive PageRank, the random-jump vector is instead personalized to a topic or context, which makes it possible to measure the importance of pages with respect to different topics (a sketch of the personalized walk follows this list).
  2. Explicit Semantic Analysis: In the paper on Explicit Semantic Analysis, the authors provide a way of 'wikifying' a given piece of text. Wikifying essentially means replacing the words in the text with Wikipedia article titles (also called Wikipedia concepts). ESA maps the input text to a weighted vector of Wikipedia articles, where the weights between words and concepts are calculated using a TF-IDF approach.
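
To make the personalized walk concrete, the following Python sketch runs power iteration for Personalized PageRank on a toy adjacency list. The graph, the damping value, and the function name are assumptions for illustration; mass reaching a sink is returned to the teleport distribution, which is one common way of handling the oscillation/sink problem described above.

    def personalized_pagerank(out_links, teleport, alpha=0.15, iters=50):
        """Power iteration for Personalized PageRank.
        out_links: {node: [neighbors]}; teleport: {node: mass} summing to 1.
        With probability alpha the walker jumps back to the teleport
        distribution, which also absorbs mass stuck in sinks."""
        p = dict(teleport)
        for _ in range(iters):
            nxt = {n: alpha * m for n, m in teleport.items()}
            for node, mass in p.items():
                links = out_links.get(node, [])
                if links:  # spread mass over out-links
                    share = (1 - alpha) * mass / len(links)
                    for nb in links:
                        nxt[nb] = nxt.get(nb, 0.0) + share
                else:      # sink: return mass to the teleport vector
                    for n, m in teleport.items():
                        nxt[n] = nxt.get(n, 0.0) + (1 - alpha) * mass * m
            p = nxt
        return p

    # Toy example: two pages linking to each other, plus a sink page.
    graph = {"A": ["B"], "B": ["A", "C"], "C": []}
    print(personalized_pagerank(graph, {"A": 1.0}))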


Main Ideas

To perform a personalized random walk on Wikipedia, two essential components are needed:

  1. How to construct the Wikipedia graph.
  2. How to construct the teleport vector.

Constructing Wikipedia Graph

The obvious way to create the graph is to treat articles as nodes and the hyperlinks between them as edges. This approach has been used in many papers, and the data is available from DBpedia (http://wiki.dbpedia.org/Datasets) or Wikipedia-Miner (http://wikipedia-miner.cms.waikato.ac.nz/). The paper uses this construction but distinguishes between infobox links (http://en.wikipedia.org/wiki/Help:Infobox), category links (http://en.wikipedia.org/wiki/Help:Category), and links in the article text.

Links are also filtered based on the concept of generality introduced in an earlier paper (http://www.jair.org/media/2669/live-2669-4346-jair.pdf). Some techniques for pruning the graph are mentioned as well, such as ignoring articles with very few stop words or very few links.
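
The following sketch shows how such a typed link graph might be assembled and pruned. The edge list, the link-type names, and the degree threshold are illustrative assumptions, not the paper's actual data or settings.

    from collections import defaultdict

    # Hypothetical dump of (source, target, link_type) triples.
    edges = [
        ("Dog", "Wolf", "text"),
        ("Dog", "Carnivora", "infobox"),
        ("Dog", "Domesticated animals", "category"),
        ("Wolf", "Dog", "text"),
    ]

    def build_graph(edges, keep_types=("text", "infobox", "category")):
        """Adjacency lists, keeping only the selected link types so the
        contribution of each type can be evaluated separately."""
        out_links = defaultdict(list)
        for src, dst, kind in edges:
            if kind in keep_types:
                out_links[src].append(dst)
        return out_links

    def prune(out_links, min_degree=2):
        """Drop articles with very few links (illustrative threshold)."""
        return {n: ls for n, ls in out_links.items() if len(ls) >= min_degree}

    graph = build_graph(edges, keep_types=("text", "infobox"))
    print(prune(graph, min_degree=1))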

Teleportation Vector

The teleport vector represents the initial distribution of mass over articles for a given input text. The paper discusses two ways to initialize the teleport vector.

Dictionary Based Initialization

The paper explores different ways of creating a dictionary from words to Wikipedia articles. To summarize, these include using article titles directly, redirect and disambiguation pages, and anchor text (http://en.wikipedia.org/wiki/Help:Anchor#Section_linking_.28anchors.29) with various constraints on the fraction of times a phrase occurs as an anchor, among others.
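
A minimal sketch of dictionary-based initialization, assuming a pre-built word-to-articles lookup (from titles, redirects, anchors, etc.): each token's mass is spread uniformly over its matching articles, and the result is normalized into a probability distribution. The tokenizer and the sample dictionary are simplifications.

    def teleport_from_dictionary(text, word_to_articles):
        """Spread each token's mass uniformly over the articles it
        maps to, then normalize into a probability distribution."""
        weights = {}
        for tok in text.lower().split():
            articles = word_to_articles.get(tok, [])
            for art in articles:
                weights[art] = weights.get(art, 0.0) + 1.0 / len(articles)
        total = sum(weights.values())
        return {a: w / total for a, w in weights.items()} if total else {}

    # Hypothetical dictionary built from titles, redirects and anchors.
    lookup = {"dog": ["Dog"], "bank": ["Bank", "Bank (geography)"]}
    print(teleport_from_dictionary("dog near the bank", lookup))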

ESA Based Initialization

ESA is a way to convert a given text into a weighted vector of Wikipedia articles. It uses the text of the article bodies, rather than anchor text or article titles, to compute the word-to-concept weights, using a TF-IDF scheme. ESA was successfully used to compute semantic relatedness in an earlier paper (http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-259.pdf), and hence provides a good teleport vector for the given text.
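
A toy sketch of ESA-style initialization, assuming a tiny in-memory corpus stands in for Wikipedia: each word in the input text votes for articles in proportion to its TF-IDF weight in the article body. To use the result as a teleport vector, it would still need to be normalized to sum to 1, as in the dictionary-based sketch above.

    import math
    from collections import Counter

    # Hypothetical article bodies standing in for the Wikipedia corpus.
    articles = {
        "Dog":  "dog domestic animal bark pet",
        "Wolf": "wolf wild animal pack howl",
        "Bank": "bank money loan account finance",
    }

    def esa_vector(text, articles):
        """Map a text to a weighted vector of articles (ESA-style):
        each word contributes its TF-IDF weight in each article body."""
        n = len(articles)
        tf = {t: Counter(body.split()) for t, body in articles.items()}
        df = Counter()
        for body in articles.values():
            df.update(set(body.split()))
        vec = {}
        for word in text.lower().split():
            if df[word] == 0:
                continue  # word never appears in the corpus
            idf = math.log(n / df[word])
            for title in articles:
                w = tf[title][word] * idf
                if w > 0:
                    vec[title] = vec.get(title, 0.0) + w
        return vec

    print(esa_vector("animal pet", articles))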

Related papers

There has been a lot of work on anomaly detection in graphs.

  • The paper by Moonesinghe and Tan ICTAI06 finds clusters of outlier objects by performing random walks on a weighted graph.
  • The paper by Aggarwal SIGMOD 2001 proposes techniques for projecting high-dimensional data onto lower dimensions to detect outliers.

Study plan

  • Article: Bipartite graph: [1]
  • Article: Anomaly detection: [2]
  • Paper: Topic-Sensitive PageRank: [3]
    • Paper: The PageRank Citation Ranking: Bringing Order to the Web: [4]
  • Paper: Multilevel k-way Partitioning Scheme for Irregular Graphs: [5]