Yeh et al., WikiWalk: Random Walks on Wikipedia for Semantic Relatedness

This is a paper summary for the course Analysis of Social Media 10-802 in Fall 2012.

Citation

E. Yeh, D. Ramage, C. D. Manning, E. Agirre, and A. Soroa, “Wikiwalk: random walks on wikipedia for semantic relatedness,” in Proceedings of TextGraphs-4 (2009 Workshop on Graph-based Methods for Natural Language Processing), Singapore, 2009, pp. 41–49.

Online version

Wikiwalk: Random Walks on Wikipedia for Semantic Relatedness

Summary

In this paper, the authors evaluate a method for computing the semantic relatedness of two short texts or terms using Wikipedia as a knowledge base. Building on previous work, the paper aims to use both the content of Wikipedia articles and the structure of the links between articles to compute the semantic relatedness of the two input texts. The crux of the method is to create a customized teleport vector for each input text and use it for Personalized PageRank. The resulting stationary distributions for the two texts are then compared using cosine similarity.
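As a rough illustration of this pipeline, here is a minimal sketch on a toy graph, assuming hand-built teleport vectors; the article names, damping value, and helper functions below are illustrative and not taken from the paper.

import numpy as np

def personalized_pagerank(M, teleport, alpha=0.15, iters=100):
    """Power iteration: r = (1 - alpha) * M @ r + alpha * teleport."""
    r = teleport.copy()
    for _ in range(iters):
        r = (1 - alpha) * M @ r + alpha * teleport
    return r

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "Wikipedia" with 4 articles; M[i, j] = probability of following a link
# from article j to article i (column-stochastic link matrix).
# Articles: 0=Dog, 1=Cat, 2=Pet, 3=Car (illustrative only).
M = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

# Teleport vectors built from the two input texts (mass on matched articles).
v_text1 = np.array([1.0, 0.0, 0.0, 0.0])   # text mentioning "dog"
v_text2 = np.array([0.0, 1.0, 0.0, 0.0])   # text mentioning "cat"

r1 = personalized_pagerank(M, v_text1)
r2 = personalized_pagerank(M, v_text2)
print("relatedness:", cosine(r1, r2))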

Important concepts from previous works

The following ideas are used in developing a way of measuring semantic relatedness:

  1. Personalized PageRank: One way of viewing PageRank, as described in the original paper, is as the probability that a random surfer lands on a page after following links from page to page. However, there can be sinks and traps in the graph. For example, two nodes that point only at each other would leave the random surfer oscillating between them once it lands on either node, and computing a meaningful PageRank would not be possible. Hence, in the original PageRank paper, the authors also defined a term that models a jump to a random page while performing the walk. This term was taken to be uniform over all pages in the original algorithm. In a later work, Topic-Sensitive PageRank, the random-jump vector (teleport vector) is not assumed to be uniform but is instead personalized based on the topic or context, which makes it possible to measure the importance of pages with respect to different topics. The corresponding update rule is sketched after this list.
  2. Explicit Semantic Analysis: In the Explicit Semantic Analysis (ESA) paper, the authors provide a way of 'wikifying' a given piece of text, i.e., mapping the words in the text to Wikipedia article titles (also called Wikipedia concepts). ESA maps the input text to a weighted vector of Wikipedia articles, where the weights between words and concepts are computed using a TF-IDF approach.
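For reference, the Personalized PageRank update described in item 1 can be written as (standard formulation; the notation is ours, not the paper's):

r^{(t+1)} = (1 - \alpha) \, M \, r^{(t)} + \alpha \, v

where M is the column-stochastic link matrix of the graph, v is the teleport vector (uniform in the original PageRank, concentrated on context-relevant pages in the personalized variant), and \alpha is the probability of jumping according to v instead of following a link.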

Main Ideas

To perform a personalized random walk on Wikipedia, the following two components are essential:

  1. How to construct the Wikipedia graph.
  2. How to construct the teleport vector.

Constructing Wikipedia Graph

The obvious way of creating the graph is to treat articles as nodes and hyperlinks as edges. This approach has been used in many previous papers, and the data is available in resources such as DBpedia or Wikipedia Miner. The paper uses this construction but distinguishes between infobox links, category links, and links in the article text, and experiments with different combinations of link types as edges.

Links are also filtered based on the concept of generality introduced in this paper. Some techniques for pruning the graph are also mentioned, such as ignoring articles with very few stop words and very few links.
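A rough sketch of this graph construction, assuming link records of the form (source, target, link_type) have been extracted from a Wikipedia dump; the link-type labels and the build_graph helper are illustrative, not the paper's code.

from collections import defaultdict

# Each record: (source_article, target_article, link_type), where link_type
# is one of "text", "infobox", or "category" (illustrative labels).
links = [
    ("Dog", "Pet", "text"),
    ("Dog", "Canidae", "infobox"),
    ("Dog", "Domesticated animals", "category"),
    ("Cat", "Pet", "text"),
]

def build_graph(links, allowed_types=("text",)):
    """Keep only edges whose link type is in allowed_types, e.g. article-text
    links only, or text + infobox + category links."""
    graph = defaultdict(set)
    for src, dst, link_type in links:
        if link_type in allowed_types:
            graph[src].add(dst)
    return graph

text_only = build_graph(links)
all_links = build_graph(links, ("text", "infobox", "category"))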

Teleportation Vector

The teleport vector represents the initial vector of mass over articles for a given input text. The paper discusses two ways of initializing the teleport vector.

Dictionary Based Initialization

Different ways of creating a dictionary from words to Wikipedia articles are explored in the paper. To summarize, these include using article titles directly, redirect and disambiguation pages, and anchor text with different constraints on the percentage of anchors, among others.
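A simplified sketch of dictionary-based initialization, assuming such a word-to-articles dictionary has already been built; the dictionary contents and the dictionary_teleport helper are made up for illustration.

import numpy as np

# Hypothetical dictionary mapping surface words to Wikipedia articles.
article_index = {"Dog": 0, "Cat": 1, "Pet": 2, "Car": 3}
dictionary = {
    "dog": ["Dog"],
    "puppy": ["Dog"],
    "cat": ["Cat"],
    "pet": ["Pet"],
}

def dictionary_teleport(text, dictionary, article_index):
    """Spread teleport mass uniformly over all articles matched by the words
    of the input text, then normalize to a probability distribution."""
    v = np.zeros(len(article_index))
    for word in text.lower().split():
        for title in dictionary.get(word, []):
            v[article_index[title]] += 1.0
    return v / v.sum() if v.sum() > 0 else v

print(dictionary_teleport("the dog is a pet", dictionary, article_index))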

ESA based initialization

ESA is a way of converting a given text into a weighted vector of Wikipedia articles. It uses the text of the article bodies, rather than anchor text or article titles, to compute the word-to-concept weights, using a TF-IDF scheme. ESA was used successfully to compute semantic relatedness in its original paper, and the resulting concept vector is therefore a good teleport vector for the given text.
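A toy sketch of ESA-style initialization, assuming plain-text article bodies are available; the TF-IDF weighting below is a simplified stand-in for the actual ESA scheme, and the esa_teleport helper is illustrative.

import math
from collections import Counter

import numpy as np

# Toy article bodies (illustrative).
articles = {
    "Dog": "dog domestic animal dog bark pet",
    "Cat": "cat domestic animal cat meow pet",
    "Car": "car vehicle engine wheel",
}

titles = list(articles)
doc_freq = Counter(w for body in articles.values() for w in set(body.split()))
N = len(articles)

def esa_teleport(text):
    """Sum TF-IDF word-to-concept weights over the words of the input text,
    then normalize the concept vector into a teleport distribution."""
    v = np.zeros(len(titles))
    for word in text.lower().split():
        if word not in doc_freq:
            continue
        idf = math.log(N / doc_freq[word])
        for i, title in enumerate(titles):
            tf = articles[title].split().count(word)
            v[i] += tf * idf
    return v / v.sum() if v.sum() > 0 else v

print(esa_teleport("a barking dog"))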

Experiments

Many experiments were conducted using different strategies for building the Wikipedia graph and the teleport vector. The paper measures the performance of these configurations on standard datasets used for evaluating semantic relatedness (WordSim-353, the Miller-Charles dataset, and the Lee et al. dataset).
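Performance on such datasets is usually reported as the correlation between system scores and human relatedness judgments, e.g. a rank correlation; a minimal sketch (the scores below are made up):

from scipy.stats import spearmanr

# Gold human relatedness judgments and system scores for the same pairs
# (values are made up for illustration).
gold   = [9.8, 7.4, 3.1, 0.9]
system = [0.92, 0.65, 0.40, 0.05]

rho, _ = spearmanr(gold, system)
print("Spearman correlation:", rho)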

DataSets / Resources

Related papers


Study plan