Yeh et al., WikiWalk: Random Walks on Wikipedia for Semantic Relatedness

This is a [[Category::Paper]] summarized for the course Analysis of Social Media 10-802 in Fall 2012.  
 
== Citation ==
 
E. Yeh, D. Ramage, C. D. Manning, E. Agirre, and A. Soroa, "WikiWalk: Random Walks on Wikipedia for Semantic Relatedness," in Proceedings of TextGraphs-4 (2009 Workshop on Graph-based Methods for Natural Language Processing), Singapore, 2009, pp. 41–49.

== Online version ==

WikiWalk: Random Walks on Wikipedia for Semantic Relatedness

== Summary ==

In this paper, the authors evaluate a method for computing the semantic relatedness between two short texts or terms using Wikipedia as a knowledge base. Building on previous work, the paper aims to use both the content of Wikipedia articles and the structure of the links between articles to compute the semantic similarity between the two input texts. The crux of the method is to build a customized teleport vector for each input text and use it for personalized PageRank over the Wikipedia link graph. The resulting stationary distribution for each text is then compared using cosine similarity.
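
The overall pipeline can be made concrete with a short Python sketch. Nothing here is taken from the authors' code: the graph, the teleport-vector builder, and the walk itself are passed in as placeholders that correspond to the components described in the rest of this page.

<pre>
import math

def cosine(p, q):
    # Cosine similarity of two sparse distributions stored as dicts
    # mapping article -> probability mass.
    dot = sum(w * q.get(node, 0.0) for node, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def relatedness(text_a, text_b, graph, teleport_fn, walk_fn):
    # teleport_fn: text -> initial mass over articles (see the
    # "Teleportation Vector" section below); walk_fn: (graph, teleport
    # vector) -> stationary distribution (see the personalized PageRank
    # sketch below).
    p_a = walk_fn(graph, teleport_fn(text_a))
    p_b = walk_fn(graph, teleport_fn(text_b))
    return cosine(p_a, p_b)
</pre>
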
 
In this paper, the authors evaluate a method to find semantic relatedness between 2 pieces of short text / terms using the knowledge base of wikipedia. Based on the previous works, the paper aims to make use of both the content of wikipedia articles and the structure of links between different articles to compute the semantic similarity between the 2 input texts. The crux of the method is to create a customized teleport vector for the input texts and use it for personalized page rank. The resulting static distribution for each text is then compared using cosine similarity.
  
== Important concepts from previous works ==
The following ideas are used in developing a way of measuring semantic relatedness:
# Personalized Page Rank: One way of viewing [http://en.wikipedia.org/wiki/PageRank PageRank] ([http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf paper]) is as the probability that a random surfer lands on a page after following clicks from page to page. However, there can be sinks in the graph: for example, two nodes that point only to each other would trap the surfer, who would simply oscillate between them once he lands on either of them, and the PageRank computation would degenerate. Hence, in the original PageRank paper, the authors also defined a term that models a jump to a random page during the walk. This jump distribution was taken to be uniform over all pages in the original algorithm. In one of the later works, [http://www-cs-students.stanford.edu/~taherh/papers/topic-sensitive-pagerank.pdf Topic Sensitive Page Rank], the random jump vector (teleport vector) is not assumed to be uniform but is instead personalized based on the topic or the context. This makes it possible to measure the importance of pages with respect to different topics (a small code sketch of such a personalized walk follows this list).
# Explicit Semantic Analysis: In the paper on [http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-259.pdf Explicit Semantic Analysis], the authors provide a way of 'wikifying' a given piece of text. Wikifying essentially means replacing the words in the text with Wikipedia article titles (also called Wikipedia concepts). ESA maps the input text to a weighted vector of Wikipedia articles, where the weights between words and concepts are computed with a TF-IDF approach.
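
As a concrete illustration of the role of the teleport vector, here is a minimal personalized PageRank sketch (plain power iteration over an adjacency-list graph). The damping factor, convergence tolerance, and the handling of dangling pages via the teleport vector are illustrative defaults, not settings reported in the paper.

<pre>
def personalized_pagerank(graph, teleport, damping=0.85, tol=1e-8, max_iter=100):
    # graph: dict node -> list of out-neighbours (every neighbour also
    # appears as a key); teleport: dict node -> mass, summing to 1.
    nodes = list(graph)
    p = {n: teleport.get(n, 0.0) for n in nodes}      # start from the teleport vector
    for _ in range(max_iter):
        new_p = {n: 0.0 for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:                                   # spread rank along out-links
                share = damping * p[n] / len(out)
                for m in out:
                    new_p[m] += share
            else:                                     # dangling page: jump via the teleport vector
                for m, w in teleport.items():
                    new_p[m] += damping * p[n] * w
        for n, w in teleport.items():                 # random jump, biased by the teleport vector
            new_p[n] += (1.0 - damping) * w
        if sum(abs(new_p[n] - p[n]) for n in nodes) < tol:
            return new_p
        p = new_p
    return p

# Tiny example: the walk is biased toward "Jaguar" and its neighbours.
toy_graph = {"Jaguar": ["Felidae", "Jaguar Cars"], "Felidae": ["Jaguar"], "Jaguar Cars": []}
print(personalized_pagerank(toy_graph, {"Jaguar": 1.0}))
</pre>

With a uniform teleport vector this reduces to ordinary PageRank; concentrating the teleport mass on the articles mentioned in a text biases the stationary distribution toward that text's neighbourhood of Wikipedia.
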
== Main Ideas ==
To perform a personalized random walk on Wikipedia, two components are essential:
# how to construct the Wikipedia graph, and
# how to construct the teleport vector.

===Constructing Wikipedia Graph===
The obvious way of creating the graph is to treat Wikipedia articles as nodes and the hyperlinks between them as edges. This approach has been used in many papers, and the data is available from either [http://wiki.dbpedia.org/Datasets DBPedia] or [http://wikipedia-miner.cms.waikato.ac.nz/ Wikipedia-Miner]. The paper uses this construction but distinguishes between [http://en.wikipedia.org/wiki/Help:Infobox infobox] links, [http://en.wikipedia.org/wiki/Help:Category category] links, and links in the article text, and experiments with different combinations of link types as edges.
  
Links are also filtered based on the concept of generality introduced in this [http://www.jair.org/media/2669/live-2669-4346-jair.pdf paper]. Some techniques for pruning the graph are also mentioned, such as ignoring articles with very few non-stop words or very few links.
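
A sketch of this kind of graph construction, assuming the links are available as (source, target, link type) triples; the triple format, type names, and pruning threshold below are illustrative choices, not the format of the DBPedia or Wikipedia-Miner dumps.

<pre>
from collections import defaultdict

def build_graph(edges, keep_types=("text", "infobox", "category")):
    # edges: iterable of (source_article, target_article, link_type) triples.
    # Keeping different subsets of link types gives the different edge
    # combinations compared in the experiments.
    graph = defaultdict(list)
    for src, dst, link_type in edges:
        graph[dst]                          # ensure the target exists as a node
        if link_type in keep_types:
            graph[src].append(dst)
    return dict(graph)

def prune(graph, min_links=1):
    # Illustrative pruning: drop articles with very few links; the paper
    # also mentions dropping articles with very little text.
    keep = {n for n, out in graph.items() if len(out) >= min_links}
    return {n: [m for m in out if m in keep] for n, out in graph.items() if n in keep}

# Example: keep only body-text and category links.
edges = [("Jaguar", "Felidae", "text"), ("Jaguar", "Cat", "category"), ("Jaguar", "Panthera", "infobox")]
print(build_graph(edges, keep_types=("text", "category")))
</pre>
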
=== Teleportation Vector ===
The teleport vector represents the initial distribution of mass over articles for a given input text. The paper discusses two ways of initializing the teleport vector.
==== Dictionary Based Initialization ====
Different ways of building a dictionary from words to Wikipedia articles are explored in the paper: using article titles directly, using redirect and disambiguation pages, and using [http://en.wikipedia.org/wiki/Help:Anchor#Section_linking_.28anchors.29 anchors] with different constraints on the percentage of anchors, among others.
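
A simplified sketch of dictionary-based initialization, assuming a precomputed dictionary from surface words to candidate articles (built from titles, redirects, disambiguation pages, or anchors). Here each word's mass is split evenly over its candidates; the paper compares several more refined weighting and filtering choices.

<pre>
from collections import defaultdict

def dictionary_teleport(text, dictionary):
    # dictionary: dict mapping a lower-cased word or phrase to the list of
    # candidate article titles obtained from titles, redirects,
    # disambiguation pages, or anchor text.
    mass = defaultdict(float)
    for word in text.lower().split():
        candidates = dictionary.get(word, [])
        # split the word's unit of mass evenly over its candidate articles
        for article in candidates:
            mass[article] += 1.0 / len(candidates)
    total = sum(mass.values())
    # normalise so the teleport vector sums to 1 (empty if nothing matched)
    return {a: m / total for a, m in mass.items()} if total else {}

# Example with a two-entry toy dictionary.
toy_dict = {"jaguar": ["Jaguar", "Jaguar Cars"], "speed": ["Speed"]}
print(dictionary_teleport("jaguar speed", toy_dict))
</pre>
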
  
==== ESA based initialization ====
ESA is a way to convert a given text into a weighted vector of Wikipedia articles. ESA uses the text of the article bodies, rather than anchor text or article titles, to compute the word-to-concept weights; a TF-IDF scheme is used for this. The ESA scheme was successfully used in this [http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-259.pdf paper] to compute semantic relatedness, and hence it provides a good teleport vector for the given text.
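
A toy sketch of ESA-style initialization: each article (concept) is scored by the TF-IDF weights of the input words in its body text, and the scores are normalised into a teleport vector. The hand-rolled TF-IDF below only illustrates the idea; the real ESA implementation works over an inverted index of all of Wikipedia with additional pruning.

<pre>
import math
from collections import Counter

def esa_teleport(text, article_texts):
    # article_texts: dict article title -> body text (a tiny stand-in for
    # the full Wikipedia collection used by ESA).
    tokenized = {a: body.lower().split() for a, body in article_texts.items()}
    n_docs = len(tokenized)

    # document frequency of each word over the article bodies
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    query = text.lower().split()
    weights = {}
    for article, tokens in tokenized.items():
        tf = Counter(tokens)
        # score the article by the TF-IDF weight of the query words it contains
        score = sum(tf[w] * math.log(n_docs / df[w]) for w in query if df[w])
        if score > 0:
            weights[article] = score

    total = sum(weights.values())
    # normalise the concept scores into a teleport vector
    return {a: s / total for a, s in weights.items()} if total else {}

# Example with two toy "articles".
toy_articles = {"Jaguar": "the jaguar is a big cat", "Jaguar Cars": "jaguar cars is a car maker"}
print(esa_teleport("big cat", toy_articles))
</pre>
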
  
===Experiments===
Many experiments were conducted using different strategies for building the Wikipedia graph and the teleport vector. The paper measures the performance of these configurations on standard datasets for semantic relatedness ([http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/ WordSim 353], the Miller Charles dataset, and the [http://digital.library.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf Lee et al. dataset]).
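
Datasets of this kind are usually scored by correlating the system's relatedness values with the human ratings. The sketch below assumes SciPy's spearmanr and some relatedness function such as the one sketched in the Summary section; it illustrates the evaluation setup rather than the paper's exact protocol.

<pre>
from scipy.stats import spearmanr

def evaluate(pairs, gold_scores, relatedness_fn):
    # pairs: list of (term1, term2); gold_scores: human ratings aligned with
    # pairs; relatedness_fn: a function (term1, term2) -> system score,
    # e.g. built from the walk-based measure sketched above.
    system_scores = [relatedness_fn(a, b) for a, b in pairs]
    rho, _p_value = spearmanr(system_scores, gold_scores)
    return rho

# Hypothetical usage:
# pairs = [("tiger", "cat"), ("book", "paper")]
# gold  = [7.35, 7.46]
# print(evaluate(pairs, gold, my_relatedness))
</pre>
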
  
== DataSets / Resources ==
* [http://malt.ml.cmu.edu/mw/index.php/DBpedia DBPedia]
* [http://malt.ml.cmu.edu/mw/index.php/Wikipedia_Miner Wikipedia-Miner]
* [http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/ WordSim 353]
* Miller Charles dataset (included in WordSim)
* [http://digital.library.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf Lee et al. dataset]
  
 
== Related papers ==
 
* [http://www-cs-students.stanford.edu/~taherh/papers/topic-sensitive-pagerank.pdf Topic Sensitive Page Rank]
* [http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-259.pdf Explicit Semantic Analysis]
* [http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links]
* [ftp://cs.pitt.edu/pub/web/projects/nlp/conf/aaai2006/12/AAAI06-223.pdf WikiRelate! Computing Semantic Relatedness Using Wikipedia]
* [http://www.aclweb.org/anthology/E/E09/E09-1005.pdf Personalizing pagerank for word sense disambiguation]
  
 
== Study plan ==
 
* Article: [http://en.wikipedia.org/wiki/PageRank Page Rank]
* Paper: [http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf Page Rank paper] (Personalized Page Rank Section)
* Paper: [http://www-cs-students.stanford.edu/~taherh/papers/topic-sensitive-pagerank.pdf Topic Sensitive Page Rank]
* Paper: [http://www.aaai.org/Papers/IJCAI/2007/IJCAI07-259.pdf Explicit Semantic Analysis]
 
