Hassan et al, ICWSM 2009

From Cohen Courses
Latest revision as of 01:59, 31 March 2011

Citation

Ahmed Hassan, Dragomir R. Radev, Junghoo Cho, Amruta Joshi. 2009. Content Based Recommendation and Summarization in the Blogosphere. The International Conference on Weblogs and Social Media (ICWSM 2009).

Online version

ICWSM09

Summary

The aim of this paper is to find the important and influential blogs that show recurring interest in a specific topic. Given a set of blogs related to a particular topic, the authors try to find a subset of blogs that represents the larger set, using a stochastic graph-based method.

The authors approached this blog retrieval problem with the assumption that important and representative blogs tend to be lexically similar to other important and representative blogs. They therefore used textual similarity between posts to infer which blogs influence the others, and thereby to determine the authorities.

The authors used a PageRank-like algorithm, called BlogRank, to rank the blogs by their popularity. Each blog is represented as a node, and an edge connects two nodes if the corresponding blogs are lexically similar. Iterating over this graph computes the importance score of each blog from the scores of its neighbors.

(Figure: BlogRank.jpg)
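
The iterative scoring described above can be illustrated with a minimal sketch. It is a generic PageRank-style power iteration over a similarity graph, not the paper's exact BlogRank formula: the function name rank_by_similarity, the damping factor of 0.85, the row normalization, and the way the prior is mixed in are all assumptions made for illustration.

```python
def rank_by_similarity(sim, prior=None, damping=0.85, tol=1e-10, max_iter=1000):
    """PageRank-style power iteration on a weighted similarity graph.

    sim[i][j] is a non-negative similarity between blogs i and j.
    prior is an optional list of non-negative weights (assumed here to
    play the role of a teleportation distribution); it defaults to uniform.
    """
    n = len(sim)
    if prior is None:
        prior = [1.0] * n
    total_prior = sum(prior)
    prior = [p / total_prior for p in prior]

    # Row-normalize edge weights, ignoring self-similarity.
    trans = []
    for i in range(n):
        row = [sim[i][j] if i != j else 0.0 for j in range(n)]
        row_sum = sum(row)
        # A blog with no similar neighbors jumps according to the prior.
        trans.append([w / row_sum for w in row] if row_sum > 0 else list(prior))

    scores = list(prior)
    for _ in range(max_iter):
        # Each blog's new score is built from the scores of its neighbors.
        new_scores = [
            (1 - damping) * prior[j]
            + damping * sum(scores[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


# Toy graph: blog 0 is similar to both other blogs, which are not
# similar to each other, so blog 0 should receive the highest score.
toy_sim = [
    [1.0, 0.8, 0.6],
    [0.8, 1.0, 0.0],
    [0.6, 0.0, 1.0],
]
toy_scores = rank_by_similarity(toy_sim)
```

In this toy example the scores sum to one, and the blog that is lexically close to the most other blogs ends up ranked first.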

Cosine similarity between the tf-idf vector representations of posts is used to calculate the text similarity between posts. The authors also used blog-related attributes, such as the number of posts and the average post length, as priors. The BlogRank algorithm takes diversity into account and penalizes blogs that are very similar to already selected blogs.
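
As a rough illustration of the similarity computation, the sketch below builds tf-idf vectors for a few toy posts and compares them with cosine similarity. The whitespace tokenizer, the smoothed idf variant, and the helper names tfidf_vectors and cosine are assumptions; the paper's exact weighting scheme is not given here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) from whitespace-tokenized texts."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many posts each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed idf (an assumed variant) keeps every weight positive.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up posts, for illustration only.
posts = [
    "python graph ranking tutorial",
    "graph ranking with python code",
    "chocolate cake recipe",
]
vecs = tfidf_vectors(posts)
sim_related = cosine(vecs[0], vecs[1])    # posts share several terms
sim_unrelated = cosine(vecs[0], vecs[2])  # posts share no terms
```

Pairwise values computed this way could serve as the edge weights of the blog graph, for example after aggregating post-level similarities per blog.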

The TREC BLOG06 and UCLA Blogocenter datasets were used in the experiments. The authors used diffusion models to measure the performance of their algorithm: they first marked the selected nodes as active, then applied the diffusion model, and finally counted the number of activated nodes.
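
The evaluation procedure can be sketched as follows. The summary does not say which diffusion model was used, so this example assumes an independent cascade model with a single fixed activation probability; the function names, the toy graph, and the parameter values are all hypothetical.

```python
import random


def independent_cascade(neighbors, seeds, prob, rng):
    """Run one cascade and return the set of nodes active at the end.

    neighbors maps each node to the nodes it can try to activate.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for nb in neighbors.get(node, ()):
                # Each newly active node gets one try per inactive neighbor.
                if nb not in active and rng.random() < prob:
                    active.add(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
    return active


def expected_coverage(neighbors, seeds, prob, runs=200, seed=0):
    """Average number of activated nodes over repeated cascades."""
    rng = random.Random(seed)
    total = sum(
        len(independent_cascade(neighbors, seeds, prob, rng))
        for _ in range(runs)
    )
    return total / runs


# Toy graph: node "e" is isolated and can never be reached from "a".
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b"],
    "e": [],
}
coverage = expected_coverage(toy_graph, ["a"], prob=0.5)
```

A selected set of blogs that activates more nodes on average covers more of the graph, which is the sense in which coverage is compared across methods.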

The authors compared their ranking algorithm against several other algorithms. The experiments showed that BlogRank outperforms the other methods in both coverage and running time. They also ran experiments to see whether BlogRank can be used for prediction; the results indicated that the method generalizes well to the future.

This work is similar to the Blog Distillation task in the TREC Blog Track. However, in the blog distillation task the aim is to return all blogs relevant to a given query, whereas in this paper the aim is to select a smaller set of blogs from a given set of blogs related to a topic. Some related works are Arguello et al, ICWSM 2008 and Elsas et al, TREC 2007.