Difference between revisions of "Roja Bandari et. al. ICWSM 2012"

Revision as of 21:22, 26 September 2012

Citation

R Bandari, S Asur, BA Huberman. The Pulse of News in Social Media: Forecasting Popularity. ICWSM 2012.


Summary

In this paper, the authors address the problem of predicting the popularity of news articles prior to their release. They extract features from the content of each article and use two methods, regression and classification, to predict its popularity. They evaluate the predictions against the actual popularity of the articles on social media such as Twitter.

Datasets

They collected all news articles from August 8th to 16th using the API of a news aggregator called FeedZilla. Each article includes a title, a short summary, a URL, a timestamp, and a category. After cleaning, the dataset contains over 42,000 articles.

They then used a service called Topsy to collect the number of times each news article was posted and retweeted on Twitter.

Features

- The category of the article

Examples are technology and entertainment. They define a measure called t-density, the average number of tweets per link in a particular category, to quantify the popularity of that category.

- Subjectivity of the article

They use an existing NLP toolkit to assign each article a binary label: subjective or objective.

- Named entities

They use t-density to measure the popularity of each named entity. Then, for a given article, they calculate a score based on the named entities extracted from that article.

- The source of the news

The t-density is also used to measure the popularity of a news source, such as a weblog.
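
The t-density described above (the average number of tweets per link within a group) can be sketched in a few lines of Python. This is only an illustration of the definition as summarized here, not the authors' code: the function name `t_density`, the record fields `category`, `source`, and `tweets`, and the sample values are all made up for the example, and the paper may apply additional normalization.

```python
from collections import defaultdict

def t_density(articles, key):
    """Average number of tweets per link, grouped by the field `key`.

    `articles` is a list of dicts; each dict stands for one link and
    carries a hypothetical "tweets" count plus the grouping field.
    """
    totals = defaultdict(int)   # total tweets per group
    counts = defaultdict(int)   # number of links per group
    for article in articles:
        group = article[key]
        totals[group] += article["tweets"]
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up records, for illustration only.
articles = [
    {"category": "technology", "source": "blog-a", "tweets": 30},
    {"category": "technology", "source": "blog-b", "tweets": 10},
    {"category": "entertainment", "source": "blog-a", "tweets": 5},
]

by_category = t_density(articles, "category")
by_source = t_density(articles, "source")
print(by_category)
print(by_source)
```

The same helper covers both the category and the source feature because only the grouping key changes. A per-entity version would need one record per (article, entity) pair, since an article can mention several named entities.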

Prediction

The authors use two methods to predict the popularity of a given article:

- Regression. They perform three kinds of regression (linear regression, SVM regression, and kNN regression) to predict the exact number of tweets for a given news article. The results are not satisfactory: the R^{2} value is typically below 0.5.
- Classification. They divide news popularity into three classes: A (1-20 tweets), B (20-100 tweets), and C (more than 100 tweets). They try several classification algorithms; the best result is an accuracy of about 84%.
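
To make the two evaluation setups concrete, here is a minimal Python sketch of the class binning and of the R^{2} score. The function names are hypothetical and the sample numbers are made up. The class boundaries as listed overlap at 20 tweets, so the sketch assumes 20 belongs to class A and 100 to class B; the paper may draw the boundaries differently. Articles with no tweets fall outside the listed classes and are returned as `None` here, which is also an assumption.

```python
def popularity_class(tweets):
    """Map a tweet count to class A, B, or C (boundary handling is assumed)."""
    if tweets < 1:
        return None   # not covered by the three listed classes
    if tweets <= 20:
        return "A"    # 1-20 tweets
    if tweets <= 100:
        return "B"    # 21-100 tweets under the assumption above
    return "C"        # more than 100 tweets

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) when all actual values are identical.
    """
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up values, for illustration only.
labels = [popularity_class(n) for n in (3, 20, 21, 100, 250)]
score = r_squared([10, 20, 30], [12, 18, 33])
print(labels)
print(score)
```

An R^{2} below 0.5 means the regression explains less than half of the variance in tweet counts, which is one reason coarse classes can be an easier target than exact counts.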