Roja Bandari et al. ICWSM 2012

Citation

R. Bandari, S. Asur, and B. A. Huberman. The Pulse of News in Social Media: Forecasting Popularity. ICWSM 2012.


Summary

In this paper, the authors address the following problem: predicting the popularity of news articles before they are released. They extract features from each article based on its content and use two methods, regression and classification, to predict its popularity. The predictions are evaluated against the article's actual popularity on social media, measured on Twitter.

Datasets

They collected all news articles published from August 8th to 16th using the API of a news aggregator called FeedZilla. Each article includes a title, a short summary, a URL, a timestamp, and a category. After cleaning, the dataset contains over 42,000 articles.

They then used a service called Topsy to collect the number of times each news article was posted and retweeted on Twitter.

Features

- News Category

Examples include technology and entertainment. They define a measure called t-density, the average number of tweets per article link in a given category, and use it to quantify the popularity of that category.

- Subjectivity 

They use an existing NLP toolkit to assign each article a binary label: subjective or objective.

- Named Entity

They use t-density to measure the popularity of each named entity. For a given article, they then compute a score from the named entities extracted from it.

- The News Source 

The t-density is also used to measure the popularity of a news source, such as a weblog.
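
The t-density idea used for categories and sources can be sketched as a grouped average. This is a minimal illustration, not the authors' code: the record layout, the field names ("category", "source", "tweets"), and the helper name t_density are assumptions made for the example, and the data is a made-up toy set.

```python
from collections import defaultdict


def t_density(articles, key):
    """Average number of tweets per article, grouped by articles[key].

    `articles` is a list of dicts; `key` names the grouping field
    (hypothetical field names, chosen for this sketch only).
    """
    tweet_totals = defaultdict(int)
    article_counts = defaultdict(int)
    for article in articles:
        group = article[key]
        tweet_totals[group] += article["tweets"]
        article_counts[group] += 1
    return {g: tweet_totals[g] / article_counts[g] for g in tweet_totals}


# Toy records, invented for illustration.
articles = [
    {"category": "technology", "source": "blog-a", "tweets": 30},
    {"category": "technology", "source": "agency-b", "tweets": 10},
    {"category": "entertainment", "source": "agency-b", "tweets": 5},
]

category_density = t_density(articles, "category")
source_density = t_density(articles, "source")
```

The same helper is reused for both groupings because the measure is the same in each case; only the grouping field changes.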

Prediction

The authors use two methods to predict the popularity of a given article:

- Regression

They apply three kinds of regression (linear regression, SVM regression, and kNN regression) to predict the exact number of tweets a news article will receive. The results are unsatisfactory: the R^{2} value is typically below 0.5.

- Classification

They divide article popularity into three classes: A (1-20 tweets), B (20-100 tweets), and C (more than 100 tweets). They try several classification algorithms; the best reaches an accuracy of about 84%.
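
The two evaluation setups above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation. The function names are hypothetical, and because the class ranges as stated overlap at 20 and 100, the sketch assumes that exactly 20 tweets falls in A and exactly 100 in B. Articles with no tweets are outside the three classes and are rejected here.

```python
def popularity_class(tweet_count):
    """Map a tweet count to class A, B, or C.

    Boundary handling (20 -> A, 100 -> B) is an assumption of this sketch.
    """
    if tweet_count < 1:
        raise ValueError("the three classes only cover articles with at least 1 tweet")
    if tweet_count <= 20:
        return "A"
    if tweet_count <= 100:
        return "B"
    return "C"


def accuracy(predicted, actual):
    """Fraction of articles whose predicted class matches the actual class."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

An R^{2} of 0 means the predictions do no better than always predicting the mean tweet count, which gives a sense of how weak a value below 0.5 is for exact-count prediction.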

Discussion

The authors highlight the following findings from the experiments:

1. Traditionally popular news agencies are not necessarily the most popular news sources on Twitter. The authors calculate the t-density for well-known traditional news agencies such as the Wall Street Journal and find that they are less popular than some tech blogs, such as Mashable and Google Blogs.

2. The most significant predictor is the source of the news.

3. The category feature doesn't perform well. The authors suspect that the categories assigned by FeedZilla are not very accurate.

Related Papers

[Leskovec, Backstrom, and Kleinberg 2009] Leskovec, J.; Backstrom, L.; and Kleinberg, J. M. 2009. Meme-tracking and the dynamics of the news cycle. In KDD, 497–506. ACM.

[Lerman and Ghosh 2010] Lerman, K., and Ghosh, R. 2010. Information contagion: An empirical study of the spread of news on Digg and Twitter social networks. In ICWSM. The AAAI Press.


Involved Toolkits

[Alias-i. 2008] Alias-i. 2008. LingPipe 4.1.0 [1], used for subjectivity detection.

Named Entity Extractor [2]