Roja Bandari et al. ICWSM 2012
Citation
R. Bandari, S. Asur, and B. A. Huberman. The Pulse of News in Social Media: Forecasting Popularity. ICWSM 2012.
Summary
In this paper, the authors address the following problem: predicting the popularity of news articles prior to their release. They extract features from each article based on its content, apply two prediction methods (regression and classification), and evaluate the predictions against the articles' actual popularity on social media, measured on Twitter.
Datasets
They collected all news articles published from August 8th to 16th using the API of a news aggregator called FeedZilla. Each article includes a title, a short summary, a URL, a timestamp, and a category. After cleaning, the dataset contains over 42,000 articles.
They then used a service called Topsy to collect the number of times each news article was posted and retweeted on Twitter.
Features
- News Category
Examples are technology and entertainment. They define a measure called t-density, the average number of tweets per article link in a particular category, and use it to quantify the popularity of that category.
- Subjectivity
They use an existing NLP toolkit to assign each article a binary label: subjective or objective.
- Named Entities
They use t-density to measure the popularity of each named entity. For a given article, they then compute a score from the named entities extracted from it.
- The News Source
The t-density is also used to measure the popularity of a news source, such as a weblog.
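The t-density measure used by several of the features above can be sketched as follows. This is only an illustration, assuming t-density is the total tweet count of a group divided by the number of articles in that group; the function name `t_density`, the record fields, and the sample data are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def t_density(articles, key):
    """Return the average number of tweets per article for each value of `key`.

    Assumption: t-density = total tweets in a group / number of articles in it.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for article in articles:
        totals[article[key]] += article["tweets"]
        counts[article[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical sample records, not data from the paper.
articles = [
    {"category": "technology", "source": "blog-a", "tweets": 120},
    {"category": "technology", "source": "blog-b", "tweets": 30},
    {"category": "health", "source": "blog-b", "tweets": 6},
]

category_density = t_density(articles, "category")
source_density = t_density(articles, "source")
```

The same helper covers both the category and the source feature because only the grouping key changes.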
Prediction
The authors use two methods to predict the popularity of a given article:
- Regression
They perform three kinds of regression (linear regression, SVM regression, and kNN regression) to predict the exact number of tweets a news article will receive. The results are not satisfactory: the R^{2} value is typically below 0.5.
- Classification
They divide article popularity into three classes: A (1-20 tweets), B (20-100 tweets), and C (more than 100 tweets). They apply several classification algorithms; the best result is an accuracy of about 84%.
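The two evaluation measures mentioned above, R^{2} for regression and accuracy for classification, can be sketched as below. This is a minimal illustration, not the authors' code. The class ranges overlap at 20 and 100 tweets as stated, so the sketch assumes that exactly 20 tweets falls in class A and exactly 100 in class B. The function names are hypothetical.

```python
def popularity_class(tweets):
    """Map a tweet count to class A, B, or C.

    Assumption: boundary values 20 and 100 go to the lower class.
    """
    if tweets < 1:
        raise ValueError("classes are only defined for at least 1 tweet")
    if tweets <= 20:
        return "A"
    if tweets <= 100:
        return "B"
    return "C"


def accuracy(actual, predicted):
    """Fraction of predicted class labels that match the actual labels."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

An R^{2} of 1 means the predicted tweet counts match the actual ones exactly, while a value of 0 means the model does no better than always predicting the mean.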
Discussion
The authors report the following observations from the experiments:
1. Traditionally prominent news agencies are not necessarily the most popular news sources on Twitter. The authors calculate the t-density for well-known traditional news agencies such as the Wall Street Journal and find that they are less popular than some tech blogs, such as Mashable and Google Blogs.
2. The most significant predictor is the source of the news.
3. The category feature does not perform well. The authors suspect that the categories assigned by FeedZilla are not very accurate.
Related Papers
[Leskovec, Backstrom, and Kleinberg 2009] Leskovec, J.; Backstrom, L.; and Kleinberg, J. M. 2009. Meme-tracking and the dynamics of the news cycle. In KDD, 497–506. ACM.
[Lerman and Ghosh 2010] Lerman, K., and Ghosh, R. 2010. Information contagion: An empirical study of the spread of news on Digg and Twitter social networks. In ICWSM. The AAAI Press.
Involved Toolkits
[Alias-i. 2008] Alias-i. 2008. LingPipe 4.1.0 [1], used for subjectivity detection
Named Entity Extractor [2]