Leskovec, Backstrom and Kleinberg KDD 09 News and Blog dataset
This is a description of a dataset of news and blog articles that is first mentioned in Leskovec, J., L. Backstrom, and J. Kleinberg. 2009. Meme-tracking and the Dynamics of the News Cycle. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 497–506.
The dataset covers three months of online mainstream and social media activity, from August 1 to October 31, 2008, with about 1 million documents per day. In total it consists of 90 million documents (blog posts and news articles) from 1.65 million different sites, obtained through the Spinn3r API. The total dataset size is 390 GB and essentially includes complete online media coverage, i.e. all mainstream media sites that are part of Google News (20,000 different sites) plus 1.6 million blogs, forums and other media sites.
From the dataset the authors extract a total of 112 million quotes and discard those with fewer than 4 words or a frequency below 10. This leaves 47 million phrases, of which 22 million are distinct. The authors cluster these phrases using the method outlined in their paper to produce a DAG with 35,800 non-trivial components (clusters with at least two phrases) that together include 94,700 nodes (phrases).
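The filtering rule above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `filter_phrases`, the constants, and the use of whitespace splitting to count words are all assumptions made for illustration, and the paper may tokenize and count differently.

```python
from collections import Counter

# Thresholds taken from the description above; the names are hypothetical.
MIN_WORDS = 4
MIN_FREQUENCY = 10


def filter_phrases(phrases, min_words=MIN_WORDS, min_frequency=MIN_FREQUENCY):
    """Keep phrases with at least `min_words` words and `min_frequency` occurrences.

    `phrases` is an iterable of quoted strings, one entry per occurrence.
    Words are counted by whitespace splitting, which is an assumption.
    Returns a Counter mapping each surviving distinct phrase to its frequency.
    """
    counts = Counter(phrases)
    return Counter({
        phrase: n
        for phrase, n in counts.items()
        if len(phrase.split()) >= min_words and n >= min_frequency
    })


if __name__ == "__main__":
    sample = (
        ["yes we can change the world"] * 10  # kept: 6 words, frequency 10
        + ["yes we can"] * 20                 # dropped: only 3 words
        + ["lipstick on a pig"] * 9           # dropped: frequency below 10
    )
    kept = filter_phrases(sample)
    print("distinct phrases kept:", len(kept))
    print("total occurrences kept:", sum(kept.values()))
```

In this sketch the total number of surviving occurrences corresponds to the "47 million phrases" figure and the number of distinct keys to the "22 million distinct" figure.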
The authors also label each of the 1.6 million sites as news media or blog, using the following rule: if a site appears on Google News it is labeled as news media, and otherwise as a blog. There are only 20,000 different news sites in Google News, which is a tiny number compared to the 1.65 million sites the authors track, but they find that these news sites generate about 30% of the total number of documents in the dataset.
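The labeling rule is simple enough to sketch in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: `label_site`, `news_document_share`, the label strings, and the example site names are hypothetical, and the set of Google News sites is assumed to be available as a plain Python set.

```python
def label_site(site, google_news_sites):
    """Label a site "news" if it appears in the Google News set, else "blog"."""
    return "news" if site in google_news_sites else "blog"


def news_document_share(document_sites, google_news_sites):
    """Fraction of documents that come from sites labeled "news".

    `document_sites` holds one site name per document (hypothetical input format).
    """
    document_sites = list(document_sites)
    if not document_sites:
        return 0.0
    news_docs = sum(
        1 for site in document_sites
        if label_site(site, google_news_sites) == "news"
    )
    return news_docs / len(document_sites)


if __name__ == "__main__":
    # Made-up example data: one news site, two blogs, ten documents in total.
    google_news = {"news.example.com"}
    docs = ["news.example.com"] * 3 + ["blog-a.example.org"] * 4 + ["blog-b.example.org"] * 3
    print(label_site("blog-a.example.org", google_news))
    print(news_document_share(docs, google_news))
```

The toy data is chosen so that a small minority of sites produces a disproportionate share of documents, mirroring the imbalance described above.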