Leskovec, Backstrom and Kleinberg KDD 09 News and Blog dataset

From Cohen Courses
 
== Summary ==
 
This is a description of a dataset of news and blog articles that is first mentioned in [[ Leskovec, J., L. Backstrom, and J. Kleinberg. 2009. Meme-tracking and the Dynamics of the News Cycle. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 497–506. ]]
== Description ==
The dataset covers three months of online mainstream and social media activity, from August 1 to October 31, 2008, with about 1 million documents per day. In total it consists of 90 million documents (blog posts and news articles) from 1.65 million different sites, obtained through the Spinn3r API. The total dataset size is 390 GB, and it essentially includes complete online media coverage: all mainstream media sites that are part of Google News (20,000 different sites) plus 1.6 million blogs, forums, and other media sites.
From the dataset the authors extracted a total of 112 million quotes and discarded those with fewer than 4 words or a frequency of less than 10. This leaves 47 million phrases, of which 22 million are distinct. The authors cluster these phrases using the method outlined in their paper to produce a DAG with 35,800 non-trivial components (clusters with at least two phrases) that together include 94,700 nodes (phrases).
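The length and frequency filtering described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual implementation; the function name and signature are made up for this example, and it assumes quotes are plain whitespace-separated strings.

```python
from collections import Counter

def filter_quotes(quotes, min_words=4, min_freq=10):
    """Keep only quotes with at least min_words words and at least
    min_freq total occurrences in the corpus (a sketch of the
    filtering step described in the paper, not the original code)."""
    freq = Counter(quotes)  # frequency of each quote across the corpus
    return [q for q in quotes
            if len(q.split()) >= min_words and freq[q] >= min_freq]
```

Applied to the full corpus of 112 million extracted quotes, a filter like this would leave the 47 million phrase occurrences (22 million distinct) that the authors then cluster.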
  
 
== Relevant Papers ==
 

Revision as of 18:18, 22 April 2011
