TREC Blogs06 dataset
The TREC Blogs06 Collection: Creating and Analysing a Blog Test Collection
Macdonald, C., and Ounis, I. (2006) The TREC Blogs06 Collection: Creating and Analysing a Blog Test Collection. Technical Report. Dept of Computing Science, University of Glasgow.
Full text not currently available from Enlighten.
Abstract: The explosion of blogs on the Web in recent years has fostered research interest in the Information Retrieval (IR) and other communities into the properties of the so-called 'blogsphere'. However, without any standard test collection available, research has been restricted to unshared collections gathered by individual research groups. With the advent of the Blog Track running at TREC 2006, there was a need to create a test collection of blog data that could be shared among participants and form the backbone of the experiments. Such a collection should be a realistic snapshot of the blogsphere, containing enough blogs to exhibit recognisable properties of the blogsphere, and spanning a long enough time period that events are recognisable. In addition, the collection should exhibit other properties of the blogsphere, such as splogs and comment spam. This paper describes the creation of the Blogs06 collection by the University of Glasgow, and reports statistics of the collected data. Moreover, we demonstrate how some characteristics of the collection vary across the spam and non-spam components of the collection.
The data can be found here.