The Heymann dataset was obtained by crawling the [del.icio.us] website from September 2006 to the end of July 2007. It consists of three parts:
- Dataset C(rawl): this dataset was built by a breadth-first search starting from the tag "web". The crawling process is described in the paper "Can Social Bookmarking Improve Web Search?". The dataset consists of 22,588,354 posts and 1,371,941 unique URLs.
- Dataset R(ecent): this dataset was built by crawling the "recent feed" of [del.icio.us] for 8 months starting from September 28th, 2006. It contains 11,613,913 posts and 3,004,998 unique URLs.
- Dataset M(onth): this dataset was built by continuously crawling the recent feed of [del.icio.us]. It contains 3,630,250 posts and 2,549,282 unique URLs.
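The breadth-first search used for Dataset C can be illustrated with a minimal sketch. This is not the authors' crawler, whose details are given in the paper; it only shows the general traversal order, starting from a seed tag. The toy graph `TOY_LINKS`, the function `bfs_crawl`, and the `max_nodes` parameter are hypothetical names introduced for illustration, and no real del.icio.us data or API is involved.

```python
from collections import deque

# Hypothetical toy link graph (an assumption for illustration only):
# each tag maps to the tags reachable from its page.
TOY_LINKS = {
    "web": ["design", "tools"],
    "design": ["css", "web"],
    "tools": ["css"],
    "css": [],
}


def bfs_crawl(seed, neighbors, max_nodes=None):
    """Return nodes in the order a breadth-first crawl from `seed` visits them.

    `neighbors` is a callable returning the nodes linked from a given node.
    `max_nodes` optionally caps the number of visited nodes.
    """
    visited = {seed}          # nodes already queued, to avoid revisiting
    order = []                # visit order
    queue = deque([seed])     # FIFO frontier gives breadth-first order
    while queue:
        node = queue.popleft()
        order.append(node)
        if max_nodes is not None and len(order) >= max_nodes:
            break
        for nxt in neighbors(node):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return order


if __name__ == "__main__":
    # Start from the seed tag "web", as in the description of Dataset C.
    print(bfs_crawl("web", lambda n: TOY_LINKS.get(n, [])))
```

The FIFO queue is what makes the traversal breadth-first: everything directly linked from the seed is visited before anything two steps away.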
Link: Paul Heymann's web page (owner of the above datasets).