Heymann Dataset

From Cohen Courses
Revision as of 02:41, 20 April 2010 by PastStudents (talk | contribs)

The Heymann dataset was obtained by crawling the [del.icio.us] website from September 2006 to the end of July 2007. The dataset contains three parts:

- Dataset C(rawl): this dataset was built by a breadth-first search starting from the tag "web". The crawling process is described in the paper "Can Social Bookmarking Improve Web Search?". The dataset consists of 22,588,354 posts and 1,371,941 unique URLs.

- Dataset R(ecent): this dataset was built by crawling the "recent feed" of [del.icio.us] for 8 months starting from September 28th, 2006. It contains 11,613,913 posts and 3,004,998 unique URLs.

- Dataset M(onth): this dataset was built by continuously crawling the recent feed of [del.icio.us]. It contains 3,630,250 posts and 2,549,282 unique URLs.
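
To compare the three parts, the counts listed above can be turned into an average number of posts per unique URL. The following is a minimal sketch, not part of the dataset distribution: the names DATASETS and posts_per_url are hypothetical, and the unique-URL count for Dataset R is assumed to be 3,004,998.

```python
# Post and unique-URL counts as listed in the dataset description.
# Assumption: the unique-URL count for Dataset R is read as 3,004,998.
DATASETS = {
    "C": {"posts": 22_588_354, "unique_urls": 1_371_941},
    "R": {"posts": 11_613_913, "unique_urls": 3_004_998},
    "M": {"posts": 3_630_250, "unique_urls": 2_549_282},
}


def posts_per_url(counts):
    """Return the average number of posts per unique URL."""
    return counts["posts"] / counts["unique_urls"]


# Hypothetical helper output: one ratio per dataset part.
ratios = {name: posts_per_url(counts) for name, counts in DATASETS.items()}

for name, ratio in ratios.items():
    print(f"Dataset {name}: {ratio:.2f} posts per unique URL")
```

With these counts, Dataset C averages noticeably more posts per unique URL than Dataset R, which in turn averages more than Dataset M.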

Link: web page of Paul Heymann, owner of the above datasets.

Relevant Papers