Standard Citation Datasets
From Cohen Courses
Latest revision as of 04:08, 7 December 2011
The following two datasets are considered standard benchmarks for the problem of Citation Matching. In citation matching, a cluster is a set of citations that refer to the same paper, and a nontrivial cluster is one that contains more than one citation.
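To make the cluster terminology concrete, here is a minimal Python sketch that computes the kind of summary figures quoted for each dataset on this page (number of citations, number of clusters, singleton count, largest cluster size). The function name `cluster_stats`, the dictionary input format, and the toy labels are all assumptions made for illustration; they do not reflect the actual file format of either dataset.

```python
from collections import Counter


def cluster_stats(labels):
    """Summarize a citation-matching labeling.

    `labels` is a hypothetical mapping from citation id to the id of the
    paper (cluster) that the citation refers to.
    """
    # Count how many citations fall into each cluster.
    sizes = Counter(labels.values())
    n_clusters = len(sizes)
    # A nontrivial cluster contains more than one citation.
    n_nontrivial = sum(1 for n in sizes.values() if n > 1)
    return {
        "citations": len(labels),
        "clusters": n_clusters,
        "singletons": n_clusters - n_nontrivial,
        "nontrivial": n_nontrivial,
        "largest": max(sizes.values()) if sizes else 0,
    }


# Toy labeling (made-up data, not taken from CiteSeer or Cora).
toy = {
    "c1": "paperA",
    "c2": "paperA",
    "c3": "paperB",
    "c4": "paperC",
    "c5": "paperA",
}
stats = cluster_stats(toy)
print(stats)
```

In the toy labeling, three citations point to the same paper and form one nontrivial cluster, while the other two citations are singleton clusters.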
CiteSeer Dataset
The CiteSeer dataset has 1563 citations and 906 clusters. It contains four sections, each on a different topic. Over two-thirds of the clusters are singletons; the largest cluster has 21 citations.
Cora Dataset
The Cora dataset has 1295 citations and 134 clusters. Almost every citation in Cora belongs to a nontrivial cluster; the largest cluster contains 54 citations.
One of the papers that uses these datasets is Joint Inference in Information Extraction.
The dataset can be downloaded from here