Difference between revisions of "Standard Citation Datasets"
From Cohen Courses
Revision as of 04:06, 7 December 2011
The following two datasets are considered standard benchmarks for the problem of Citation Matching. In citation matching, a cluster is a set of citations that refer to the same paper, and a nontrivial cluster contains more than one citation.
CiteSeer Dataset
The CiteSeer dataset has 1563 citations and 906 clusters. It contains four sections, each on a different topic. Over two-thirds of the clusters are singletons (clusters with exactly one citation); the largest cluster has 21 citations.
Cora Dataset
The Cora dataset has 1295 citations and 134 clusters. Almost every citation in Cora belongs to a nontrivial cluster; the largest cluster contains 54 citations.
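The statistics quoted above (number of clusters, singletons, nontrivial clusters, largest cluster) can all be derived from a list giving one cluster label per citation. The following is a minimal sketch, not the format or tooling shipped with either dataset: the function name cluster_stats and the toy labels are hypothetical and chosen only for illustration. The last two lines just divide the citation and cluster counts reported above to get the average cluster size for each dataset.

```python
from collections import Counter


def cluster_stats(labels):
    """Summarize a clustering given one cluster (paper) label per citation.

    `labels` is a hypothetical input format: labels[i] identifies the paper
    that citation i refers to. The real datasets may be stored differently.
    """
    sizes = Counter(labels)  # cluster label -> number of citations in it
    n_citations = len(labels)
    n_clusters = len(sizes)
    # A nontrivial cluster contains more than one citation.
    nontrivial = [s for s in sizes.values() if s > 1]
    return {
        "citations": n_citations,
        "clusters": n_clusters,
        "singletons": n_clusters - len(nontrivial),
        "nontrivial_clusters": len(nontrivial),
        "largest_cluster": max(sizes.values()) if sizes else 0,
        "mean_cluster_size": n_citations / n_clusters if n_clusters else 0.0,
    }


# Toy, made-up labels (not taken from CiteSeer or Cora).
toy_labels = ["p1", "p1", "p2", "p3", "p3", "p3"]
toy_stats = cluster_stats(toy_labels)

# Average cluster size implied by the counts reported on this page.
citeseer_mean = 1563 / 906  # about 1.73 citations per cluster
cora_mean = 1295 / 134      # about 9.66 citations per cluster
```

The two averages reflect the contrast described above: CiteSeer is dominated by singletons, while almost every Cora citation sits in a nontrivial cluster.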
One of the papers that uses these datasets is Joint Inference in Information Extraction.
The dataset can be downloaded from here