Standard Citation Datasets

From Cohen Courses

Latest revision as of 05:08, 7 December 2011

The following two datasets are considered standard benchmarks for the problem of Citation Matching. In citation matching, a cluster is a set of citations that refer to the same paper, and a nontrivial cluster is one that contains more than one citation.
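
As a minimal sketch of these definitions, the snippet below computes the kind of summary statistics quoted for each dataset (number of citations, number of clusters, singleton and nontrivial clusters, largest cluster size). It assumes a hypothetical representation in which each citation id maps to the id of the paper it refers to; the function name `cluster_stats` and the toy data are illustrative only and do not reflect the actual file format of either dataset.

```python
from collections import Counter


def cluster_stats(labels):
    """Summarize a clustering given as {citation_id: cluster_id}.

    The input format is an assumption made for illustration; the real
    CiteSeer and Cora files may be laid out differently.
    """
    # Count how many citations fall into each cluster.
    sizes = Counter(labels.values())
    n_clusters = len(sizes)
    # A singleton (trivial) cluster contains exactly one citation.
    singletons = sum(1 for size in sizes.values() if size == 1)
    return {
        "citations": len(labels),
        "clusters": n_clusters,
        "singleton_clusters": singletons,
        "nontrivial_clusters": n_clusters - singletons,
        "largest_cluster": max(sizes.values(), default=0),
    }


# Toy example: five citations referring to three distinct papers.
toy = {
    "c1": "paperA",
    "c2": "paperA",
    "c3": "paperB",
    "c4": "paperC",
    "c5": "paperA",
}
stats = cluster_stats(toy)
print(stats)
```

In the toy data, paperA forms the only nontrivial cluster (three citations), while paperB and paperC are singletons.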

CiteSeer Dataset

The CiteSeer dataset has 1563 citations and 906 clusters. It contains four sections, each on a different topic. Over two-thirds of the clusters are singletons; the largest cluster has 21 citations.

Cora Dataset

The Cora dataset has 1295 citations and 134 clusters. Almost every citation in Cora belongs to a nontrivial cluster; the largest cluster contains 54 citations.

One of the papers that uses these datasets is Joint Inference in Information Extraction.

The dataset can be downloaded from here