Standard Citation Datasets

The following two datasets are standard benchmarks for the problem of citation matching. In citation matching, a cluster is a set of citations that refer to the same paper, and a nontrivial cluster is one that contains more than one citation.
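The definitions above can be made concrete with a minimal sketch. It assumes each citation carries a label naming the paper it refers to; the function name `cluster_stats` and the toy labels are hypothetical and are not part of either dataset's distribution format.

```python
from collections import Counter

def cluster_stats(labels):
    """Summarize a clustering given one cluster label per citation."""
    sizes = Counter(labels)  # cluster label -> number of citations in it
    # A nontrivial cluster contains more than one citation.
    nontrivial = [s for s in sizes.values() if s > 1]
    return {
        "citations": len(labels),
        "clusters": len(sizes),
        "singletons": len(sizes) - len(nontrivial),
        "nontrivial": len(nontrivial),
        "largest": max(sizes.values(), default=0),
    }

# Toy labels (made up for illustration): three citations of paper "a",
# one citation of paper "b", and two citations of paper "c".
toy_labels = ["a", "a", "a", "b", "c", "c"]
stats = cluster_stats(toy_labels)
print(stats)
```

On the toy labels this reports three clusters, of which one is a singleton and two are nontrivial, and the largest has three citations.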

CiteSeer Dataset

The CiteSeer dataset has 1563 citations and 906 clusters. It contains four sections, each on a different topic. Over two-thirds of the clusters are singletons, and the largest cluster has 21 citations.

Cora Dataset

The Cora dataset has 1295 citations and 134 clusters. Almost every citation in Cora belongs to a nontrivial cluster; the largest cluster contains 54 citations.
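As a rough comparison, the average cluster size follows directly from the counts quoted on this page. The sketch below only divides citations by clusters; the variable names are illustrative.

```python
# (citations, clusters) as quoted on this page
datasets = {"CiteSeer": (1563, 906), "Cora": (1295, 134)}

# Average number of citations per cluster for each dataset
avg_size = {name: citations / clusters
            for name, (citations, clusters) in datasets.items()}

for name, value in avg_size.items():
    print(f"{name}: {value:.2f} citations per cluster")
# CiteSeer: 1.73 citations per cluster
# Cora: 9.66 citations per cluster
```

The gap reflects the difference noted above: most CiteSeer clusters are singletons, while almost every Cora citation sits in a nontrivial cluster.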

One of the papers that uses these datasets is "Joint Inference in Information Extraction".

The dataset can be downloaded from here
