Bethard cikm2010

Revision as of 16:26, 1 April 2011

Citation

  • Title: Who Should I Cite? Learning Literature Search Models from Citation Behavior
  • Authors: S. Bethard and D. Jurafsky
  • Venue: CIKM 2010

Summary

This paper describes a retrieval model for finding relevant existing work in a collection of scientific articles. The authors argue that the model is useful when a researcher wants to conduct new research outside his/her area of expertise and needs to become familiar with prior work in the field. The model incorporates various textual and metadata features and uses citation networks to learn the weights of these features.

Dataset

ACL Anthology (~11,000 papers)

Model

Documents are ranked by their scores.
For scoring, the authors used a linear model over a query Q (the project idea) and a document D (an existing scientific article):

score(Q, D) = Σ_i w_i × f_i(Q, D)

where the f_i(Q, D) are feature values and the w_i are learned weights.
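As a concrete illustration, the linear model score(Q, D) = Σ_i w_i × f_i(Q, D) and the ranking step can be sketched as below. The feature names, feature values, and weights are hypothetical placeholders, not the paper's actual features or learned weights.

```python
def score(features, weights):
    """score(Q, D) = sum_i w_i * f_i(Q, D) for one (Q, D) pair."""
    return sum(weights[name] * value for name, value in features.items())

def rank_documents(feature_vectors, weights):
    """Rank candidate documents by descending linear score."""
    scored = [(doc_id, score(feats, weights))
              for doc_id, feats in feature_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical feature values f_i(Q, D) for three candidate documents.
features = {
    "doc_a": {"tfidf": 0.8, "citations": 0.3, "recency": 0.5},
    "doc_b": {"tfidf": 0.4, "citations": 0.9, "recency": 0.2},
    "doc_c": {"tfidf": 0.1, "citations": 0.2, "recency": 0.9},
}
weights = {"tfidf": 1.0, "citations": 0.5, "recency": 0.2}

ranking = rank_documents(features, weights)
print(ranking[0][0])  # highest-scoring document
```

In the paper the weights are not hand-set as above but learned from citation behavior with the classifiers described in the Experiments section.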

Features

  • Terms :
    • TF-IDF between Q and D
  • Citations :
    • Number of papers that cited D
    • Number of citations for articles in the venue in which D was published
    • Number of citations the author of D has received (if there are multiple authors, use the one with the highest citation count)
      • A variant using h-index instead of raw citation counts was also explored. An author with h-index h has published h papers each of which has been cited at least h times.
    • PageRank score of D, calculated over the collection's citation network (instead of the usual hyperlink network)
  • Recency:
    • Current year minus the year D was published (intuition: older papers get lower scores)
  • Similar Topics (100 topics from LDA)
    • Cosine similarity between Q and D
    • Cosine similarity between Q and the averaged topic distribution of all documents that cite D
    • Topic citation counts
    • Entropy of D's topic distribution
    • Entropy of the mean topic distribution of documents citing D
  • Social habits
    • TF-IDF between the authors of Q and the authors of D (note that this is just TF-IDF between author lists)
    • TF-IDF between the authors of Q and the authors of articles previously cited by the authors of Q
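The h-index variant mentioned above is straightforward to compute from an author's per-paper citation counts; a minimal sketch (the function name is mine, not from the paper):

```python
def h_index(citation_counts):
    """Largest h such that the author has h papers with at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:   # the i-th most-cited paper still has >= i citations
            h = i
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # -> 4: four papers with at least 4 citations each
```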

Experiments

Feature scores (citation counts, etc.) were log-transformed and scaled to between 0 and 1.
The authors experimented with two classifiers:

  • Logistic regression
  • SVM-MAP
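The preprocessing described above can be sketched as follows. The paper only says "log-transformed and scaled to between 0 and 1", so the specific choices here (log(1 + x) to handle zero counts, then min-max scaling) are one plausible reading, not necessarily the authors' exact transform.

```python
import math

def log_scale(values):
    """Log-transform raw feature scores, then min-max scale to [0, 1]."""
    logged = [math.log(1 + v) for v in values]  # log(1 + v) is defined for zero counts
    lo, hi = min(logged), max(logged)
    if hi == lo:                                 # constant feature: map everything to 0
        return [0.0 for _ in logged]
    return [(x - lo) / (hi - lo) for x in logged]

scaled = log_scale([0, 9, 99, 999])
print(scaled)  # endpoints map to 0.0 and 1.0; all values lie in [0, 1]
```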

Training / Dev / Test split
(Figure: Bjtdtsplit.png)

Results

  • Mean average precision on the dev set for different classifiers (using all features)

The Logistic 50/50 model downsampled the negative examples to match the number of positive examples.
(Figure: Bjdevres.png)

  • Mean average precision on the dev and test sets using different feature sets (model : SVM-MAP)

(Figure: Bjtestres.png)

Feature analysis

Below are the weights of each feature used in the experiments:
(Figure: Bjfeatanalysis.png)