Bethard cikm2010
== Citation ==
* Title : Who Should I Cite? Learning Literature Search Models from Citation Behavior
* Authors : S. Bethard and D. Jurafsky
* Venue : CIKM 2010
== Summary ==
This [[Category::paper]] describes a retrieval model for finding relevant existing work in a collection of scientific articles.
The authors claim that the model is useful when a researcher wants to conduct new research
outside his/her area of expertise and needs to become familiar with prior work in the field.
The model incorporates various text and meta features and uses citation networks to learn the weights of these features.
== Dataset ==
[[UsesDataset::ACL Anthology]] (they used ~11,000 papers)
== Model ==
Documents are ranked based on their scores. <br>
For scoring, they used a linear model between a query Q (a project idea) and a document D (an existing scientific article), as follows :<br><br>
<math>
score(Q,D) = \sum_i w_i \times f_i(Q,D)
</math>
<br>
where the <math>f_i</math> are the feature functions described below and the <math>w_i</math> are weights learned from citation behavior.
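A minimal sketch of this linear scoring model follows; the feature names, weights, and data layout are hypothetical, since the real weights come from the training procedure described under Experiments.

<pre>
# A minimal sketch of the linear scoring model. Per-feature values
# f_i(Q, D) are assumed precomputed into a dict; names and weights
# here are hypothetical, not the paper's learned values.

def score(weights, features):
    # score(Q, D) = sum_i w_i * f_i(Q, D)
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

def rank(weights, candidates):
    # Order candidate documents by descending score.
    return sorted(candidates,
                  key=lambda d: score(weights, d["features"]),
                  reverse=True)

# Hypothetical usage:
weights = {"tfidf": 1.2, "citation_count": 0.8, "recency": 0.3}
docs = [
    {"id": "D1", "features": {"tfidf": 0.9, "citation_count": 0.4, "recency": 0.2}},
    {"id": "D2", "features": {"tfidf": 0.5, "citation_count": 0.9, "recency": 0.7}},
]
print([d["id"] for d in rank(weights, docs)])  # best-scoring document first
</pre>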
== Features ==
To compute the score between a query Q and a document D, they used the following features :
* Terms :
** TF-IDF similarity between Q and D
* Citations :
** Number of papers that cited D
** Number of citations for articles in the venue in which D was published
** Number of citations the author of D has received (if there are multiple authors, use the one with the highest citation count)
*** A variant using the h-index instead of raw citation counts was also explored; an author with h-index h has published h papers, each of which has been cited at least h times (see the h-index sketch after this list)
** PageRank score of D, calculated over the collection's citation network (instead of the hyperlink network)
* Recency :
** Year of Q minus year of D (effect : older papers receive lower scores)
* Cited using similar terms :
** TF-IDF similarity between Q and the text of all documents that cited D
** TF-IDF similarity between Q and a vector built by using PMI to select the important terms other documents used when citing D
* Similar Topics (100 topics from LDA; sketched after this list) :
** Cosine similarity between the topic distributions of Q and D
** Cosine similarity between Q's topic distribution and the averaged topic distributions of all other documents that cited D
** Topic citation count score
*** For each document in the collection, choose its most prominent topic (denoted by T); documents cited by this document are considered to be cited by topic T. Normalize the topic-citation counts for each document to get a probability. For a query Q, choose its most prominent topic (denoted by S); the topic citation score is the probability computed above for document D and topic S.
** Entropy of D's topic distribution
** Entropy of the mean topic distribution of the documents citing D
* Social habits :
** Authors : boost D if it was written by the authors of Q
** Authors-cited-article : boost D if it has been cited by the authors of Q
** Authors-cited-author : boost D if it was written by authors who were cited by the authors of Q
** Authors-cited-venue : boost D if it appeared in a venue that the authors of Q have cited
** Authors-coauthored : boost D if it was written by authors who have co-authored with the authors of Q.
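As referenced above, a minimal sketch of the h-index computation; the function name and input format are illustrative, not from the paper.

<pre>
# A minimal sketch of the h-index. Input is the list of citation
# counts for one author's papers.

def h_index(citation_counts):
    # Largest h such that the author has h papers with >= h citations each.
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for i, c in enumerate(counts, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Papers cited [10, 8, 5, 4, 3] times: 4 papers have >= 4 citations, so h = 4.
assert h_index([10, 8, 5, 4, 3]) == 4
</pre>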
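A minimal sketch of the topic-distribution cosine and entropy features, assuming LDA topic distributions are available as plain lists of probabilities.

<pre>
# A minimal sketch of the cosine-similarity and entropy features over
# LDA topic distributions (plain Python lists of probabilities).
import math

def cosine(p, q):
    # Cosine similarity between two topic distributions.
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def entropy(p):
    # Shannon entropy; low entropy = document concentrated on few topics.
    return -sum(x * math.log(x) for x in p if x > 0)

print(cosine([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))  # close to 1: similar topics
print(entropy([1.0, 0.0, 0.0]))   # 0: fully focused on one topic
print(entropy([1/3, 1/3, 1/3]))   # log(3) ~ 1.10: maximally spread out
</pre>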
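A minimal sketch of the topic citation count score, under one plausible reading of the description above (citation counts normalized within each topic); all names and data structures are illustrative.

<pre>
# A minimal sketch of the topic citation count score; the normalization
# choice (within each topic) is an assumption, and all names here are
# illustrative rather than the paper's.
from collections import defaultdict

def topic_citation_probs(topic_dists, citations):
    # topic_dists: {doc_id: [p_topic_0, ..., p_topic_K-1]} from LDA
    # citations:   {doc_id: [ids of documents it cites]}
    counts = defaultdict(lambda: defaultdict(int))  # counts[topic][cited_doc]
    for doc, dist in topic_dists.items():
        t = max(range(len(dist)), key=dist.__getitem__)  # most prominent topic T
        for cited in citations.get(doc, []):
            counts[t][cited] += 1  # cited "by topic T"
    probs = {}
    for t, per_doc in counts.items():
        total = sum(per_doc.values())
        probs[t] = {d: c / total for d, c in per_doc.items()}
    return probs

def topic_citation_score(query_dist, doc_id, probs):
    # Probability that Q's most prominent topic S cites document D.
    s = max(range(len(query_dist)), key=query_dist.__getitem__)
    return probs.get(s, {}).get(doc_id, 0.0)
</pre>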
== Experiments ==
Feature scores were log-transformed and scaled to the range [0, 1] (see the sketch at the end of this section). <br>
They experimented with two learning methods :
* [[UsesMethod::Logistic_regression]]
* [[UsesMethod::Support_Vector_Machines]]-MAP
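A minimal sketch of the preprocessing step; the exact transform is not spelled out here, so log(1 + x) is an assumption that keeps zero-valued features defined.

<pre>
# A minimal sketch of the feature preprocessing: log-transform, then
# min-max scale to [0, 1]. log1p (an assumption) handles zero values.
import math

def log_scale(values):
    logged = [math.log1p(v) for v in values]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        return [0.0 for _ in logged]
    return [(x - lo) / (hi - lo) for x in logged]

print(log_scale([0, 9, 99, 999]))  # [0.0, ~0.33, ~0.67, 1.0]
</pre>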
== Results ==
* Performance on the dev set for the logistic regression model was worse than for the SVM-MAP model, so they used only SVM-MAP on the test set.
* They ran a (partial) ablation analysis on the test set and showed that the model using all features performed best, as evaluated by mean average precision (see the sketch below).
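For reference, a minimal sketch of mean average precision; here a query's relevant documents would be the papers it actually cited, an assumption consistent with learning from citation behavior.

<pre>
# A minimal sketch of mean average precision (MAP).

def average_precision(ranked, relevant):
    # Mean of precision@k over the ranks k where a relevant doc appears.
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: list of (ranked_doc_ids, set_of_relevant_ids), one per query.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Relevant docs at ranks 1 and 3: AP = (1/1 + 2/3) / 2 ~ 0.83.
print(average_precision(["a", "b", "c"], {"a", "c"}))
</pre>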
== Feature analysis ==
Below are the feature weights from the SVM-MAP model that was used to evaluate the test set. <br>
[[File:bjfeatanalysis.png]]