Bethard cikm2010
This paper describes a retrieval model for finding relevant prior work in a collection of scientific articles. The authors claim that the model is useful when a researcher starts a project outside their area of expertise and needs to become familiar with prior work in the field. The model incorporates a range of text and metadata features and uses the citation network to learn the weights of these features.
Dataset: ACL Anthology (~11,000 papers)
The model is a linear scoring function between a query Q (a project idea) and a document D (an existing scientific article), using the following features:
- Terms:
  - TF-IDF scores between Q and D
- Citations:
  - Number of papers that cite D
  - Number of citations received by articles in the venue where D was published
  - Number of citations the author of D has received (if D has multiple authors, the one with the highest citation count is used)
  - A variant using the h-index instead of raw citation counts was also explored. An author with h-index h has published h papers, each of which has been cited at least h times.
  - PageRank score computed over the citation network (instead of the hyperlink network)
- Recency:
  - Current year minus the publication year of D (intuition: older papers receive lower scores)
- Similar topics (100 topics from LDA):
  - Cosine similarity between the topic distributions of Q and D
  - Cosine similarity between the topic distribution of Q and the averaged topic distribution of all other documents that cited D
- Social habits:
  - TF-IDF between D with ....
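As a rough illustration of the linear scoring model and a few of the features listed above, the sketch below computes an h-index, a paper's age, and a cosine similarity between two topic distributions, then combines feature values with a weighted sum. The function names, feature names, and weight values are hypothetical and chosen only for illustration; in the paper the weights are learned rather than set by hand.

```python
import math


def h_index(citation_counts):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (e.g. topic distributions)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def paper_age(current_year, publication_year):
    """Recency feature: current year minus the publication year of D."""
    return current_year - publication_year


def linear_score(weights, features):
    """Weighted sum of feature values; both arguments are dicts keyed by feature name."""
    return sum(weights[name] * value for name, value in features.items())


# Hypothetical feature values for one (Q, D) pair and hand-picked weights.
features = {
    "topic_cosine": cosine_similarity([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]),
    "author_h_index": h_index([10, 8, 5, 4, 3]),
    "age": paper_age(2010, 2004),
}
weights = {"topic_cosine": 1.0, "author_h_index": 0.1, "age": -0.05}
score = linear_score(weights, features)
```

The negative weight on `age` is an assumption that mirrors the stated intuition that older papers should score lower. Ranking a collection then amounts to sorting documents by `score` for a fixed query.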
Feature scores (citation counts, etc.) were log-transformed and scaled to the range 0 to 1.
Two classifiers were used:
- Logistic regression
- SVM-MAP
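The log transform and 0-to-1 scaling of feature scores mentioned above could look roughly like the following. The summary does not give the exact formula, so the use of log(1 + x) (to handle zero counts) and min-max scaling are assumptions, and `log_scale` is a hypothetical helper name.

```python
import math


def log_scale(values):
    """Apply log(1 + x) to each value, then min-max scale the results to [0, 1]."""
    logged = [math.log1p(v) for v in values]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # All values identical: no spread to scale, so map everything to 0.
        return [0.0 for _ in logged]
    return [(x - lo) / (hi - lo) for x in logged]


# Hypothetical raw citation counts for three documents.
scaled = log_scale([0, 9, 99])
```

The log step compresses the heavy tail of raw counts, so a paper with 99 citations does not dominate one with 9 by a factor of ten after scaling.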
- Mean average precision on the dev set for different classifiers (using all features)
- Mean average precision on the dev and test sets using different feature sets (model: SVM-MAP)
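For reference, mean average precision (the evaluation metric named above) can be computed as in the sketch below, using the standard definition. The function names and the toy data are illustrative; treating the documents that a query should retrieve as a set of relevant IDs is an assumption about the evaluation setup, not a detail stated in this summary.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the ranks k at which a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(queries):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs, one per query."""
    if not queries:
        return 0.0
    total = sum(average_precision(ranked, rel) for ranked, rel in queries)
    return total / len(queries)


# Toy example with two queries and made-up document IDs.
toy_queries = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
toy_map = mean_average_precision(toy_queries)
```

Because the metric rewards placing relevant documents near the top of the ranking, it suits a literature search setting where a reader inspects only the first few results.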