Bbd project abstract

From Cohen Courses

Problem Definition and Dataset

Given a set of papers X, find a set of papers Y related to the papers in X, using the Citeseer graph and text corpora.

Motivation

Citeseer holds a huge set of papers from all areas of computer science. It is used heavily by the research community, and its information access problems are not fully solved. Solving this problem will help students explore a particular area of interest. Instructors will also benefit when designing courses, for example when deciding which papers to list as required reading and which as optional reading for a course topic.

Relevant Superpower

Access to the whole Citeseer dataset :)

Evaluation

I can take a few course webpages (best example: the Information Extraction course website from the last few years) and extract training and test data for related papers from them.
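One way such course data could be scored is sketched below. It assumes, hypothetically, that part of a course's paper list is used as the query set X and the held-out remainder as the relevant set; the choice of precision and recall at k as the metric, and the name `precision_recall_at_k`, are illustrative assumptions and not part of the proposal.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Score a ranked list of suggested paper ids against held-out relevant ids.

    ranked   -- paper ids ordered from most to least related (system output)
    relevant -- set of paper ids held out from the course page (assumed ground truth)
    k        -- how many top suggestions to evaluate
    """
    top = ranked[:k]
    hits = sum(1 for paper_id in top if paper_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Papers missing from a course page are not necessarily unrelated, so precision measured this way would likely underestimate the true quality of the suggestions.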

Techniques that can be used to solve this problem

Text-based and graph-based techniques can be combined to extract related papers from the corpus:

  • Text-based similarity - This should go beyond bag-of-words similarity: the papers should match in the basic concept they discuss, such as the topic of the paper.
   * Topic match - Can be approximated by the text similarity of the abstract sections.
     Identifying the topic of a paper or a set of papers can be another interesting IE problem.
   * Similarity of the "Keywords" sections of papers can also help in deciding which papers are related.
  • Graph-based similarity - We can use random-walk-based graph similarity.
   * Citation pattern match - Papers that cite common papers, and papers that are cited by common papers, can be considered related.
     Citeseer also gives the total number of citations for a candidate paper, which can be a useful feature when estimating
     the importance of the paper.
   * Finding basic papers / survey papers - Basic papers will be cited by all recent papers in the area.
     Survey papers will cite all papers covering competing approaches in the area.
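
The combination of text and graph signals listed above could look roughly like the following sketch. It is only an illustration under stated assumptions: TF-IDF cosine similarity over abstracts stands in for "topic match", a random walk with restart over an undirected citation graph stands in for the graph similarity, and the two scores are mixed linearly with a hypothetical weight `alpha`. All function names, parameters, and defaults here are made up for the example.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Map paper id -> sparse TF-IDF vector built from its abstract text."""
    tokens = {pid: text.lower().split() for pid, text in docs.items()}
    df = Counter(term for toks in tokens.values() for term in set(toks))
    n = len(docs)
    return {
        pid: {t: c * math.log((1 + n) / (1 + df[t])) for t, c in Counter(toks).items()}
        for pid, toks in tokens.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def random_walk_with_restart(edges, seeds, restart=0.15, iters=50):
    """Approximate visit probabilities of a walk that keeps restarting at the seeds.

    Assumption: citation links are treated as undirected. Probability mass
    sitting on a node with no links is dropped at each step.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    start = {s: 1.0 / len(seeds) for s in seeds}
    score = dict(start)
    for _ in range(iters):
        nxt = defaultdict(float)
        for node, mass in score.items():
            for nb in neighbors[node]:
                nxt[nb] += (1 - restart) * mass / len(neighbors[node])
        for s, p in start.items():
            nxt[s] += restart * p
        score = dict(nxt)
    return score


def related_papers(docs, edges, seeds, alpha=0.5, k=3):
    """Rank non-seed papers by alpha * text score + (1 - alpha) * graph score."""
    vecs = tfidf_vectors(docs)
    walk = random_walk_with_restart(edges, seeds)
    candidates = [pid for pid in docs if pid not in seeds]
    # Scale graph scores to [0, 1] so they are comparable with cosine values.
    top = max((walk.get(p, 0.0) for p in candidates), default=0.0) or 1.0
    scores = {}
    for pid in candidates:
        text = max((cosine(vecs[pid], vecs[s]) for s in seeds if s in vecs), default=0.0)
        graph = walk.get(pid, 0.0) / top
        scores[pid] = alpha * text + (1 - alpha) * graph
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Here `docs` maps paper ids to abstract text, `edges` is a list of `(citing, cited)` pairs, and `seeds` is the query set X. A linear mix is the simplest way to combine the two signals; a learned combination, or features such as total citation counts and keyword overlap, would be natural refinements.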

What question to answer

Can we do set entity expansion by combining graph and text data in an intelligent way?

Probable Project Partner

Ramnath Balasubramanyan
