Bbd project abstract
Problem Definition and Dataset
Given a set of papers X, find a set of papers Y that are related to the papers in X, using the CiteSeer citation graph and text corpus.
Motivation
CiteSeer has a huge collection of papers covering all areas of computer science. It is used heavily by the research community, and its information access problems are not fully solved. Solving this problem will help students explore a particular area of interest. Instructors will also benefit when designing courses, since it can help them decide which papers to assign as required reading and which as optional reading for a course topic.
Relevant Superpower
Access to the whole CiteSeer dataset :)
Evaluation
I can take a few course webpages (best example: the Information Extraction course website from the last few years) and get training and test data for related papers from them.
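To make the evaluation idea concrete, here is a minimal sketch of how a ranked list of suggested papers could be scored against the papers listed on a course webpage. The choice of metrics (precision at k and average precision) is an assumption, not something fixed by this abstract, and the paper IDs are made up for illustration.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked papers that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for paper in ranked[:k] if paper in relevant)
    return hits / k


def average_precision(ranked, relevant):
    """Mean of precision@i over the ranks i at which a relevant paper appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for i, paper in enumerate(ranked, start=1):
        if paper in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)


# Hypothetical data: held-out papers from one course page, and a system ranking.
reading_list = {"p1", "p4", "p7"}
system_ranking = ["p1", "p2", "p4", "p9", "p7"]

p_at_3 = precision_at_k(system_ranking, reading_list, 3)
ap = average_precision(system_ranking, reading_list)
```

One possible protocol would be to use part of a course's paper list as the input set X and hold out the rest as the relevant set, then average these metrics over several course pages.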
Techniques that can be used to solve this problem
Both text-based and graph-based techniques can be combined to extract related papers from the corpus:
- Text-based similarity - This should go beyond bag-of-words similarity: the papers should match in the basic concept they discuss, such as the topic of the paper.
  * Topic match - Can be approximated by the text similarity of the abstract sections. Finding the topic of a paper or a set of papers is another interesting IE problem.
  * Keyword match - Similarity of the "Keywords" sections of papers can also help decide which papers are related.
- Graph-based similarity - We can use random-walk-based graph similarity.
  * Citation pattern match - Papers that cite common papers, and papers that are cited by common papers, can be considered related. CiteSeer also gives the total number of citations for a candidate paper, which can be a useful feature when estimating the importance of the paper.
  * Finding basic papers and survey papers - Basic papers will be cited by most recent papers in the area. Survey papers will cite the papers covering the competing approaches in the area.
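As an illustration of the graph-based idea, the sketch below runs a plain random walk with restart over an undirected view of the citation graph, starting from the input papers. It is only one possible instance of random-walk similarity, not the learned walk from the cited work, and the restart probability, iteration count, and toy citation pairs are all assumptions.

```python
from collections import defaultdict


def build_undirected_graph(citations):
    """Turn (citing, cited) pairs into an undirected adjacency map."""
    neighbors = defaultdict(set)
    for citing, cited in citations:
        neighbors[citing].add(cited)
        neighbors[cited].add(citing)
    return neighbors


def random_walk_with_restart(neighbors, seeds, restart=0.15, iterations=50):
    """Approximate visit probabilities of a walk that jumps back to a
    uniformly chosen seed paper with probability `restart` at every step."""
    seeds = [s for s in dict.fromkeys(seeds) if s in neighbors]
    if not seeds:
        return {}
    restart_mass = {node: 0.0 for node in neighbors}
    for s in seeds:
        restart_mass[s] += 1.0 / len(seeds)
    scores = dict(restart_mass)
    for _ in range(iterations):
        nxt = {node: restart * restart_mass[node] for node in neighbors}
        for node, score in scores.items():
            # Spread the non-restart mass evenly over citing and cited papers.
            share = (1.0 - restart) * score / len(neighbors[node])
            for nb in neighbors[node]:
                nxt[nb] += share
        scores = nxt
    return scores


# Hypothetical toy graph: "a" and "b" both cite "c"; "f" and "g" are unrelated.
citations = [("a", "c"), ("b", "c"), ("b", "d"), ("e", "d"), ("f", "g")]
graph = build_undirected_graph(citations)
seed_papers = ["a"]
scores = random_walk_with_restart(graph, seed_papers)
ranked = sorted((p for p in scores if p not in seed_papers),
                key=lambda p: -scores[p])
```

Treating citations as undirected lets the walk reach both papers that share references with a seed and papers that are cited together with it. The per-paper citation count mentioned above could be added as a separate feature rather than folded into the walk.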
What question to answer
Can we do set entity expansion by combining graph and text data in an intelligent way?
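One simple baseline for this question, sketched below under stated assumptions, is to score each candidate paper by a weighted sum of a text score and a graph score. Here the text score is the mean TF-IDF cosine similarity between a candidate's abstract and the abstracts of the input papers. The graph scores are assumed to come from any graph-based method, and the weight `alpha`, the function names, and the toy abstracts are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Map each paper id to a sparse TF-IDF vector (term -> weight)."""
    tokens = {pid: tokenize(text) for pid, text in docs.items()}
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))
    n = len(docs)
    return {
        pid: {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
              for t, c in Counter(toks).items()}
        for pid, toks in tokens.items()
    }


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm > 0 else 0.0


def score_by_text(abstracts, seeds):
    """Mean cosine similarity of each non-seed abstract to the seed abstracts."""
    vectors = tfidf_vectors(abstracts)
    seed_ids = [s for s in dict.fromkeys(seeds) if s in vectors]
    scores = {}
    for pid, vec in vectors.items():
        if pid in seed_ids:
            continue
        sims = [cosine(vec, vectors[s]) for s in seed_ids]
        scores[pid] = sum(sims) / len(sims) if sims else 0.0
    return scores


def combine_scores(text_scores, graph_scores, alpha=0.5):
    """Weighted sum of the text score and the max-normalized graph score."""
    top = max(graph_scores.values(), default=0.0)
    combined = {}
    for pid, t in text_scores.items():
        g = graph_scores.get(pid, 0.0) / top if top > 0 else 0.0
        combined[pid] = alpha * t + (1.0 - alpha) * g
    return combined


# Hypothetical abstracts and graph scores, for illustration only.
abstracts = {
    "x1": "conditional random fields for information extraction from text",
    "c1": "named entity extraction with conditional random fields",
    "c2": "garbage collection in virtual machines",
}
graph_scores = {"c1": 0.2, "c2": 0.25}

text = score_by_text(abstracts, seeds=["x1"])
combined = combine_scores(text, graph_scores, alpha=0.5)
```

A fixed `alpha` is the crudest possible combination. With training data from course pages, the weight could instead be tuned or learned, which is closer to combining the two sources "in an intelligent way".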
Probable Project Partner
References
- Einat Minkov and William W. Cohen, Learning Graph Walk Based Similarity Measures for Parsed Text, EMNLP-2008.
- Dror G. Feitelson et al., Predictive ranking of computer scientists using CiteSeer data, Journal of Documentation 2004.
- Richard Wang and William W. Cohen, Iterative Set Expansion of Named Entities Using the Web, ICDM-2008.
- Richard Wang and William W. Cohen, Language-Independent Set Expansion of Named Entities using the Web, ICDM-2007.
- Oren Etzioni et al., Unsupervised Named-Entity Extraction from the Web: An Experimental Study.