Rbalasub project abstract
Tag Extractor - Liberating information from academic publications since 2009
Surveying a new sub-field is a frequent requirement for researchers. With the proliferation of focused conferences, large bodies of literature can be found for almost any subtopic at all levels of granularity. To get up to date with the latest literature in a field, researchers depend on the expertise of collaborators, broad survey papers, or, more commonly, searches on paper repositories. While this approach works well, it is a low-recall process. We propose an automatic paper recommender system that analyzes large collections of conference proceedings, journals, etc., extracts keywords from them, and builds a graph of publications with edges representing a "BUILDS-ON" relation. The system will also build profiles for authors. With the help of publication dates, it will also be possible to track trends over time.
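As a rough illustration of the intended structure (not part of the proposal itself), the sketch below stores publications as nodes and the "BUILDS-ON" relation as directed edges; the choice of networkx and the node/edge attribute names are illustrative assumptions.

```python
import networkx as nx

# Minimal sketch of the intended publication graph; library choice and
# attribute names are illustrative, not part of the system design.
G = nx.DiGraph()

# Nodes are publications, annotated with extracted keywords and year.
G.add_node("paper_A", keywords=["topic models"], year=2003)
G.add_node("paper_B", keywords=["supervised LDA"], year=2007)

# A directed edge encodes the BUILDS-ON relation: paper_B builds on paper_A.
G.add_edge("paper_B", "paper_A", relation="BUILDS-ON")

# Author profiles can be aggregated from the keywords of an author's papers.
author_profile = sorted({
    kw for paper in ["paper_A", "paper_B"]
    for kw in G.nodes[paper]["keywords"]})

# Publication years on the nodes allow tracking keyword trends over time.
print(author_profile)
```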
In this project, we focus on the first step of building the system: extracting keywords from academic publications. We use the Citeseer dataset for our experiments. This dataset has proceedings from major computer science conferences and journals spanning several years. Author-supplied keywords are used to evaluate the quality of the extracted keywords.
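As a sketch of this evaluation step, assuming both the extracted and the author-supplied keywords are available as phrase lists (exact string matching is an assumption; looser matching may be needed in practice):

```python
def keyword_precision_recall(extracted, gold):
    """Exact-match precision/recall of extracted keywords against
    author-supplied keywords (both given as iterables of phrases)."""
    extracted = {k.lower().strip() for k in extracted}
    gold = {k.lower().strip() for k in gold}
    if not extracted or not gold:
        return 0.0, 0.0
    tp = len(extracted & gold)
    return tp / len(extracted), tp / len(gold)

# Example: one paper's extracted keywords vs. its author-supplied keywords.
p, r = keyword_precision_recall(
    ["conditional random fields", "named entity recognition"],
    ["Named Entity Recognition", "information extraction"])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.50
```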
As a first step, the section of the paper that describes the authors' contributions is identified. Typically, this section is sandwiched between the introduction and experiments sections. Hand-crafted rules are used to perform this step, although they can be replaced with more sophisticated methods in the future in case the rules become too complicated (a sketch of the rule-based step follows the list below). Once the relevant section is identified, we experiment with multiple methods to extract keywords. The keywords to be extracted are of several varieties. They can represent
- a broad field e.g. Information Extraction
- a specific recognized subtask e.g. Named Entity Recognition
- a specific model e.g. Conditional Random Fields
- an algorithm e.g. Conjugate Gradient Descent
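The following is a minimal sketch of the hand-crafted rules mentioned above for locating the contribution section, assuming plain-text papers with recognizable section headings; the heading patterns themselves are illustrative assumptions.

```python
import re

# Illustrative hand-crafted rules: grab the text between the introduction
# heading and the experiments/evaluation heading. The heading patterns are
# assumptions about how sections are titled in the plain-text papers.
INTRO = re.compile(r"^\s*\d*\.?\s*introduction\b", re.IGNORECASE | re.MULTILINE)
EXPER = re.compile(r"^\s*\d*\.?\s*(experiments?|evaluation|results)\b",
                   re.IGNORECASE | re.MULTILINE)

def contribution_section(paper_text):
    """Return the text sandwiched between the introduction and the
    experiments section, or None if either heading is not found."""
    start = INTRO.search(paper_text)
    end = EXPER.search(paper_text, start.end()) if start else None
    if not (start and end):
        return None
    return paper_text[start.end():end.start()].strip()
```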
The first approach is to use sequence models such as CRFs, or a variant thereof, on the titles of papers to identify specific ("deep") keywords. To extract more general keywords, for instance the phrase "topic models" from a paper describing an extension to LDA, we propose using ranking methods to identify key sentences and running dependency parsers on them to extract the head nouns. The third method we plan to experiment with is set expansion, such as the approach used by SEAL (developed by Richard Wang and William Cohen). We will use a seed set of such tags to learn contexts from which new tags can be extracted. Since citation information is available in the corpus, we also plan to use author information as an additional signal for discovering contexts.
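For the second approach, here is a minimal sketch of head-noun extraction from a key sentence, using spaCy's dependency parse purely as an example; the choice of spaCy and the en_core_web_sm model are our assumptions, not part of the proposal.

```python
import spacy

# Assumes the en_core_web_sm model is installed; any dependency parser
# that exposes noun chunks with syntactic heads would work equally well.
nlp = spacy.load("en_core_web_sm")

def head_nouns(sentence):
    """Return the head noun of each (non-pronoun) noun chunk in a key
    sentence, e.g. 'models' from 'an extension to topic models'."""
    doc = nlp(sentence)
    return [chunk.root.text for chunk in doc.noun_chunks
            if chunk.root.pos_ != "PRON"]

print(head_nouns("We propose an extension to topic models for citation data."))
# e.g. ['extension', 'models', 'data']
```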