Bbd and rbalasub - project status report
Dataset
We are using data from CiteSeer for our project. The dataset contains the entire database that is the backbone of CiteSeer and CiteULike. Since our primary focus is on machine learning papers, we extracted ~10K papers from the database that come from machine-learning-related conferences.
A seed set of keywords about machine learning concepts and entities (hereafter referred to as the tag dictionary, TD) was also extracted using SEAL (Wang and Cohen). The TD has ~500 entities. Since we do not have any papers that have been hand-tagged with entities from this set, a simple keyword-matching process is used to assign tags to papers: a paper is tagged with every keyword from the TD that occurs in it. In our sample, we have 150,000 matches, which is approximately 10-15 keyword matches per publication.
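The keyword match described above can be sketched in a few lines of Python. This is only an illustration, not our Perl scripts: the function names (compile_dictionary, tag_paper) and the sample TD entries are hypothetical, and the sketch assumes case-insensitive, whole-phrase matching with whitespace collapsed, which the actual scripts may handle differently.

```python
import re

def compile_dictionary(tag_dictionary):
    """Compile one whole-phrase pattern per TD entry (hypothetical helper)."""
    compiled = {}
    for entry in tag_dictionary:
        normalized = " ".join(entry.lower().split())
        # (?<!\w) and (?!\w) keep "em" from matching inside "them".
        compiled[entry] = re.compile(r"(?<!\w)" + re.escape(normalized) + r"(?!\w)")
    return compiled

def tag_paper(text, compiled):
    """Return every TD entry that occurs at least once in the paper text."""
    # Lowercase and collapse whitespace so phrases split across lines still match.
    normalized_text = " ".join(text.lower().split())
    return {entry for entry, pattern in compiled.items()
            if pattern.search(normalized_text)}

if __name__ == "__main__":
    td = compile_dictionary(["support vector machine", "kernel", "EM"])
    paper = "We train a Support Vector\nMachine with an RBF kernel."
    print(sorted(tag_paper(paper, td)))
```

Each entry is searched independently, so overlapping entries (for example a phrase and a shorter phrase contained in it) are both assigned, in line with tagging a paper with all matching keywords.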
Plan
The project plan is still to associate publications with keywords that function like tags. Having a richer representation for publications will help in downstream tasks like organizing sets of publications thematically to construct a personalized recommender.
We will use two approaches to the tagging task:
1. Graph-based approaches using random walks with restarts
A graph is constructed with nodes for every paper, author, tag and word. To obtain a list of entities for a paper, a random walk is initiated from the paper's node and terminates at nodes of type tag. We use GHIRL for constructing the graph. So far we have done the following:
- Used the SEAL tool to extract entities, tuning the seeds and SEAL parameters to obtain ML-related entities.
- Wrote Perl scripts to match these entities in the text of the papers.
- Wrote Perl scripts to generate the GHIRL graph.
- Ran limited experiments with the GHIRL query language to get hands-on experience with how the system works.
The main hurdle in using GHIRL so far is that we have no experience in training GHIRL to rerank results or adjust path weights, and documentation on this is limited.
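To make the random-walk idea concrete, here is a small Python sketch of a random walk with restart computed by power iteration. It does not use GHIRL or its query language. The node-name prefixes (paper:, tag:, word:), the restart probability of 0.15, the unweighted undirected edges, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency list from (node, node) pairs."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph

def random_walk_with_restart(graph, start, restart=0.15, iterations=50):
    """Approximate visit probabilities of a walk that jumps back to
    `start` with probability `restart` at every step."""
    if start not in graph:
        raise KeyError(start)
    scores = {node: 0.0 for node in graph}
    scores[start] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            if mass == 0.0:
                continue
            # Spread the non-restart mass evenly over the neighbours.
            share = (1.0 - restart) * mass / len(graph[node])
            for neighbour in graph[node]:
                updated[neighbour] += share
        updated[start] += restart
        scores = updated
    return scores

def rank_tags(graph, paper, restart=0.15, top_k=5):
    """Keep only tag nodes and sort them by walk score, highest first."""
    scores = random_walk_with_restart(graph, paper, restart)
    tags = [(n, s) for n, s in scores.items() if n.startswith("tag:")]
    tags.sort(key=lambda pair: (-pair[1], pair[0]))
    return tags[:top_k]

if __name__ == "__main__":
    graph = build_graph([
        ("paper:p1", "word:kernel"), ("paper:p1", "word:margin"),
        ("paper:p2", "word:kernel"), ("paper:p2", "word:margin"),
        ("paper:p2", "tag:svm"),
        ("paper:p3", "word:kernel"), ("paper:p3", "word:markov"),
        ("paper:p3", "tag:hmm"),
    ])
    for tag, score in rank_tags(graph, "paper:p1"):
        print(tag, round(score, 4))
```

In this toy graph, paper p1 has no tag edges of its own, so any tag score it receives comes through shared words with tagged papers. Learned edge weights or reranking, which we have not yet explored in GHIRL, are not modeled here.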
2. Topic Models
We propose two variants of topic models:
- Entity-topic models - based on Statistical Entity-Topic models by Newman, Smyth et al. In this topic model, every document emits entities in addition to words. Running this model on the corpus will return an entity distribution for each topic. There is no straightforward way to infer new entities from this model since the number of entities has to be predetermined. We can, however, sample from the entity distribution associated with the topics assigned to the document.
- Labeled LDA - based on the approach in Ramage, Nallapati et al. (EMNLP 2009).
We are currently implementing the Labeled LDA model using Gibbs sampling for inference. The code is written in C++ and uses routines from HBC (Hierarchical Bayes Compiler, by Hal Daumé). The topic model is an extension of LDA in which the topic distribution sampled for every document is influenced by the tags (here, ML entities) associated with the document. For documents with no tags, we run inference to predict possible tags.
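As a rough illustration of how tags can constrain the sampler, the following Python sketch runs collapsed Gibbs sampling with one topic per tag. It is not our C++/HBC code. It assumes a one-to-one mapping between topics and tags, symmetric hyperparameters alpha and beta with arbitrary values, and a simplification in which untagged documents are sampled jointly with tagged ones over all tags. The function names are hypothetical.

```python
import random
from collections import defaultdict

def labeled_lda_gibbs(docs, doc_tags, all_tags,
                      alpha=0.1, beta=0.01, sweeps=100, seed=0):
    """docs: list of token lists; doc_tags: list of tag sets (empty = untagged).
    Returns one {tag: token count} dict per document."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Tagged documents may only use their own tags; untagged ones may use any.
    allowed = [sorted(tags) if tags else sorted(all_tags) for tags in doc_tags]

    doc_topic = [defaultdict(int) for _ in docs]        # tokens in doc d on tag t
    topic_word = defaultdict(lambda: defaultdict(int))  # tokens of word w on tag t
    topic_total = defaultdict(int)                      # all tokens on tag t
    assignment = []

    def update(d, w, t, delta):
        doc_topic[d][t] += delta
        topic_word[t][w] += delta
        topic_total[t] += delta

    # Random initialisation within each document's allowed tags.
    for d, doc in enumerate(docs):
        assignment.append([rng.choice(allowed[d]) for _ in doc])
        for w, t in zip(doc, assignment[d]):
            update(d, w, t, +1)

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assignment[d][i], -1)  # drop current assignment
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in allowed[d]
                ]
                new_t = rng.choices(allowed[d], weights=weights)[0]
                assignment[d][i] = new_t
                update(d, w, new_t, +1)

    return [{t: n for t, n in counts.items() if n > 0} for counts in doc_topic]

def predict_tags(topic_counts, top_k=3):
    """Rank a document's tags by how many of its tokens they explain."""
    ranked = sorted(topic_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:top_k]]
```

For an untagged document, predict_tags applied to its entry in the returned list gives candidate tags. On a corpus as small as a toy example the sampler can settle in different modes, so the ranking should be read as one sample rather than a stable prediction.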