Difference between revisions of "Github Repo Recommendation:Topic Model meets Code"

From Cohen Courses

Related Work

Using topic models to improve collaborative filtering has been investigated in the following paper:
 
* C Wang and D Blei,  "[http://www.cs.princeton.edu/~chongw/papers/WangBlei2011.pdf Collaborative Topic Modeling for Recommending Scientific Articles]", in KDD 2011
 
The following papers describe applying topic models to code:
 
* E Linstead, P Rigor, S Bajracharya, C Lopes, and P Baldi, "[http://dl.acm.org/citation.cfm?id=1321631.1321709 Mining concepts from code with probabilistic topic models]", ASE '07 Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering
 
* K Tian, M Revelle, and D Poshyvanyk, "[http://www.cs.wm.edu/~ktian/pub/MSR2009_Tian.pdf Using latent dirichlet allocation for automatic categorization of software]", MSR '09 Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories

Revision as of 23:23, 8 October 2012

Task

Item recommendation

Overview

Github is a social networking site for programmers, where they can host source code repositories (also called repos). Users can watch repos they are interested in; a user who watches a repo receives status updates on its activity, such as commits and tagging.

In 2009, Github hosted a recommendation contest, where the objective was to recommend repositories to users. The dataset contained 56K users, 120K repositories, and 440K user-watches-repo relationships between them.

Collaborative filtering typically relies on some form of matrix factorization and ignores item content (in this case, the source code in the repos). In this project, we instead propose to incorporate the latent topics of each repo's source code to improve predictions.
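One simple way to picture how topics could enter the prediction is sketched below. It assumes that some topic model has already produced a vector of topic proportions for each repo, and it blends a content score with a collaborative-filtering score. The function names, the blending weight alpha, and the toy topic vectors are hypothetical; this is not the model proposed here, only an illustration of the general idea.

```python
def user_topic_profile(watched_repos, repo_topics):
    """Average the topic proportions of the repos a user watches."""
    n_topics = len(next(iter(repo_topics.values())))
    profile = [0.0] * n_topics
    for repo in watched_repos:
        for t, weight in enumerate(repo_topics[repo]):
            profile[t] += weight
    count = len(watched_repos)
    return [p / count for p in profile] if count else profile


def content_score(profile, topics):
    """Inner product between a user's topic profile and a repo's topics."""
    return sum(p * t for p, t in zip(profile, topics))


def blended_score(cf_score, profile, topics, alpha=0.5):
    """Weighted mix of a collaborative-filtering score and a content score.

    alpha is a hypothetical tuning knob: 1.0 means pure collaborative
    filtering, 0.0 means pure content.
    """
    return alpha * cf_score + (1 - alpha) * content_score(profile, topics)


# Toy topic proportions (topic 0 = "web", topic 1 = "machine learning").
# These numbers are made up for illustration.
repo_topics = {
    "web-a": [0.9, 0.1],
    "web-b": [0.8, 0.2],
    "ml-a": [0.1, 0.9],
    "web-new": [0.85, 0.15],  # a repo nobody watches yet
}

profile = user_topic_profile(["web-a", "web-b"], repo_topics)
for repo in ("web-new", "ml-a"):
    # cf_score is 0.0 here to mimic a repo with no watch history.
    print(repo, blended_score(0.0, profile, repo_topics[repo]))
```

Because the content score does not depend on watch history, a repo with no watchers can still be ranked, which a purely collaborative method cannot do.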

Team

Naoki Orii

Datasets

Baseline Method

A collaborative-filtering-based method, such as SVD (singular value decomposition).
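As a rough illustration of such a baseline, the sketch below factorizes the user-watches-repo matrix with stochastic gradient descent rather than an exact SVD, using only the Python standard library. The function names, the hyperparameters, the negative-sampling scheme, and the toy data are all assumptions made for illustration, not part of the contest setup.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def train_mf(watches, n_users, n_repos, k=4, epochs=200, lr=0.05, reg=0.01, seed=0):
    """Fit user and repo factors so dot(U[u], V[r]) is near 1 for watched pairs.

    Watch data is implicit (there are no explicit ratings), so each observed
    pair is matched with one randomly sampled unwatched repo with target 0.
    """
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_repos)]
    watched = set(watches)
    n_watched = [0] * n_users
    for u, _ in watched:
        n_watched[u] += 1

    def sgd_step(u, r, target):
        # Squared-error gradient step with L2 regularization.
        err = target - dot(U[u], V[r])
        for f in range(k):
            uf, vf = U[u][f], V[r][f]
            U[u][f] += lr * (err * vf - reg * uf)
            V[r][f] += lr * (err * uf - reg * vf)

    for _ in range(epochs):
        for u, r in watches:
            sgd_step(u, r, 1.0)
            if n_watched[u] < n_repos:  # an unwatched repo exists for this user
                neg = rng.randrange(n_repos)
                while (u, neg) in watched:
                    neg = rng.randrange(n_repos)
                sgd_step(u, neg, 0.0)
    return U, V


def recommend(u, U, V, watched, top_n=3):
    """Rank the repos user u is not yet watching by predicted score."""
    candidates = [r for r in range(len(V)) if (u, r) not in watched]
    return sorted(candidates, key=lambda r: dot(U[u], V[r]), reverse=True)[:top_n]


# Toy data: users 0-2 watch repos 0-2, users 3-5 watch repos 3-5.
toy_watches = [(u, r) for u in range(3) for r in range(3)]
toy_watches += [(u, r) for u in range(3, 6) for r in range(3, 6)]
U, V = train_mf(toy_watches, n_users=6, n_repos=6)
print(recommend(1, U, V, set(toy_watches)))
```

On the full contest data a sparse SVD or a dedicated recommender library would be the more practical choice; the loop above is only meant to make the idea concrete.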

Challenges

  • Topic modeling on all 120K repos may be too expensive, so we may restrict prediction to a subset of the data
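A minimal sketch of one way to take such a subset follows. It assumes the watch data is available as (user, repo) pairs and draws a uniform random sample of repos; the function name, the fixed seed, and the choice of uniform sampling are illustrative assumptions, and other schemes (for example, keeping only the most-watched repos) would be equally plausible.

```python
import random


def subsample_repos(watches, n_keep, seed=0):
    """Keep a uniform random sample of repos and the watches that point at them."""
    repos = sorted({repo for _, repo in watches})
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = set(rng.sample(repos, min(n_keep, len(repos))))
    kept_watches = [(user, repo) for user, repo in watches if repo in kept]
    return kept, kept_watches


# Toy (user, repo) pairs, made up for illustration.
example_watches = [(0, 10), (0, 11), (1, 11), (2, 12), (3, 13)]
kept, kept_watches = subsample_repos(example_watches, n_keep=2)
print(kept, kept_watches)
```

Note that users whose watched repos all fall outside the sample disappear from the reduced data, so evaluation on the subset covers fewer users than the full contest set.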

More info about the Github contest

More information about the Github contest is available below:
