Github Repo Recommendation:Topic Model meets Code
Revision as of 22:18, 8 October 2012
Task
Item recommendation
Overview
Github is a social network site for programmers, where they can host source code repositories (also called repos). Users can watch repos they are interested in; a user who is watching a repo receives status updates on its activity (such as commits and tags).
In 2009, Github hosted a recommendation contest, where the objective was to recommend repositories to users. The dataset contained 56K users, 120K repositories, and 440K user-watches-repo relationships between them.
Collaborative filtering typically relies on some form of matrix factorization and ignores item content (in this case, the repos' source code). In this project, we propose to incorporate the inherent topics of each repo's source code to improve predictions.
Team
Datasets
- The dataset from the Github contest is available at https://github.s3.amazonaws.com/data/download.zip
Baseline Method
A collaborative filtering method, such as SVD-based matrix factorization, that uses only the user-watches-repo relationships.
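As a rough illustration of what an SVD baseline could look like, the sketch below builds a binary user-by-repo watch matrix and scores unseen repos from a truncated SVD. The function names (svd_scores, recommend), the rank, and the toy watch list are assumptions made for this example, not part of the contest or the project. It uses a dense NumPy matrix, which is only practical for toy data; the full 56K x 120K matrix would need a sparse solver.

```python
import numpy as np


def svd_scores(watches, n_users, n_repos, rank=2):
    """Return a dense (n_users, n_repos) score matrix from a truncated SVD.

    `watches` is a list of (user_index, repo_index) pairs (hypothetical format).
    """
    # Binary matrix: 1.0 where the user watches the repo, 0.0 elsewhere.
    R = np.zeros((n_users, n_repos))
    for user, repo in watches:
        R[user, repo] = 1.0

    # Full SVD, then keep only the `rank` largest singular values.
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    k = min(rank, len(s))
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


def recommend(scores, watches, user, top_n=3):
    """Return up to `top_n` repo indices the user is not already watching."""
    seen = {repo for u, repo in watches if u == user}
    ranked = np.argsort(-scores[user])  # highest score first
    return [int(repo) for repo in ranked if int(repo) not in seen][:top_n]


# Toy data (made up): 4 users, 4 repos.
watches = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 2), (2, 2), (2, 3), (3, 3)]
scores = svd_scores(watches, n_users=4, n_repos=4, rank=2)
print(recommend(scores, watches, user=0, top_n=1))
```

The rank controls how much the reconstruction generalizes: at full rank the scores reproduce the watch matrix exactly and carry no signal about unseen repos, so a baseline would normally use a rank well below the matrix dimensions.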
Challenges
- Topic modeling on all 120K repos may be too expensive, so we may restrict prediction to a subset of the data.
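One possible way to pick such a subset is to keep only the most-watched repos and the watch relationships that involve them. This selection rule is an assumption made for illustration; the proposal does not say how the subset would be chosen. The function name subset_by_popular_repos and the (user, repo) pair format are likewise hypothetical.

```python
from collections import Counter


def subset_by_popular_repos(watches, max_repos):
    """Keep only watch pairs whose repo is among the `max_repos` most watched.

    `watches` is a list of (user_id, repo_id) pairs (hypothetical format).
    """
    # Count how many users watch each repo.
    counts = Counter(repo for _, repo in watches)
    # Repo ids of the most-watched repos.
    keep = {repo for repo, _ in counts.most_common(max_repos)}
    # Preserve the original order of the surviving pairs.
    return [(user, repo) for user, repo in watches if repo in keep]


# Toy data (made up): repo 10 has 3 watchers, repo 20 has 2, repo 30 has 1.
toy_watches = [(1, 10), (2, 10), (3, 10), (1, 20), (2, 20), (3, 30)]
small = subset_by_popular_repos(toy_watches, max_repos=2)
print(len(small), "of", len(toy_watches), "watch pairs kept")
```

Filtering by popularity biases the subset toward repos with many watchers, so results on it may not carry over to the long tail of rarely watched repos.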
More info about the Github contest
More information about the Github contest is available below:
- https://github.com/blog/466-the-2009-github-contest
- https://github.com/blog/481-about-the-github-contest
Related Work
- C Wang and D Blei, "Collaborative Topic Modeling for Recommending Scientific Articles", in KDD 2011
- E Linstead, P Rigor, S Bajracharya, C Lopes, and P Baldi, "Mining concepts from code with probabilistic topic models", ASE '07 Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering
- K Tian, M Revelle, and D Poshyvanyk, "Using Latent Dirichlet Allocation for automatic categorization of software", MSR '09 Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories