Difference between revisions of "Github Repo Recommendation:Topic Model meets Code"

From Cohen Courses
(9 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 +
== Comments ==
 +
 +
Looks like a nice, well-defined project. It's not immediately obvious to me how you plan to do the content-based recommendation.  Are you going to use Blei's CTR? or is there some other approach you want to use?
 +
 +
Suggestions: the GraphLab matrix factorization code has been pretty widely used, and is fairly scalable
 +
 +
Good luck! --[[User:Wcohen|Wcohen]] 15:20, 10 October 2012 (UTC)
 +
 
==Task==
 
==Task==
 
Item recommendation
 
Item recommendation
Line 11: Line 19:
  
 
==Team==
 
==Team==
[[User:Norii|Naoki Orii]]
+
[[User:Norii|Naoki Orii]] and I'm looking for teammates
  
 
==Datasets==
 
==Datasets==
 
* The dataset from the Github contest is available at https://github.s3.amazonaws.com/data/download.zip
 
* The dataset from the Github contest is available at https://github.s3.amazonaws.com/data/download.zip
  
==Baseline Method==
+
==Methodology==
 +
 
 +
===Baseline method===
 +
 
 
A collaborative-filtering based method, such as SVD
 
A collaborative-filtering based method, such as SVD
 +
 +
===Collaborative Topic Modeling===
 +
 +
We will use the Collaborative Topic Regression (CTR) model from Wang and Blei's "[http://www.cs.cmu.edu/~chongw/papers/WangBlei2011.pdf Collaborative Topic Modeling for Recommending Scientific Articles]". As an initial approach, we will concatenate all files in a repository and treat the result as a single document.
 +
 +
Some possible questions worth addressing:
 +
* How should we preprocess the code so as to improve the resulting topics?
 +
* What are the effects of including or removing code comments and README files, which can contain useful information about the repository content? For example, the [http://github.com/rails/rails/blob/master/README.rdoc README file from an example repo] reads: ''Rails is a web-application framework that includes everything needed to create database-backed web applications according to the Model-View-Controller (MVC) pattern...''
 +
* In order to reduce computational costs, can we learn topics from a subset of the documents? If yes, how can we collect sample documents that are representative of the original distribution?
 +
* Instead of simply combining all files in a repository and treating them as a single document, can we model the corpus more accurately as a collection of collections of documents (each repository being a collection of files)?
  
 
==Challenges==
 
==Challenges==
Line 28: Line 49:
  
 
==Related Work==
 
==Related Work==
* C Wang and D Blei,  "[http://www.cs.princeton.edu/~chongw/papers/WangBlei2011.pdf Collaborative Topic Modeling for Recommending Scientific Articles]", in KDD 2011
+
Using topic models to improve collaborative filtering has been investigated in the following paper:
 +
* C Wang and D Blei,  "[http://www.cs.cmu.edu/~chongw/papers/WangBlei2011.pdf Collaborative Topic Modeling for Recommending Scientific Articles]", in KDD 2011
 +
 
 +
The following papers describe applying topic models to code:
 
* E Linstead, P Rigor, S Bajracharya, C Lopes, and P Baldi, "[http://dl.acm.org/citation.cfm?id=1321631.1321709 Mining concepts from code with probabilistic topic models]", ASE '07 Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering
 
* E Linstead, P Rigor, S Bajracharya, C Lopes, and P Baldi, "[http://dl.acm.org/citation.cfm?id=1321631.1321709 Mining concepts from code with probabilistic topic models]", ASE '07 Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering
 
* K Tian, M Revelle, and D Poshyvanyk, "[http://www.cs.wm.edu/~ktian/pub/MSR2009_Tian.pdf Using latent dirichlet allocation for automatic categorization of software]", MSR '09 Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories
 
* K Tian, M Revelle, and D Poshyvanyk, "[http://www.cs.wm.edu/~ktian/pub/MSR2009_Tian.pdf Using latent dirichlet allocation for automatic categorization of software]", MSR '09 Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories

Latest revision as of 09:14, 16 October 2012

Comments

Looks like a nice, well-defined project. It's not immediately obvious to me how you plan to do the content-based recommendation. Are you going to use Blei's CTR? or is there some other approach you want to use?

Suggestions: the GraphLab matrix factorization code has been pretty widely used, and is fairly scalable

Good luck! --Wcohen 15:20, 10 October 2012 (UTC)

Task

Item recommendation

Overview

Github is a social network site for programmers where they can host source code repositories (also called repos). Users can watch repos they are interested in; a user watching a repo receives status updates on its activity (such as commits and tagging).

In 2009, Github hosted a recommendation contest, where the objective was to recommend repositories to users. The dataset contained 56K users, 120K repositories, and 440K user-watches-repo relationships between them.

Collaborative filtering typically relies on some form of matrix factorization and ignores content (in this case, the source code in the repos). In this project, we instead propose to incorporate topics inferred from each repo's source code to improve predictions.

Team

Naoki Orii (looking for teammates)

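Datasets

  • The dataset from the Github contest is available at https://github.s3.amazonaws.com/data/download.zip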

Methodology

Baseline method

A collaborative filtering method, such as SVD.
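
As a rough illustration of what such a baseline could look like, the sketch below factorizes a tiny, made-up binary user-watches-repo matrix with a truncated SVD and ranks each user's unwatched repos by reconstructed score. The toy matrix, the rank `k`, and the function name `svd_recommend` are assumptions for illustration only; they are not the contest data or a tuned configuration.

```python
import numpy as np

def svd_recommend(watch, k, top_n):
    """Rank unwatched repos per user from a rank-k SVD reconstruction."""
    # Truncated SVD: keep only the k largest singular values.
    u, s, vt = np.linalg.svd(watch, full_matrices=False)
    scores = (u[:, :k] * s[:k]) @ vt[:k, :]
    # Never recommend repos the user already watches.
    scores = np.where(watch > 0, -np.inf, scores)
    # Highest-scoring repos first.
    return np.argsort(-scores, axis=1)[:, :top_n]

# Hypothetical toy data: rows are users, columns are repos, 1 = watches.
watch = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

recs = svd_recommend(watch, k=2, top_n=1)
```

A dense SVD like this would not scale to the full 56K-by-120K matrix; a sparse or iterative factorization would be needed there.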

Collaborative Topic Modeling

We will use the Collaborative Topic Regression (CTR) model from Wang and Blei's "Collaborative Topic Modeling for Recommending Scientific Articles". As an initial approach, we will concatenate all files in a repository and treat the result as a single document.
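
For reference, the CTR generative model can be summarized as follows, with repos standing in for the paper's articles. Here i indexes users, j indexes repos, K is the number of topics, and r_ij indicates whether user i watches repo j. The mapping of repos to articles is this project's assumption; the hyperparameters are those of the original model.

```latex
\begin{align*}
u_i &\sim \mathcal{N}\left(0,\ \lambda_u^{-1} I_K\right)
  && \text{latent vector of user } i \\
\theta_j &\sim \mathrm{Dirichlet}(\alpha)
  && \text{topic proportions of repo } j \\
\epsilon_j &\sim \mathcal{N}\left(0,\ \lambda_v^{-1} I_K\right), \quad
  v_j = \theta_j + \epsilon_j
  && \text{latent vector of repo } j \\
r_{ij} &\sim \mathcal{N}\left(u_i^\top v_j,\ c_{ij}^{-1}\right)
  && \text{rating, with confidence } c_{ij}
\end{align*}
```

For a repo that already has watchers, the predicted score is u_i^T (theta_j + epsilon_j); for a repo with no watchers yet, it reduces to u_i^T theta_j, which depends only on the repo's content.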

Some possible questions worth addressing:

  • How should we preprocess the code so as to improve the resulting topics?
  • What are the effects of including or removing code comments and README files, which can contain useful information about the repository content? For example, the README file from an example repo reads: Rails is a web-application framework that includes everything needed to create database-backed web applications according to the Model-View-Controller (MVC) pattern...
  • To reduce computational costs, can we learn topics from a subset of the documents? If so, how can we collect sample documents that are representative of the original distribution?
  • Instead of simply combining all files in a repository and treating them as a single document, can we model the corpus more accurately as a collection of collections of documents (each repository being a collection of files)?
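
One possible shape for the preprocessing questions above is sketched here: all files in a repo are merged into a single bag of words, identifiers are split on camelCase and underscores, and comments can be switched on or off to compare the resulting topics. The comment syntax (`#` and `//` line comments only), the toy repo, and the function names are assumptions for illustration; real repos would need per-language handling.

```python
import re
from collections import Counter

# Assumption: only '#' and '//' line comments are recognized.
LINE_COMMENT = re.compile(r"(#|//).*")
# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")
WORD = re.compile(r"[A-Za-z]+")

def tokenize(text, keep_comments=True):
    """Lowercased word tokens, with identifiers split into their parts."""
    if not keep_comments:
        text = "\n".join(LINE_COMMENT.sub("", line) for line in text.splitlines())
    # Split camelCase; extracting alphabetic runs also splits snake_case.
    text = CAMEL_BOUNDARY.sub(" ", text)
    return [w.lower() for w in WORD.findall(text)]

def repo_to_document(files, keep_comments=True):
    """Merge all files of one repo (path -> contents) into one bag of words."""
    bag = Counter()
    for contents in files.values():
        bag.update(tokenize(contents, keep_comments))
    return bag

# Hypothetical two-file repo.
repo = {
    "README": "Rails is a web-application framework",
    "app.py": "def renderView(user_name):  # draw the page\n    return user_name",
}
with_comments = repo_to_document(repo, keep_comments=True)
without_comments = repo_to_document(repo, keep_comments=False)
```

Excluding README files would be a second toggle of the same kind, for example by filtering `files` on path before merging.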

Challenges

  • Because topic modeling on all 120K repos may be expensive, we may perform prediction on a subset of the data.
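
A minimal sketch of one way to build such a subset, assuming the watch data is available as (user id, repo id) pairs: sample repos uniformly at random and keep only the watches that involve them. Uniform sampling and the name `sample_repos` are assumptions; whether a uniform sample is representative of the full data is exactly the open question raised under Methodology.

```python
import random

def sample_repos(watches, n_repos, seed=0):
    """Keep the watches of a uniform random subset of n_repos repos."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    repo_ids = sorted({repo for _, repo in watches})
    kept = set(rng.sample(repo_ids, min(n_repos, len(repo_ids))))
    return [(user, repo) for user, repo in watches if repo in kept]

# Hypothetical (user_id, repo_id) pairs.
watches = [(1, 10), (1, 11), (2, 10), (3, 12), (3, 13)]
subset = sample_repos(watches, n_repos=2)
```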

More info about the Github contest

More information about the Github contest is available below:

Related Work

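Using topic models to improve collaborative filtering has been investigated in the following paper:

  • C Wang and D Blei, "Collaborative Topic Modeling for Recommending Scientific Articles" (http://www.cs.cmu.edu/~chongw/papers/WangBlei2011.pdf), in KDD 2011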

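The following papers describe applying topic models to code:

  • E Linstead, P Rigor, S Bajracharya, C Lopes, and P Baldi, "Mining concepts from code with probabilistic topic models" (http://dl.acm.org/citation.cfm?id=1321631.1321709), ASE '07 Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering
  • K Tian, M Revelle, and D Poshyvanyk, "Using latent dirichlet allocation for automatic categorization of software" (http://www.cs.wm.edu/~ktian/pub/MSR2009_Tian.pdf), MSR '09 Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories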