GitHub Repo Recommendation: Topic Model meets Code

From Cohen Courses
Revision as of 15:46, 10 October 2012 by Norii

Comments

Looks like a nice, well-defined project. It's not immediately obvious to me how you plan to do the content-based recommendation. Are you going to use Blei's CTR, or is there some other approach you want to use?

Suggestions: the GraphLab matrix factorization code has been pretty widely used, and is fairly scalable.

Good luck! --Wcohen 15:20, 10 October 2012 (UTC)

Task

Item recommendation

Overview

GitHub is a social networking site for programmers, where they can host source code repositories (also called repos). Users can watch repos they are interested in; a user who is watching a repo receives status updates on its activity (such as commits and tags).

In 2009, GitHub hosted a recommendation contest whose objective was to recommend repositories to users. The dataset contained 56K users, 120K repositories, and 440K user-watches-repo relationships.

Collaborative filtering typically relies on some form of matrix factorization and ignores content (in this case, the source code in the repos). In this project, we instead propose to incorporate the topics inherent in each repo's source code to improve predictions.
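
To make the idea concrete, here is a minimal sketch of one possible way to combine the two signals. It assumes each repo already has a vector of topic proportions (for example, from a topic model run on its source code) and blends a collaborative-filtering score with a content score. The repo names, topic numbers, the `alpha` weight, and the linear blend itself are illustrative assumptions, not a method this proposal has settled on.

```python
import math

# Hypothetical topic proportions per repo, e.g. the output of a topic model
# run on each repo's source code. Names and numbers are invented.
REPO_TOPICS = {
    "web-framework": [0.9, 0.1],
    "http-client": [0.8, 0.2],
    "matrix-lib": [0.1, 0.9],
}


def user_topic_profile(watched, repo_topics):
    """Average the topic vectors of the repos a user watches."""
    n_topics = len(next(iter(repo_topics.values())))
    profile = [0.0] * n_topics
    for repo in watched:
        for k, weight in enumerate(repo_topics[repo]):
            profile[k] += weight / len(watched)
    return profile


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_score(cf_score, content_score, alpha=0.5):
    """Assumed linear blend of a collaborative score and a content score."""
    return alpha * cf_score + (1.0 - alpha) * content_score


def rank_candidates(watched, cf_scores, repo_topics, alpha=0.5):
    """Rank unwatched repos by blended score, best first."""
    profile = user_topic_profile(watched, repo_topics)
    scored = []
    for repo, topics in repo_topics.items():
        if repo in watched:
            continue
        content = cosine(profile, topics)
        cf = cf_scores.get(repo, 0.0)  # missing CF score counts as 0
        scored.append((hybrid_score(cf, content, alpha), repo))
    scored.sort(reverse=True)
    return [repo for _, repo in scored]


# A user who watches only "web-framework", with no collaborative scores yet.
RANKED = rank_candidates(["web-framework"], {}, REPO_TOPICS)
```

With no collaborative scores available, the ranking falls back to topic similarity alone, so the repo whose topics resemble the watched repo comes first.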

Team

Naoki Orii (looking for teammates)

Datasets

Baseline Method

A collaborative-filtering-based method, such as SVD.
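
As a rough illustration of such a baseline, the sketch below builds a rank-k approximation of a tiny binary user-by-repo watch matrix and recommends the unwatched repo with the highest reconstructed score. To stay dependency-free it computes the truncated SVD by power iteration with deflation; in practice a library SVD or matrix factorization routine would be used instead. The toy matrix, the function names, and the choice of rank are assumptions made for illustration.

```python
import math


def top_singular_triplet(matrix, iterations=200):
    """Power iteration on M^T M; returns (sigma, u, v) for the top singular value."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # deterministic start vector
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma > 0.0:
        u = [x / sigma for x in u]
    return sigma, u, v


def truncated_svd_scores(watches, rank):
    """Rank-`rank` approximation of the watch matrix, one component at a time."""
    n_rows, n_cols = len(watches), len(watches[0])
    residual = [[float(x) for x in row] for row in watches]
    scores = [[0.0] * n_cols for _ in range(n_rows)]
    for _ in range(rank):
        sigma, u, v = top_singular_triplet(residual)
        for i in range(n_rows):
            for j in range(n_cols):
                component = sigma * u[i] * v[j]
                scores[i][j] += component
                residual[i][j] -= component  # deflate before the next component
    return scores


def recommend(watches, scores, user, top_n=1):
    """Indices of the unwatched repos with the highest reconstructed scores."""
    candidates = [j for j, seen in enumerate(watches[user]) if seen == 0]
    candidates.sort(key=lambda j: scores[user][j], reverse=True)
    return candidates[:top_n]


# Toy data: rows are users, columns are repos, 1 means "user watches repo".
WATCHES = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
]
SCORES = truncated_svd_scores(WATCHES, rank=2)
```

In this toy matrix, users 0 and 1 share two repos, so the low-rank reconstruction gives user 0 a positive score for repo 2 (watched only by user 1), and `recommend(WATCHES, SCORES, 0)` picks it.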

Challenges

  • Topic modeling on all 120K repos may be expensive, so we may restrict prediction to a subset of the data
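
How the subset would be chosen is not decided here. As one hypothetical option, the sketch below draws a reproducible random sample of repos and keeps only the watch relationships that point at them. The `(user, repo)` pair format, the function name, and the seed are assumptions for illustration.

```python
import random


def sample_repos(watch_pairs, n_repos, seed=0):
    """Keep only the watches that point at a random sample of n_repos repos."""
    repos = sorted({repo for _, repo in watch_pairs})
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = set(rng.sample(repos, min(n_repos, len(repos))))
    kept_pairs = [(user, repo) for user, repo in watch_pairs if repo in kept]
    return kept, kept_pairs


# Toy (user, repo) watch pairs; identifiers are invented.
WATCH_PAIRS = [(1, "a"), (1, "b"), (2, "b"), (2, "c"), (3, "d")]
KEPT_REPOS, KEPT_PAIRS = sample_repos(WATCH_PAIRS, n_repos=2)
```

Sampling repos uniformly is only one choice; filtering by a minimum number of watchers would be another, and either changes which users remain in the evaluation.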

More info about the GitHub contest

More information about the GitHub contest is available below:

Related Work

Using topic models to improve collaborative filtering has been investigated in the following paper:

The following papers describe applying topic models to code: