Difference between revisions of "Github Repo Recommendation:Topic Model meets Code"

From Cohen Courses
 
Revision as of 22:18, 8 October 2012

Task

Item recommendation

Overview

Github is a social network site for programmers, where they can host source code repositories (also called repos). Users can watch repos they are interested in; a user watching a repo receives status updates on its activity (such as commits and tags).

In 2009, Github hosted a recommendation contest, where the objective was to recommend repositories to users. The dataset contained 56K users, 120K repositories, and 440K user-watches-repo relationships between them.

Collaborative filtering typically relies on some form of matrix factorization and ignores content (in this case, the repos' source code). In this project, we propose incorporating the inherent topics of each repo's source code to improve predictions.
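
As a rough illustration of how topics could enter the prediction, the sketch below scores repos for each user by the cosine similarity between the repo's topic vector and the user's profile, where the profile is the sum of the topic vectors of the repos the user watches. This is only one possible formulation, not the project's method: the function name topic_scores, the toy watch matrix, and the hand-written topic vectors (standing in for the output of a topic model run over source code) are all assumptions made for the example.

```python
import numpy as np

def topic_scores(watch, repo_topics):
    """Score every (user, repo) pair by topic similarity.

    watch       : (n_users, n_repos) 0/1 matrix of user-watches-repo links
    repo_topics : (n_repos, n_topics) topic proportions per repo
                  (assumed to come from some topic model; hand-made here)
    """
    # User profile: sum of the topic vectors of the repos the user watches.
    profiles = watch @ repo_topics
    # Normalize rows so the dot product below is a cosine similarity.
    p_norm = np.linalg.norm(profiles, axis=1, keepdims=True)
    r_norm = np.linalg.norm(repo_topics, axis=1, keepdims=True)
    profiles = profiles / np.maximum(p_norm, 1e-12)
    repos = repo_topics / np.maximum(r_norm, 1e-12)
    return profiles @ repos.T

# Toy data (hypothetical): 2 users, 3 repos, 2 topics.
watch = np.array([[1, 0, 0],
                  [0, 0, 1]], dtype=float)
repo_topics = np.array([[0.9, 0.1],
                        [0.8, 0.2],
                        [0.1, 0.9]])
scores = topic_scores(watch, repo_topics)
```

In this toy setup the first user, who watches a repo dominated by the first topic, gets a higher score for the second repo than for the third. Such content scores could then be blended with collaborative filtering scores, although how to weight the two is left open here.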

Team

Naoki Orii

Datasets

Baseline Method

A collaborative filtering method, such as SVD (singular value decomposition)
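
A minimal sketch of what such a baseline might look like, assuming a small dense user-by-repo watch matrix and plain truncated SVD from NumPy. The function names svd_scores and recommend, the rank, and the toy data are assumptions for illustration; the real contest data (56K users by 120K repos) would need a sparse solver instead of a dense decomposition.

```python
import numpy as np

def svd_scores(watch, rank):
    """Rank-`rank` reconstruction of the user-by-repo watch matrix."""
    u, s, vt = np.linalg.svd(watch, full_matrices=False)
    # Keep only the top `rank` singular triplets.
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]

def recommend(watch, rank, top_n):
    """Return, per user, the indices of the top_n unwatched repos."""
    scores = svd_scores(watch, rank)
    # Mask repos the user already watches so they are never recommended.
    scores = np.where(watch > 0, -np.inf, scores)
    return np.argsort(-scores, axis=1)[:, :top_n]

# Toy data (hypothetical): 4 users, 4 repos; 1 means "user watches repo".
watch = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)
recs = recommend(watch, rank=2, top_n=1)
```

Each row of recs holds the recommended repo index for one user. Because watched repos are masked out, a user with a single unwatched repo is always recommended that repo.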

Challenges

  • Because topic modeling on all 120K repos may be expensive, we may run predictions on a subset of the data

More info about the Github contest

More information about the Github contest is available below:

Related Work