Tag Predicting For Stackoverflow

From Cohen Courses

Revision as of 13:41, 15 October 2012

Comments

Dandan, we've talked about this, so I won't say too much more. Keep me informed about your progress - looks interesting! --Wcohen 15:21, 10 October 2012 (UTC)

Members

Dandan Zheng

Zaid Sheikh

Idea

Predict useful tag(s) for a new post, or for an existing post that does not yet have proper tags. Stackoverflow is a computer science question-and-answer website. Each post can be linked to several tags, such as the language, algorithm, or product involved.

Several ideas are in play, but the goal is to build a cluster of semi-supervised classifiers whose aggregated accuracy beats that of a single classifier.

Instead of building a single classifier for multiple tags, build a set of classifiers, one per tag. Retrain the classifiers iteratively, guided by the accuracy of their results.
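
As a rough illustration of the one-classifier-per-tag idea, the sketch below trains an independent binary naive Bayes model for each tag and predicts every tag whose model fires. The class and function names, the whitespace tokenizer, and the four toy posts are all assumptions made for this example; the iterative, semi-supervised retraining step is not shown.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer; real posts would need something richer."""
    return text.lower().split()

class BinaryNaiveBayes:
    """Multinomial naive Bayes for a single tag (post has the tag vs. does not)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def log_score(self, text, label):
        total_docs = sum(self.doc_counts.values())
        # Add-one smoothing on both the class prior and the word likelihoods.
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for tok in tokenize(text):
            if tok in self.vocab:  # words never seen in training are ignored
                score += math.log((self.word_counts[label][tok] + 1) / denom)
        return score

    def predict(self, text):
        return self.log_score(text, True) > self.log_score(text, False)

def train_per_tag(posts):
    """posts: list of (text, set_of_tags). Returns {tag: BinaryNaiveBayes}."""
    all_tags = set().union(*(tags for _, tags in posts))
    docs = [text for text, _ in posts]
    return {
        tag: BinaryNaiveBayes().fit(docs, [tag in tags for _, tags in posts])
        for tag in all_tags
    }

def predict_tags(models, text):
    """Return every tag whose own binary classifier accepts the text."""
    return {tag for tag, model in models.items() if model.predict(text)}

# Toy data, invented for illustration only.
posts = [
    ("how to sort a list in java", {"java", "sorting"}),
    ("java nullpointerexception in main", {"java"}),
    ("python list comprehension syntax", {"python"}),
    ("sort a dict by value in python", {"python", "sorting"}),
]
models = train_per_tag(posts)
print(predict_tags(models, "java nullpointerexception"))
```

Because each tag has its own model, a tag-specific feature set can be plugged in per classifier without touching the others.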

For each classifier, carefully choose features that are strongly positively related to the tag. First, make a preliminary selection of features positively related to the tags (e.g. using label propagation). Then construct a graph over the features: vertices are features, and an edge carries the confidence that a feature is a good candidate given that the feature it is linked to is a good one (e.g. PMI values which can be proved). Finally, traverse the graph, starting from seed vertices, and pick the features for each classifier.
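
One possible reading of the feature graph, as a minimal sketch: edge weights are document-level PMI between two features, PMI(a, b) = log(P(a, b) / (P(a) P(b))), and features are picked by a breadth-first walk from the seeds. The function names, the `min_pmi` threshold, the traversal order, and the toy documents are assumptions for illustration; the preliminary label-propagation step is not shown.

```python
import math
from collections import Counter, deque
from itertools import combinations

def pmi_graph(docs, min_pmi=0.0):
    """Build an undirected feature graph weighted by document-level PMI.

    docs: list of sets of features (e.g. the distinct words of each post).
    Returns {feature: {neighbor: pmi}}, keeping only edges with pmi > min_pmi.
    """
    n = len(docs)
    single = Counter()
    pair = Counter()
    for feats in docs:
        single.update(feats)
        pair.update(combinations(sorted(feats), 2))
    graph = {f: {} for f in single}
    for (a, b), n_ab in pair.items():
        # P(a, b) = n_ab / n, P(a) = single[a] / n, P(b) = single[b] / n.
        pmi = math.log(n_ab * n / (single[a] * single[b]))
        if pmi > min_pmi:
            graph[a][b] = pmi
            graph[b][a] = pmi
    return graph

def select_features(graph, seeds, max_features):
    """Breadth-first walk from the seeds, following stronger edges first."""
    selected = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(selected) < max_features:
        node = queue.popleft()
        selected.append(node)
        neighbors = sorted(graph.get(node, {}).items(), key=lambda kv: -kv[1])
        for neighbor, _ in neighbors:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return selected

# Toy documents, invented for illustration only.
docs = [
    {"java", "jvm", "classpath"},
    {"java", "jvm"},
    {"python", "pip"},
    {"python", "pip", "virtualenv"},
]
graph = pmi_graph(docs)
print(select_features(graph, ["java"], max_features=3))
```

Features that never co-occur get no edge, so a walk seeded at "java" stays inside the Java-related component in this toy example.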

Data sets

The Stackoverflow data dump up to August 2012. The data for the stackoverflow.com website alone is 35.8GB, consisting purely of text data for the posts and related information. The data for Stackoverflow and all its sibling websites is over 100GB.
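
At this size the dump cannot be loaded into memory in one go. The sketch below streams posts one at a time with the standard library. It assumes the posts are stored as XML `row` elements with `Id`, `Tags` and `Body` attributes, and that tags are encoded like `<java><sorting>`; this layout is an assumption here and should be checked against the actual files. The function name `iter_posts` and the in-memory sample are made up for the example.

```python
import io
import re
import xml.etree.ElementTree as ET

def iter_posts(source):
    """Yield (post_id, tags, body) one row at a time from a file or file object."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            # Assumed tag encoding: "<java><sorting>" -> ["java", "sorting"].
            tags = re.findall(r"<([^<>]+)>", elem.get("Tags", ""))
            yield elem.get("Id"), tags, elem.get("Body", "")
            root.clear()  # drop rows already processed so memory stays bounded

# Tiny in-memory stand-in for the real file, invented for illustration.
sample = io.BytesIO(
    b'<posts>'
    b'<row Id="1" Tags="&lt;java&gt;&lt;sorting&gt;" Body="&lt;p&gt;How?&lt;/p&gt;" />'
    b'<row Id="2" Body="&lt;p&gt;Like this.&lt;/p&gt;" />'
    b'</posts>'
)
rows = list(iter_posts(sample))
print(rows)
```

Passing a file path instead of the `io.BytesIO` object works the same way, since `ET.iterparse` accepts either.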

Baseline Method

  • One naive baseline: for each programming-language tag, search the post for keywords such as Java or C++, then tag the post according to the keywords found.
  • Build a standard classifier, such as naive Bayes or an SVM, using the words in the document as features.
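
The keyword baseline could look roughly like the sketch below. The keyword table is hypothetical and would in practice be derived from the tag vocabulary; the function names are likewise made up for this example.

```python
import re

# Hypothetical keyword lists, for illustration only.
KEYWORDS = {
    "java": ["java"],
    "c++": ["c++", "cpp"],
    "python": ["python"],
}

def compile_patterns(keywords):
    """One case-insensitive regex per tag, matching keywords as whole tokens."""
    patterns = {}
    for tag, words in keywords.items():
        alternation = "|".join(re.escape(w) for w in words)
        # Explicit lookarounds instead of \b, which misbehaves next to "+"
        # and would also let "java" match inside "javascript".
        patterns[tag] = re.compile(
            r"(?<![A-Za-z0-9+#])(?:" + alternation + r")(?![A-Za-z0-9+#])",
            re.IGNORECASE,
        )
    return patterns

def keyword_tags(text, patterns):
    """Return every tag with at least one keyword occurring in the text."""
    return {tag for tag, pattern in patterns.items() if pattern.search(text)}

PATTERNS = compile_patterns(KEYWORDS)
print(keyword_tags("Porting a Java class to C++", PATTERNS))
```

Whole-token matching matters even for a naive baseline: plain substring search would tag every JavaScript question as Java.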

Challenges

  • How can the Stackoverflow data be used effectively, given that each post usually contains not only sentences but also code and other content?
  • The dataset is relatively large: how can we build an efficient algorithm to process it and construct the classifiers?
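
For the first challenge, one option is to separate code from prose before extracting features, so that each can be tokenized differently. The sketch below assumes post bodies are HTML with code wrapped in `code` elements, which should be verified against the dump; the class and function names are made up for the example.

```python
from html.parser import HTMLParser

class PostSplitter(HTMLParser):
    """Collect text inside <code> elements separately from the remaining prose."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.code_depth = 0
        self.prose = []
        self.code = []  # one string per top-level <code> element

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            if self.code_depth == 0:
                self.code.append("")  # start a new snippet
            self.code_depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self.code_depth > 0:
            self.code_depth -= 1

    def handle_data(self, data):
        if self.code_depth:
            self.code[-1] += data
        else:
            self.prose.append(data)

def split_post(body_html):
    """Return (prose_text, list_of_code_snippets) for one post body."""
    parser = PostSplitter()
    parser.feed(body_html)
    parser.close()
    # Normalize whitespace in the prose; code snippets are kept verbatim.
    prose = " ".join(" ".join(parser.prose).split())
    return prose, parser.code

# Sample body, invented for illustration.
body = (
    "<p>Why does this loop fail?</p>"
    "<pre><code>for (int i = 0; i &lt; n; i++) {}</code></pre>"
    "<p>Thanks.</p>"
)
prose, code = split_post(body)
print(prose)
print(code)
```

The prose part can then feed a word-based classifier, while the code part could supply separate features such as identifiers or language keywords.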