Tag Prediction for Stack Overflow
Comments
Dandan, we've talked about this, so I won't say too much more. Keep me informed about your progress - looks interesting! --Wcohen 15:21, 10 October 2012 (UTC)
Members
Idea and the task
The task is to predict useful tags for a new post, or for an existing post that does not have proper tags yet. Stack Overflow is a computer science question-and-answer website. Each post can be linked to several tags, such as the languages, algorithms, or products involved.
There are many possible directions, but the goal is to build a set of semi-supervised classifiers whose aggregated accuracy beats that of a single classifier.
Instead of building a single classifier for all tags, build a separate classifier for each tag. Train the classifiers iteratively, guided by the accuracy of their results.
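As a minimal sketch of the one-classifier-per-tag setup, the multi-tag training data can be split into one binary dataset per tag, so that any binary learner can be trained on each. The function name `per_tag_datasets` and the `(tokens, tags)` input shape are assumptions made for illustration, not part of the proposal.

```python
def per_tag_datasets(posts):
    """Split multi-tag posts into one binary dataset per tag.

    posts: list of (tokens, tags) pairs, where tags is a set of strings.
    Returns {tag: [(tokens, is_tagged), ...]}, one entry per observed tag.
    """
    all_tags = sorted({tag for _, tags in posts for tag in tags})
    return {
        tag: [(tokens, tag in tags) for tokens, tags in posts]
        for tag in all_tags
    }
```

Each per-tag dataset treats every post without the tag as a negative example, which is a simplification: on real data a missing tag may only mean that nobody added it.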
For each classifier, carefully choose features that are strongly positively associated with the tag. First, select a preliminary set of features positively associated with the tag (e.g. using label propagation). Then construct a graph of the features: the vertices are the features, and each edge carries the confidence that a feature is a good candidate given that its neighbor is a good one (e.g. PMI values, which can be verified). Finally, traverse the graph starting from seed vertices and pick up the features for each classifier.
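The feature graph could be sketched as follows, assuming document-level co-occurrence PMI as the edge weight and a breadth-first walk from the seeds as the traversal. Both choices, and the names `pmi_graph`, `expand_from_seeds`, `min_pmi`, and `max_features`, are illustrative assumptions; the proposal does not fix them.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def pmi_graph(docs, min_pmi=0.0):
    """Build an undirected graph {word: {neighbor: pmi}} from tokenized docs.

    PMI(a, b) = log(P(a, b) / (P(a) * P(b))), with probabilities estimated
    from how many documents contain each word or word pair.
    """
    n_docs = len(docs)
    word_df = Counter()
    pair_df = Counter()
    for doc in docs:
        words = sorted(set(doc))
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    graph = defaultdict(dict)
    for (a, b), n_ab in pair_df.items():
        pmi = math.log((n_ab * n_docs) / (word_df[a] * word_df[b]))
        if pmi > min_pmi:  # keep only positively associated pairs
            graph[a][b] = pmi
            graph[b][a] = pmi
    return graph

def expand_from_seeds(graph, seeds, max_features=10):
    """Walk the graph breadth-first from the seeds, strongest edges first."""
    selected = list(seeds)
    seen = set(seeds)
    frontier = list(seeds)
    while frontier and len(selected) < max_features:
        next_frontier = []
        for node in frontier:
            neighbors = sorted(graph.get(node, {}).items(),
                               key=lambda kv: (-kv[1], kv[0]))
            for neighbor, _ in neighbors:
                if neighbor not in seen and len(selected) < max_features:
                    seen.add(neighbor)
                    selected.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return selected
```

Starting from a seed such as "java", the walk only reaches words connected to it through positive-PMI edges, so vocabulary that never co-occurs with the seed's neighborhood is left out of that tag's feature set.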
Also, the classifier can be used to label the unlabeled documents, from which we keep those labeled with high confidence. Use these newly labeled high-confidence documents as input and build the feature graph again. This step can be repeated, adding more high-confidence features to the classifier in each round.
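The iterative step resembles self-training. Below is a minimal sketch for a single tag, assuming a small multinomial naive Bayes model with add-one smoothing as the classifier and a fixed confidence threshold; the classifier choice, the threshold value, and all function names are assumptions for illustration. The sketch retrains the classifier each round and leaves out rebuilding the feature graph.

```python
import math
from collections import Counter

def train_nb(labeled):
    """Train on a list of (tokens, label) pairs, label being True or False."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for tokens, label in labeled:
        word_counts[label].update(tokens)
        doc_counts[label] += 1
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def prob_positive(model, tokens):
    """Return P(label=True | tokens); tokens outside the vocabulary are ignored."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log((doc_counts[label] + 1) / (total_docs + 2))
        for tok in tokens:
            if tok in vocab:
                score += math.log((word_counts[label][tok] + 1)
                                  / (total_words + len(vocab)))
        log_score[label] = score
    top = max(log_score.values())  # subtract the max for numerical stability
    exp_true = math.exp(log_score[True] - top)
    exp_false = math.exp(log_score[False] - top)
    return exp_true / (exp_true + exp_false)

def self_train(labeled, unlabeled, threshold=0.9, max_rounds=5):
    """Repeatedly move confidently classified documents into the labeled set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb(labeled)
        confident, remaining = [], []
        for tokens in pool:
            p = prob_positive(model, tokens)
            if p >= threshold:
                confident.append((tokens, True))
            elif p <= 1 - threshold:
                confident.append((tokens, False))
            else:
                remaining.append(tokens)
        if not confident:  # nothing new was learned, so stop early
            break
        labeled.extend(confident)
        pool = remaining
    return train_nb(labeled), labeled, pool
```

A document that shares no vocabulary with the labeled set in one round can become classifiable in a later round, once documents containing its words have been absorbed. The usual caveat of self-training applies: a confidently wrong label is fed back into training, so the threshold needs care.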
Data sets
Stack Overflow data dump through August 2012. The data for the stackoverflow.com website is 35.8GB, consisting purely of text for the posts and related information. The data for Stack Overflow and all its sibling websites is over 100GB.
Baseline Method
- A naive baseline: for each programming-language tag, search the post for keywords such as Java or C++, then tag the post according to the keywords found.
- Build a standard classifier, such as naive Bayes or an SVM, using the words in the document as features.
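The keyword baseline can be sketched in a few lines. The keyword table below is a hypothetical example covering three tags, and the patterns are assumptions; a real table would need one entry per language tag.

```python
import re

# Hypothetical keyword table: tag -> pattern that signals the tag.
TAG_PATTERNS = {
    "java": re.compile(r"\bjava\b", re.IGNORECASE),
    "c++": re.compile(r"c\+\+", re.IGNORECASE),
    "python": re.compile(r"\bpython\b", re.IGNORECASE),
}

def keyword_tags(post_text):
    """Return the sorted tags whose keyword pattern occurs in the post."""
    return sorted(tag for tag, pattern in TAG_PATTERNS.items()
                  if pattern.search(post_text))
```

The word boundaries keep "JavaScript" from matching the "java" pattern, which is the kind of false positive that makes plain substring search a weak baseline.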
Challenges
- How can Stack Overflow data be used effectively, given that a post usually contains not only sentences but also code and other material?
- The dataset is relatively large: how can we build an efficient algorithm to process it and construct the classifiers?
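One possible starting point for the first challenge is to separate code from prose before extracting features, so that each part can be tokenized differently. The sketch below assumes that a post body is stored as HTML with code wrapped in `<code>` elements; the class and function names are illustrative.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "pre", "li", "br", "div", "blockquote"}

class PostSplitter(HTMLParser):
    """Collect text inside <code> elements separately from all other text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.code_depth = 0
        self.prose = []
        self.code = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.code_depth += 1
        elif tag in BLOCK_TAGS:
            self.prose.append(" ")  # keep words in adjacent blocks apart

    def handle_endtag(self, tag):
        if tag == "code" and self.code_depth > 0:
            self.code_depth -= 1
        elif tag in BLOCK_TAGS:
            self.prose.append(" ")

    def handle_data(self, data):
        target = self.code if self.code_depth else self.prose
        target.append(data)

def split_post(body_html):
    """Return (prose, code): whitespace-normalized prose and raw code text."""
    parser = PostSplitter()
    parser.feed(body_html)
    parser.close()
    prose = " ".join("".join(parser.prose).split())
    code = "\n".join(parser.code).strip()
    return prose, code
```

Inline `<code>` spans within a sentence also end up in the code part, which may or may not be desirable, since short identifiers mentioned in prose can be strong signals for a tag.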