Analyzing Community driven Question Answering Sites

From Cohen Courses
Revision as of 07:40, 11 October 2012 by Apappu

Comments

  • Interesting idea. But how much mileage would you get from analyzing just the question itself?
    • Do you plan to use related questions to analyze a particular question?
    • If the related questions are used, will you use their answers?
  • How do you plan to evaluate the longevity? Do you plan to predict a range?

--Apappu 11:39, 11 October 2012 (UTC)

Team Members

Anika Gupta

Shourabh Rawat

Abstract

Question answering communities such as Yahoo! Answers and StackOverflow have emerged as popular and effective information resources on the web. The questions, together with the entire set of corresponding answers, form a large resource for exploring many research questions related to question answering. One interesting analysis is to track the lifetime of questions in such environments. Lifetimes vary: a question may be declared closed by the community, it may be short-lived because an expert answers it sufficiently, or it may generate a lot of interaction among users over a relatively long duration. Analyzing such questions and trying to predict their longevity is one of the goals of our project. Other interesting aspects to explore are identifying questions that have not been sufficiently answered, and identifying user expertise for improved recommendations and automatic tag prediction.
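To make the notion of question lifetime concrete, here is a minimal sketch under stated assumptions. The proposal does not define lifetime precisely, so the definition used here (time from when the question was asked to its last recorded activity), the function names, and the bucket thresholds are all hypothetical choices for illustration.

```python
from datetime import datetime, timedelta

def question_lifetime(asked_at, activity_times):
    """Span between asking a question and its last recorded activity.

    Assumed definition, not taken from the proposal.
    """
    if not activity_times:
        return timedelta(0)
    return max(max(activity_times) - asked_at, timedelta(0))

def lifetime_bucket(lifetime, closed=False):
    """Map a lifetime to a coarse label; the thresholds are illustrative only."""
    if closed:
        return "closed"
    if lifetime <= timedelta(hours=1):
        return "short"
    if lifetime <= timedelta(days=7):
        return "medium"
    return "long"

asked = datetime(2012, 10, 1, 9, 0)
answers = [datetime(2012, 10, 1, 9, 20), datetime(2012, 10, 1, 9, 45)]
print(lifetime_bucket(question_lifetime(asked, answers)))
```

Bucketing into ranges is one way to turn longevity prediction into a classification task; predicting the raw duration would be a regression task instead.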

Datasets

The Stack Overflow Data that we plan to use is publicly available from StackOverflow under a Creative Commons license. One can download the latest version from here.

Here are some of the statistics about the data:

  • Users 440K (198K questioners, 71K answerers)
  • Questions 1M (69% with accepted answer)
  • Answers 2.8M (26% marked as accepted)
  • Votes 7.6M (93% positive)
  • Favorites 775K actions on 318K questions
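Statistics such as the share of questions with an accepted answer could be recomputed from the dump roughly as follows. This is a sketch, not the code used to produce the numbers above. It assumes the posts file is XML with one `row` element per post and the attributes `Id`, `PostTypeId` (1 for a question, 2 for an answer), `AcceptedAnswerId`, and `ParentId`; the inline sample and the `summarize` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the assumed dump layout (not real data).
SAMPLE = """<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="3" />
  <row Id="2" PostTypeId="1" />
  <row Id="3" PostTypeId="2" ParentId="1" />
  <row Id="4" PostTypeId="2" ParentId="1" />
  <row Id="5" PostTypeId="2" ParentId="2" />
</posts>"""

def summarize(xml_text):
    """Count questions, answers, and the accepted fractions of each."""
    questions = answers = with_accepted = 0
    accepted_ids, answer_ids = set(), set()
    for row in ET.fromstring(xml_text).iter("row"):
        post_type = row.get("PostTypeId")
        if post_type == "1":
            questions += 1
            accepted = row.get("AcceptedAnswerId")
            if accepted:
                with_accepted += 1
                accepted_ids.add(accepted)
        elif post_type == "2":
            answers += 1
            answer_ids.add(row.get("Id"))
    accepted_answers = len(accepted_ids & answer_ids)
    return {
        "questions": questions,
        "answers": answers,
        "questions_accepted_frac": with_accepted / questions if questions else 0.0,
        "answers_accepted_frac": accepted_answers / answers if answers else 0.0,
    }

stats = summarize(SAMPLE)
print(stats)
```

The full dump is large, so a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading the whole file into memory.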

Techniques Used

  • We plan to use a wide set of features incorporating both textual and network attributes.
  • To gain initial insights into the data, we'll use standard topic models such as LDA, and SVMs for classification.
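As an illustration of what textual and network attributes might look like, the sketch below computes a few simple features using only the standard library. The specific features, the function names, and the (asker, answerer) edge representation are assumptions made for this example; the proposal does not list the actual feature set. Fitting LDA or training an SVM on such features would need an external library and is not shown.

```python
import re
from collections import Counter

def textual_features(title, body):
    """A few simple text features for one question (illustrative choices)."""
    tokens = re.findall(r"[a-z0-9+#]+", (title + " " + body).lower())
    return {
        "n_tokens": len(tokens),
        "n_unique_tokens": len(set(tokens)),
        "title_is_question": int(title.strip().endswith("?")),
    }

def network_features(answer_edges):
    """Per-user counts from (asker, answerer) pairs, one pair per answer."""
    edges = list(answer_edges)
    received = Counter(asker for asker, _ in edges)
    given = Counter(answerer for _, answerer in edges)
    users = set(received) | set(given)
    return {
        user: {"answers_received": received[user], "answers_given": given[user]}
        for user in users
    }

text = textual_features("How do I parse XML in Python?",
                        "I tried ElementTree but it fails.")
net = network_features([("alice", "bob"), ("alice", "carol"), ("dave", "bob")])
print(text)
print(net["bob"])
```

The number of answers a user has given is a crude proxy for expertise; reputation scores and bounties, mentioned under Challenges, would be natural additions.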

Challenges

  • Relatively unexplored dataset. Most prior work has used the Yahoo! Answers dataset.
  • Complex network dynamics such as the reputation system and bounties. Understanding them is key to getting good results.

Relevant Literature

  • Anderson et al.
  • Adamic et al. study Yahoo! Answers to explore the interactions of users, and include preliminary work on predicting the best answer.
  • Jeon et al. predict the quality of answers using non-textual features.