Analyzing Community driven Question Answering Sites

From Cohen Courses
Revision as of 10:27, 16 October 2012 by Anikag (talk | contribs)

Comments

  • Interesting idea. But how much mileage would you get from analyzing just the question itself?
    • Do you plan to use related questions to analyze a particular question?
    • If the related questions are used, will you use their answers?
  • How do you plan to evaluate the longevity? Do you plan to predict a range?

--Apappu 11:39, 11 October 2012 (UTC)

We discussed the project idea with Prof. Cohen. We plan to model the network of experts and concepts. The concepts are the sub-topics underlying the questions and tags in posts on StackOverflow. The experts are StackOverflow users who have been awarded different badges based on their performance and who post and answer questions to help the community (this also helps them gain points).

We plan to start by creating a workflow of concepts for a topic, say Python. Given the topic Python, its related sub-topics can be categorized and chained to highlight how easy and how popular each concept is among users. The result would be a workflow of concepts similar to the structure of a book.
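As a rough illustration of the chaining idea, the sketch below orders the sub-topics of a topic from easiest to hardest. It is only a sketch under stated assumptions: the toy records, the function name `concept_workflow`, and the difficulty proxy (average number of distinct answerers per question) are hypothetical and not taken from the StackOverflow dump.

```python
from collections import defaultdict

# Hypothetical toy records: (tags of the question, number of distinct answerers).
questions = [
    (["python", "lists"], 5),
    (["python", "lists"], 4),
    (["python", "decorators"], 2),
    (["python", "decorators"], 1),
    (["python", "metaclasses"], 1),
]

def concept_workflow(questions, topic):
    """Order the sub-topics of `topic` from easiest to hardest.

    Assumed difficulty proxy: the fewer people answer questions on a
    concept on average, the harder the concept is taken to be.
    """
    counts = defaultdict(int)      # popularity: questions per concept
    answerers = defaultdict(int)   # total answerers per concept
    for tags, n_answerers in questions:
        if topic not in tags:
            continue
        for tag in tags:
            if tag != topic:
                counts[tag] += 1
                answerers[tag] += n_answerers
    # Easiest first (highest average answerers); ties broken by popularity, then name.
    return sorted(counts, key=lambda t: (-answerers[t] / counts[t], -counts[t], t))

workflow = concept_workflow(questions, "python")
print(workflow)
```

Any other difficulty signal, such as time to the first accepted answer, could be swapped into the sort key without changing the rest of the sketch.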

The second task is to model the specialties of the experts by analyzing the topics of the questions each expert asks or answers. This can in turn support recommending questions to the right expert. For example, experts who are able to answer difficult questions on a particular concept can be presented with similar questions to capture their interest.
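One simple way to picture this task is to build a tag-count profile per expert and route a new question to the expert whose profile overlaps most with its tags. The sketch below assumes a hypothetical answer log and hypothetical helper names (`build_profiles`, `recommend_experts`); the scoring rule is an illustrative choice, not a method we have settled on.

```python
from collections import Counter

# Hypothetical answer log: (user, tags of the question they answered).
answers = [
    ("alice", ["python", "decorators"]),
    ("alice", ["python", "metaclasses"]),
    ("bob", ["python", "lists"]),
    ("bob", ["java", "generics"]),
]

def build_profiles(answers):
    """Count how often each user has answered questions carrying each tag."""
    profiles = {}
    for user, tags in answers:
        profiles.setdefault(user, Counter()).update(tags)
    return profiles

def recommend_experts(profiles, question_tags, k=1):
    """Return the k users whose answering history best matches the tags.

    Assumed score: the share of a user's tag counts that falls on the
    question's tags, so specialists rank above generalists.
    """
    def score(user):
        profile = profiles[user]
        return sum(profile[t] for t in question_tags) / sum(profile.values())
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(profiles, key=lambda u: (-score(u), u))[:k]

profiles = build_profiles(answers)
best = recommend_experts(profiles, ["python", "metaclasses"])
print(best)
```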

The network of experts and concepts can be modeled to detect hidden communities in question answering websites.
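To make the network idea concrete, the sketch below links experts who share at least one concept and then groups them into connected components. Connected components are used here only as a simplistic stand-in for a real community detection algorithm, and the expert-to-concept data and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical bipartite data: expert -> concepts they ask or answer about.
expert_concepts = {
    "alice": {"decorators", "metaclasses"},
    "bob": {"metaclasses", "descriptors"},
    "carol": {"pandas", "numpy"},
    "dave": {"numpy"},
}

def project_experts(expert_concepts):
    """Project the expert-concept graph onto experts: link two experts
    whenever they share at least one concept."""
    graph = {user: set() for user in expert_concepts}
    for a, b in combinations(expert_concepts, 2):
        if expert_concepts[a] & expert_concepts[b]:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def communities(graph):
    """Group experts into connected components using a depth-first search."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups

groups = communities(project_experts(expert_concepts))
print(groups)
```

On real data the expert graph would likely be one large component, so a modularity-based or label-propagation method would be needed in place of the component search.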

Team Members

Anika Gupta

Shourabh Rawat

Abstract

Question answering communities such as Yahoo! Answers and StackOverflow have emerged as popular and effective information resources on the web. The questions, together with the entire set of corresponding answers, are a rich resource for exploring many research questions related to question answering. One interesting analysis is to track the lifetime of questions in such environments. Lifetimes vary: a question may be declared closed by the community, may be short-lived because an expert answers it sufficiently, or may generate a lot of interaction among users over a relatively long duration. Analyzing such questions and trying to predict their longevity is one of the goals of our project. Other interesting aspects to explore are identifying questions that have not been sufficiently answered, identifying user expertise for improved recommendations, and automatic tag prediction.

Revised Abstract

Question answering communities such as Yahoo! Answers and StackOverflow have emerged as popular and effective information resources on the web. The questions, together with the entire set of corresponding answers, are a rich resource for exploring many research questions related to question answering. Given a sub-topic, say Python, a workflow of concepts can be generated based on the difficulty of the questions and the number of people who could answer them. Another interesting question concerns community detection among experts. By zooming in on the interactions between users and on their interest in particular concepts, an implicit network of experts can be created. Recommending questions to users based on the specific skills learned from their past question and answer behavior can help direct questions to the right experts.

Datasets

The StackOverflow data that we plan to use is publicly available from StackOverflow under a Creative Commons license. One can download the latest version from here.

Here are some of the statistics about the data:

  • Users 440K (198K questioners, 71K answerers)
  • Questions 1M (69% with accepted answer)
  • Answers 2.8M (26% marked as accepted)
  • Votes 7.6M (93% positive)
  • Favorites 775K actions on 318K questions

Techniques Used

  • We plan to use a wide set of features - incorporating the textual as well as the network attributes.
  • To gain initial insights into the data, we'll use standard topic models such as LDA, and SVMs for classification.
  • We plan to use Gephi/Jung to visualize the graph structure.

Challenges

  • Relatively unexplored dataset. Most prior work has used the Yahoo! Answers data set.
  • Evaluation of the workflow of concepts.
  • Complex network dynamics like the reputation system and bounties. Understanding them is key to getting good results.

Relevant Literature

  • Anderson et al. study the identification of long-lasting questions and predict whether a question has been sufficiently answered on StackOverflow.
  • Adamic et al. study Yahoo! Answers to explore the interactions of users, and present preliminary work on predicting the best answer.
  • Jeon et al. predict the quality of answers using non-textual features.
  • Community detection in graphs