Anderson et al KDD2012
Citation
author = {Ashton Anderson and Daniel P. Huttenlocher and Jon M. Kleinberg and Jure Leskovec},
title = {Discovering value from community activity on focused Question Answering Sites: A case study of Stack Overflow},
booktitle = {KDD},
year = {2012},
pages = {850-858},
ee = {http://doi.acm.org/10.1145/2339530.2339665},
crossref = {DBLP:conf/kdd/2012},
bibsource = {DBLP, http://dblp.uni-trier.de}
Online version
Summary
Question answering websites like Stack Overflow and Quora are growing into large repositories of valuable knowledge through a community-driven knowledge creation process. In this case study of Stack Overflow, the authors study that process and investigate the dynamics of the community activity that shapes the set of answers: how answers and voters arrive over time, and how this eventually influences the final outcome. They therefore treat the entire set of answers to a question as their fundamental unit of analysis instead of analyzing just the best one. The authors observe significant assortativity in the reputations of co-answerers, relationships between reputation and answer speed, and a dependence of the probability of an answer being chosen as the best one on the temporal characteristics of answer arrivals. They then apply their analysis to two prediction tasks: first, predicting the long-term value of a question and its answers; second, predicting whether a question has been appropriately answered.
Dataset Description
The Stack Overflow data used in this paper is publicly available from Stack Overflow under a Creative Commons license. The latest version can be downloaded from here.
Here are some of the statistics about the data used by the authors:
- Users 440K (198K questioners, 71K answerers)
- Questions 1M (69% with accepted answer)
- Answers 2.8M (26% marked as accepted)
- Votes 7.6M (93% positive)
- Favorites 775K actions on 318K questions
Motivation
The motivation of the paper is to understand the community dynamics of question answering sites like Stack Overflow by considering each question together with its set of corresponding answers, rather than as free-standing question-answer pairs. Complex questions often generate multiple good answers from different experts, bringing out different views, and even the best answer viewed in isolation may not capture the knowledge created through community interaction around that question. The authors aim to identify and highlight questions of lasting value as soon as possible after they appear on the site, so that users can be directed to them. For experts who are able to answer difficult questions, there is also the potential to identify questions that have not been successfully answered and highlight them for increased attention.
Features Used
- Questioner features (SA), 4 features total:
* questioner reputation,
* # of questioner’s questions and answers,
* questioner’s percentage of accepted answers on their previous questions.
- Activity and Q/A quality measures (SB), 8 features total:
* # of favorites,
* # of page views,
* # positive and negative votes on question,
* # of answers,
* maximum answerer reputation,
* highest answer score,
* reputation of answerer who wrote highest scoring answer.
- Community process features (SC), 8 features total:
* average answerer reputation,
* median answerer reputation,
* fraction of sum of answerer reputations contributed by max answerer reputation,
* sum of answerer reputations,
* length of answer by highest-reputation answerer,
* # of comments on answer by highest-reputation answerer,
* length of highest-scoring answer,
* # of comments on highest-scoring answer.
- Temporal process features (SD), 7 features total:
* average time between answers,
* median time between answers,
* minimum time between answers,
* time-rank of highest-scoring answer,
* wall-clock time elapsed between question creation and highest-scoring answer,
* time-rank of answer by highest-reputation answerer,
* wall-clock time elapsed between question creation and answer by highest-reputation answerer.
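As an illustration of how some of the community process (SC) and temporal process (SD) features listed above could be derived from a question's answer set, here is a minimal Python sketch. It is not the authors' code: the function name `answer_set_features`, the dictionary keys, the use of seconds as the time unit, and the choice to break ties in favor of the earliest answer are all assumptions made for this example.

```python
from statistics import mean, median


def answer_set_features(question_time, answers):
    """Compute a few SC/SD-style features for one question (hypothetical helper).

    `answers` is a list of dicts with assumed keys:
    'time' (seconds, same clock as question_time), 'score', 'reputation'.
    """
    if not answers:
        raise ValueError("need at least one answer")

    # Order answers by arrival so that time-rank 1 is the earliest answer.
    ordered = sorted(answers, key=lambda a: a["time"])
    reps = [a["reputation"] for a in ordered]

    # Gaps between consecutive answers; empty when there is a single answer.
    gaps = [b["time"] - a["time"] for a, b in zip(ordered, ordered[1:])]

    # Positions of the highest-scoring answer and of the answer written by the
    # highest-reputation answerer (earliest answer wins ties; an assumption).
    best = max(range(len(ordered)), key=lambda i: ordered[i]["score"])
    top_rep = max(range(len(ordered)), key=lambda i: ordered[i]["reputation"])

    total_rep = sum(reps)
    return {
        "avg_answerer_reputation": mean(reps),
        "median_answerer_reputation": median(reps),
        "sum_answerer_reputation": total_rep,
        "max_reputation_fraction": max(reps) / total_rep if total_rep else 0.0,
        "avg_gap": mean(gaps) if gaps else None,
        "median_gap": median(gaps) if gaps else None,
        "min_gap": min(gaps) if gaps else None,
        "time_rank_best": best + 1,
        "elapsed_to_best": ordered[best]["time"] - question_time,
        "time_rank_top_rep": top_rep + 1,
        "elapsed_to_top_rep": ordered[top_rep]["time"] - question_time,
    }


# Made-up example: a question posted at t=0 with three answers.
example_answers = [
    {"time": 600, "score": 3, "reputation": 1000},
    {"time": 120, "score": 10, "reputation": 4000},
    {"time": 1800, "score": 1, "reputation": 5000},
]
features = answer_set_features(0, example_answers)
```

Features that depend on answer text or comments (answer length, # of comments) are left out here because they need fields beyond the three assumed above.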
Task Description and Evaluation
The authors address two concrete prediction tasks.
- Predicting long term value of a question:
* Proxy for long-term value: the number of page views of a question and its answers within a given time frame.
* Analysis is restricted to questions created in the same month, with prediction of page views one year later.
* Binary classification (page views in the bottom vs. top quartile, or bottom vs. top half) on a data set of 28,772 examples, using logistic regression with 10-fold cross-validation.
* Predictions are made using information available 1, 2, 24 and 72 hours after the question was posted.
* Baseline: crowd-sourced features, namely the # of favorites on the question and the # of positive minus negative votes on the question.
- Predicting whether a question has been sufficiently answered.
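For the first task, the setup described above (top vs. bottom quartile of page views, evaluated with 10-fold cross-validation) can be sketched as follows. This is only an illustration of label construction and fold splitting, not the authors' pipeline: the helper names `quartile_labels` and `k_fold_splits`, the shuffling seed, and the handling of ties at the quartile boundary (resolved by sort order) are assumptions, and the page-view numbers are made up.

```python
import random


def quartile_labels(page_views):
    """Return (index, label) pairs: 0 for bottom quartile, 1 for top quartile.

    Questions in the middle two quartiles are dropped, which gives a
    balanced binary classification problem.
    """
    order = sorted(range(len(page_views)), key=lambda i: page_views[i])
    q = len(order) // 4
    bottom = [(i, 0) for i in order[:q]]
    top = [(i, 1) for i in order[len(order) - q:]]
    return bottom + top


def k_fold_splits(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds over n examples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal shuffled indices round-robin into k folds of near-equal size.
    folds = [idx[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train, test


# Made-up page-view counts for 40 questions.
views = [10 * i for i in range(40)]
labeled = quartile_labels(views)
splits = list(k_fold_splits(len(labeled), k=10))
```

In each fold, a classifier (the paper uses logistic regression) would be fit on the training indices and scored on the held-out indices, and the accuracies averaged over the 10 folds.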