Analyzing Community driven Question Answering Sites

From Cohen Courses

Revision as of 09:52, 9 October 2012

Team Members

Anika Gupta

Shourabh Rawat

Abstract

Question answering communities such as Yahoo! Answers and StackOverflow have emerged as popular and effective information resources on the web. The questions, together with the entire set of corresponding answers, form a large resource for exploring many research problems related to question answering. One interesting analysis is to track the lifetime of questions in such environments. Lifetimes vary widely: a question may be declared closed by the community, it may be short-lived because an expert answers it sufficiently, or it may generate a lot of interaction among users over a relatively long duration. Analyzing such questions and predicting their longevity is one of the goals of our project. Other interesting aspects to explore are identifying questions that have not been sufficiently answered and identifying user expertise for improved recommendations and automatic tag prediction.
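As a rough illustration of what "lifetime" could mean in practice, the sketch below measures the span from a question's creation to its last recorded activity and assigns one of the three cases described above. The function names, the 24-hour threshold, and the idea of using the last activity timestamp are all assumptions made for this example, not definitions settled by the project.

```python
from datetime import datetime, timedelta


def question_lifetime(created_at, activity_times):
    """Span from question creation to its last recorded activity.

    `activity_times` is a hypothetical list of datetimes for answers,
    comments, or edits on the question. No activity means zero lifetime.
    """
    if not activity_times:
        return timedelta(0)
    # Clamp at zero in case of timestamps that precede the question.
    return max(max(activity_times) - created_at, timedelta(0))


def lifetime_category(created_at, activity_times, closed=False,
                      short_threshold=timedelta(hours=24)):
    """Label a question as closed, short-lived, or long-lived.

    The 24-hour default threshold is an arbitrary placeholder.
    """
    if closed:
        return "closed"
    if question_lifetime(created_at, activity_times) <= short_threshold:
        return "short-lived"
    return "long-lived"


if __name__ == "__main__":
    asked = datetime(2012, 10, 1, 9, 0)
    activity = [asked + timedelta(hours=2), asked + timedelta(days=3)]
    print(question_lifetime(asked, activity))
    print(lifetime_category(asked, activity))
```

A prediction task could then treat either the raw lifetime or the category as the target variable; which of the two is more useful is left open here.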

Datasets

The Stack Overflow data that we plan to use is publicly available from StackOverflow under a Creative Commons license. The latest version can be downloaded from http://blog.stackoverflow.com/category/cc-wiki-dump/.

Here are some of the statistics about the data:

  • Users 440K (198K questioners, 71K answerers)
  • Questions 1M (69% with accepted answer)
  • Answers 2.8M (26% marked as accepted)
  • Votes 7.6M (93% positive)
  • Favorites 775K actions on 318K questions

Baseline

For our baseline we would use

Techniques Used

  • We plan to use a wide set of features, incorporating textual as well as network attributes.
  • To gain initial insights into the data, we'll use standard topic models such as LDA, and SVMs for classification.
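The combination of textual and network attributes mentioned above can be sketched roughly as follows. This is only an illustrative feature extractor using the standard library: the specific features, the function names, the `(asker, answerer)` pair representation, and the assumption that post bodies mark code with a `<code>` tag are all hypothetical choices for this example. The resulting dictionaries could be fed to a classifier such as an SVM, which is not shown here.

```python
import re
from collections import Counter


def textual_features(body):
    """Simple surface features of a post body (assumed to be HTML-like text)."""
    tokens = re.findall(r"[A-Za-z0-9_']+", body.lower())
    return {
        "num_tokens": len(tokens),
        "num_unique_tokens": len(set(tokens)),
        # Assumption: code snippets are wrapped in a <code> tag.
        "has_code": int("<code>" in body),
    }


def network_features(user, qa_pairs):
    """Features of a user in the asker -> answerer interaction graph.

    `qa_pairs` is a hypothetical list of (asker, answerer) tuples, one per answer.
    """
    asked = Counter(asker for asker, _ in qa_pairs)
    answered = Counter(answerer for _, answerer in qa_pairs)
    helped = {asker for asker, answerer in qa_pairs if answerer == user}
    return {
        "questions_asked": asked[user],
        "answers_given": answered[user],
        "distinct_askers_helped": len(helped),
    }


def feature_vector(body, user, qa_pairs):
    """Merge textual and network features into one dictionary."""
    features = textual_features(body)
    features.update(network_features(user, qa_pairs))
    return features


if __name__ == "__main__":
    pairs = [("alice", "bob"), ("carol", "bob"), ("bob", "dave"), ("alice", "bob")]
    print(feature_vector("How do I sort a list? <code>sorted(xs)</code>", "bob", pairs))
```

Keeping the two feature groups in separate functions makes it easy to compare a text-only model against one that also sees the network attributes.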

Challenges

  • Relatively unexplored dataset. Most prior work has used the Yahoo! Answers dataset.
  • Complex network dynamics such as the reputation system and bounties. Understanding them is key to getting good results.

Relevant Literature

  • Anderson et al.
  • Adamic et al. study Yahoo! Answers to explore the interactions of users, and present preliminary work on predicting the best answer.
  • Jeon et al. predict the quality of answers using non-textual features.