Pal et al CIKM 2010

From Cohen Courses
Revision as of 20:28, 3 October 2012 by Ymiao

This is a paper discussed in Social Media Analysis 10-802 in Fall 2012.

Citation

Expert Identification in Community Question Answering: Exploring Question Selection Bias. Aditya Pal, Joseph A. Konstan. In Proceedings of CIKM 2010, pages 1505-1508.

Online version

Expert Identification in Community Question Answering: Exploring Question Selection Bias

Summary

This paper presents the concept of question selection bias as a new measure for studying the behavior of users in community question answering (CQA). The bias indicates which questions a user prefers to answer with respect to the questions' existing completeness, which can be measured by the status (best answer) or the number of votes of the answers a question has already received. The basic finding is that experts tend to pick questions with low existing completeness.
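To make the idea of existing completeness more concrete, here is a minimal Python sketch. It is not the model from the paper: the field names `votes` and `is_best`, the saturating score in `existing_value`, and the choice of four buckets are all assumptions made for illustration. It only shows how a per-user profile over question completeness could be built from answer events.

```python
from collections import defaultdict

def existing_value(prior_answers):
    """Hypothetical completeness score in [0, 1] for a question, based on
    the answers already posted when a new answerer arrives (assumption)."""
    if not prior_answers:
        return 0.0                      # unanswered question: no completeness
    if any(a["is_best"] for a in prior_answers):
        return 1.0                      # a best answer already exists
    votes = sum(a["votes"] for a in prior_answers)
    return votes / (votes + 1.0)        # saturating in votes (illustrative choice)

def selection_bias(answer_events, n_buckets=4):
    """answer_events: iterable of (user, prior_answers) pairs.
    Returns user -> normalized histogram over completeness buckets."""
    counts = defaultdict(lambda: [0] * n_buckets)
    for user, prior_answers in answer_events:
        ev = existing_value(prior_answers)
        bucket = min(int(ev * n_buckets), n_buckets - 1)
        counts[user][bucket] += 1
    return {u: [c / sum(h) for c in h] for u, h in counts.items()}
```

Under these assumptions, a user whose histogram puts most of its mass in the lowest bucket mostly answers questions that nobody has answered well yet, which is the behavior the paper associates with experts.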

A simple mathematical model is proposed to compute the selection bias quantitatively. Using these bias values as features, the authors apply machine learning (classification) methods to distinguish experts from ordinary users. Experiments with the TurboTax dataset show that selection bias values are superior to other types of features derived from Z-score or text analysis. Combining selection bias and text features further improves classification performance. A comparison of the classifiers shows that Gaussian classification performs consistently better than linear regression and logistic regression.
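The classification step can be illustrated with a small sketch. This summary does not say which form of Gaussian classification the authors use, so the code below assumes a class-conditional Gaussian model with independent features (Gaussian naive Bayes). The function names and the variance floor of `1e-6` are hypothetical choices for this example, and any feature values fed to it would have to come from a bias computation such as the one the paper defines.

```python
import math
from statistics import mean, pvariance

def fit_gaussian(rows):
    """Per-feature (mean, variance) for one class; features are treated
    as independent, which is an assumption of this sketch."""
    return [(mean(col), pvariance(col) + 1e-6) for col in zip(*rows)]

def log_likelihood(x, params):
    """Log density of x under independent per-feature Gaussians."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (xi - mu) ** 2 / (2 * var)
        for xi, (mu, var) in zip(x, params)
    )

def train(features, labels):
    """Returns label -> (log prior, per-feature Gaussian parameters)."""
    model = {}
    for label in set(labels):
        rows = [f for f, l in zip(features, labels) if l == label]
        model[label] = (math.log(len(rows) / len(labels)), fit_gaussian(rows))
    return model

def predict(model, x):
    """Pick the label with the highest log posterior (up to a constant)."""
    return max(model, key=lambda c: model[c][0] + log_likelihood(x, model[c][1]))
```

With each user represented by a short vector of bias features, `train` would be called on the labeled users and `predict` on the remaining ones.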

Dataset

The TurboTax dataset used in this paper has been collected from TurboTax Live Community, a CQA site on preparation of tax returns. Some statistics about the dataset are:

 - Questions: 633,112; askers: 525,143
 - Answers: 688,390; answerers: 130,770
 - 83 experts selected by TurboTax employees 
 - 1367 answerers have provided at least 10 answers

Evaluation

The authors adopt Precision, Recall and F-score as evaluation metrics. The following conclusions arise from their evaluations:

  • CQA experts have the tendency to answer questions with low completeness, which makes their responses more valuable.
  • The selection bias scores modeled in this paper indicate whether a user is an expert. These bias scores are shown to be effective features for identifying CQA experts.
  • On the task of expert identification, Gaussian classification achieves better results than linear regression and logistic regression.
  • Selection bias is not influenced by the dynamics of CQA sites, and can be considered an intrinsic characteristic of CQA users.
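The Precision, Recall and F-score metrics used in the evaluation can be computed as in the following sketch. It assumes the classifier output and the ground truth are both given as sets of user IDs labeled as experts, and that the F-score is the balanced F1; the function name is hypothetical.

```python
def precision_recall_f1(predicted_experts, true_experts):
    """Precision, recall and F1 for a set of predicted expert IDs
    against a set of true expert IDs."""
    predicted, actual = set(predicted_experts), set(true_experts)
    tp = len(predicted & actual)                      # correctly identified experts
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For instance, if four users are predicted to be experts and two of them are among three true experts, precision is 2/4 and recall is 2/3.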

Discussion

  • This paper falls into the area of expert search, which is an important problem in CQA research. The mathematical model for computing selection bias is pretty straightforward. Also, the authors rely on commonly used classifiers for expert identification rather than proposing more sophisticated approaches. Thus, I would take this paper as an empirical study whose emphasis is on the interesting findings based on the selection bias concept.
  • However, most of the work is based specifically on the TurboTax dataset, which may limit the applicability of the approach. For example, TurboTax has manual expert judgments, which are not available in other datasets. Without such labels, expert identification cannot be cast as a supervised classification problem.

Related papers

Here are two papers related to this work.