This is a Paper discussed in Social Media Analysis 10-802 in Fall 2012.
Citation
Expert Identification in Community Question Answering: Exploring Question Selection Bias. Aditya Pal, Joseph A. Konstan. In Proceedings of CIKM 2010, pages 1505-1508.
Online version
Expert Identification in Community Question Answering: Exploring Question Selection Bias
Summary
This paper presents the concept of question selection bias as a new measure to study the expertise of CQA users. The bias indicates a user's preference for answering questions based on their completeness, where a question's completeness can be measured by whether it already has a best answer or by the number of votes its answers have received. The basic finding is that experts tend to pick questions with low existing completeness.
A simple mathematical model is proposed to quantitatively compute the selection bias. Using these bias values as features, the authors apply machine learning (classification) methods to distinguish experts from ordinary users. Experiments with the TurboTax dataset show that selection bias values are superior to other types of features derived from Z-scores or text analysis. Combining selection bias and text features further improves classification performance. A comparison of classifiers shows that Gaussian classification performs consistently better than linear regression and logistic regression.
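The paper's exact bias formulation is not reproduced here. As a minimal illustrative sketch in Python, assume a question's completeness at answering time is proxied by its existing answer count, and score a user by how much more often they pick low-completeness questions than random selection would. The function name and the threshold parameter below are hypothetical, not from the paper:

 def selection_bias(answered, available, low=0):
     """Illustrative proxy for question selection bias (not the paper's
     exact model).
     answered:  completeness values (existing answer counts) of the
                questions the user actually answered.
     available: completeness values of all questions open to the user.
     low:       threshold at or below which a question counts as incomplete.
     """
     # How often the user picks low-completeness questions...
     p_user = sum(1 for c in answered if c <= low) / len(answered)
     # ...versus how often random selection would pick them.
     p_rand = sum(1 for c in available if c <= low) / len(available)
     return p_user - p_rand  # positive => prefers incomplete questions

 # Toy example: an expert-like user who mostly answers unanswered questions.
 print(selection_bias(answered=[0, 0, 1, 0], available=[0, 1, 2, 3, 1, 0]))
 # 0.75 - 0.333... ≈ 0.417

Under this proxy, a strongly positive score corresponds to the expert behavior the paper reports, and a score near zero to a user who selects questions roughly at random.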
Dataset
The TurboTax dataset used in this paper was collected from TurboTax Live Community (https://ttlc.intuit.com/), a CQA site focused on tax-return preparation. Some statistics about the dataset:
- Questions: 633,112; Askers: 525,143
- Answers: 688,390; Answerers: 130,770
- 83 experts selected by TurboTax employees
- 1,367 answerers provided at least 10 answers
Evaluation
The authors adopt Precision, Recall, and F-score as evaluation metrics (a computation sketch follows the list below). The following conclusions arise from their evaluations:
- CQA experts tend to answer questions with low completeness, which makes their responses more valuable.
- The selection bias scores modeled in this paper indicate whether a user is an expert, and they prove to be effective features for identifying CQA experts.
- On the task of expert identification, Gaussian classification achieves better results than linear regression and logistic regression.
- Selection bias is not influenced by the dynamics of CQA sites and can be considered an intrinsic characteristic of CQA users.
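As a minimal sketch of Gaussian classification together with the Precision/Recall/F-score metrics mentioned above: fit one Gaussian per class (expert vs. ordinary) over the bias features and predict whichever class gives the higher likelihood. The diagonal-covariance simplification and the toy data are assumptions for illustration, not the paper's exact setup:

 import numpy as np

 def fit_gaussian(X):
     """Fit a diagonal-covariance Gaussian to feature matrix X (n, d)."""
     return X.mean(axis=0), X.var(axis=0) + 1e-9  # variance floor

 def log_likelihood(X, mu, var):
     """Per-row log density under the diagonal Gaussian."""
     return -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)

 def precision_recall_f1(y_true, y_pred):
     """Standard binary Precision/Recall/F-score (positive class = 1)."""
     tp = np.sum((y_true == 1) & (y_pred == 1))
     fp = np.sum((y_true == 0) & (y_pred == 1))
     fn = np.sum((y_true == 1) & (y_pred == 0))
     p = tp / (tp + fp) if tp + fp else 0.0
     r = tp / (tp + fn) if tp + fn else 0.0
     return p, r, (2 * p * r / (p + r) if p + r else 0.0)

 # Toy data: one selection-bias feature per user; label 1 = expert.
 X = np.array([[0.4], [0.5], [0.45], [0.0], [0.05], [-0.1]])
 y = np.array([1, 1, 1, 0, 0, 0])
 mu1, v1 = fit_gaussian(X[y == 1])
 mu0, v0 = fit_gaussian(X[y == 0])
 y_pred = (log_likelihood(X, mu1, v1) > log_likelihood(X, mu0, v0)).astype(int)
 print(precision_recall_f1(y, y_pred))  # precision, recall, F-score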
Discussion
Legend: (+) marks a plus point, (-) a minus point.
- (+) This paper falls into the area of expert search, which is an important problem in CQA research. The authors present interesting observations on selection bias of expert users in CQA. These findings are useful for question recommendation. For example, we should recommend questions with low completeness (few answers) to experts.
- (-) The mathematical model for selection bias computation is pretty straightforward. Also, the authors rely on commonly used classifiers for expert identification rather than coming up with more sophisticated approaches. Thus, I would take this paper as an empirical study whose emphasis is on the empirical observations of the selection bias concept.
- (-) Most of the work is based specifically on the TurboTax dataset, which may limit the applicability of the approach. For example, TurboTax provides manual expert judgments that are not available in other datasets; without such labels, expert identification cannot be cast as a classification problem.
Related papers
Here are two papers related to this work.
- Tapping on the Potential of Q&A Community by Recommending Answer Providers (http://dl.acm.org/citation.cfm?id=1458204)
  - Gives a detailed overview of CQA expert search
  - Simulates asking and answering behaviors using a generative model
  - Performs expert search based on user interests, which are represented by latent topics
- Knowledge Sharing and Yahoo Answers: Everyone Knows Something (http://dl.acm.org/citation.cfm?id=1367587)
  - Also an empirical study, focusing on user interaction and category characteristics
  - Studies user interests in terms of cross-category entropy, and shows that this entropy correlates strongly with expertise/rates
  - Uses the Yahoo Answers dataset, which is commonly used in CQA research