This is a Paper discussed in Social Media Analysis 10-802 in Fall 2012.
Citation
Expert Identification in Community Question Answering: Exploring Question Selection Bias. Aditya Pal, Joseph A. Konstan. In Proceedings of CIKM 2010, pages 1505-1508.
Online version
Expert Identification in Community Question Answering: Exploring Question Selection Bias
Summary
This paper presents the concept of question selection bias as a new measure to study the expertise of CQA users. The bias indicates a user's preference for answering questions based on their completeness, where a question's completeness can be measured by whether it already has a best answer or by the number of votes its answers have received. The basic finding is that experts tend to pick questions with low existing completeness.
A simple mathematical model is proposed to quantitatively compute the selection bias. Using these bias values as features, the authors apply machine learning (classification) methods to distinguish experts from ordinary users. Experiments with the TurboTax dataset show that selection bias values are superior to other types of features derived from Z-scores or text analysis. Combining selection bias and text features further improves classification performance. A comparison of classifiers shows that Gaussian classification performs consistently better than linear regression and logistic regression.
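The paper's exact bias formulation is not reproduced here. As a minimal illustrative sketch in Python, assume a question's completeness at answering time is proxied by its existing answer count, and score a user by how much more often they pick low-completeness questions than random selection would. The function name and the threshold parameter below are hypothetical, not from the paper:

 def selection_bias(answered, available, low=0):
     """Illustrative proxy for question selection bias (not the paper's
     exact model).
     answered:  completeness values (existing answer counts) of the
                questions the user actually answered.
     available: completeness values of all questions open to the user.
     low:       threshold at or below which a question counts as incomplete.
     """
     # How often the user picks low-completeness questions...
     p_user = sum(1 for c in answered if c <= low) / len(answered)
     # ...versus how often random selection would pick them.
     p_rand = sum(1 for c in available if c <= low) / len(available)
     return p_user - p_rand  # positive => prefers incomplete questions

 # Toy example: an expert-like user who mostly answers unanswered questions.
 print(selection_bias(answered=[0, 0, 1, 0], available=[0, 1, 2, 3, 1, 0]))
 # 0.75 - 0.333... ≈ 0.417

Under this proxy, a strongly positive score corresponds to the expert behavior the paper reports, and a score near zero to a user who selects questions roughly at random.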
Dataset
The TurboTax dataset used in this paper was collected from TurboTax Live Community (https://ttlc.intuit.com/), a CQA site focused on tax-return preparation. Some statistics about the dataset:
- Questions: 633,112; Askers: 525,143
- Answers: 688,390; Answerers: 130,770
- 83 experts selected by TurboTax employees
- 1,367 answerers provided at least 10 answers
Evaluation
The authors adopt Precision, Recall, and F-score as evaluation metrics (a computation sketch follows the list below). The following conclusions arise from their evaluations:
- CQA experts tend to answer questions with low completeness, which makes their responses more valuable.
- The selection bias scores modeled in this paper indicate whether a user is an expert, and they prove to be effective features for identifying CQA experts.
- On the task of expert identification, Gaussian classification achieves better results than linear regression and logistic regression.
- Selection bias is not influenced by the dynamics of CQA sites and can be considered an intrinsic characteristic of CQA users.
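As a minimal sketch of Gaussian classification together with the Precision/Recall/F-score metrics mentioned above: fit one Gaussian per class (expert vs. ordinary) over the bias features and predict whichever class gives the higher likelihood. The diagonal-covariance simplification and the toy data are assumptions for illustration, not the paper's exact setup:

 import numpy as np

 def fit_gaussian(X):
     """Fit a diagonal-covariance Gaussian to feature matrix X (n, d)."""
     return X.mean(axis=0), X.var(axis=0) + 1e-9  # variance floor

 def log_likelihood(X, mu, var):
     """Per-row log density under the diagonal Gaussian."""
     return -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)

 def precision_recall_f1(y_true, y_pred):
     """Standard binary Precision/Recall/F-score (positive class = 1)."""
     tp = np.sum((y_true == 1) & (y_pred == 1))
     fp = np.sum((y_true == 0) & (y_pred == 1))
     fn = np.sum((y_true == 1) & (y_pred == 0))
     p = tp / (tp + fp) if tp + fp else 0.0
     r = tp / (tp + fn) if tp + fn else 0.0
     return p, r, (2 * p * r / (p + r) if p + r else 0.0)

 # Toy data: one selection-bias feature per user; label 1 = expert.
 X = np.array([[0.4], [0.5], [0.45], [0.0], [0.05], [-0.1]])
 y = np.array([1, 1, 1, 0, 0, 0])
 mu1, v1 = fit_gaussian(X[y == 1])
 mu0, v0 = fit_gaussian(X[y == 0])
 y_pred = (log_likelihood(X, mu1, v1) > log_likelihood(X, mu0, v0)).astype(int)
 print(precision_recall_f1(y, y_pred))  # precision, recall, F-score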
Discussion
Legend: (+) marks a plus point, (-) a minus point.
- (+) This paper falls into the area of expert search, which is an important problem in CQA research. The authors present interesting observations on selection bias of expert users in CQA. These findings are useful for question recommendation. For example, we should recommend questions with low completeness (few answers) to experts.
- (-) The mathematical model for selection bias computation is pretty straightforward. Also, the authors rely on commonly used classifiers for expert identification rather than coming up with more sophisticated approaches. Thus, I would take this paper as an empirical study whose emphasis is on the empirical observations of the selection bias concept.
- (-) Most of the work is based specifically on the TurboTax dataset, which may limit the applicability of the approach. For example, TurboTax provides manual expert judgments that are not available in other datasets; without such labels, expert identification cannot be cast as a classification problem.
Related papers
Here are two papers related to this work.
- Tapping on the Potential of Q&A Community by Recommending Answer Providers (http://dl.acm.org/citation.cfm?id=1458204)
  - Gives a detailed overview of CQA expert search
  - Simulates asking and answering behaviors using a generative model
  - Performs expert search based on user interests, which are represented by latent topics
- Knowledge Sharing and Yahoo Answers: Everyone Knows Something (http://dl.acm.org/citation.cfm?id=1367587)
  - Also an empirical study, focusing on user interaction and category characteristics
  - Studies user interests in terms of cross-category entropy, and shows that this entropy correlates strongly with expertise/rates
  - Uses the Yahoo Answers dataset, which is commonly used in CQA research