Adamic et al Knowledge sharing and yahoo answers WWW'08

Citation

Lada A. Adamic, Jun Zhang, Eytan Bakshy, and Mark S. Ackerman. 2008. Knowledge sharing and yahoo answers: everyone knows something. In Proceeding of the 17th international conference on World Wide Web (WWW '08). ACM, New York, NY, USA, 665-674.

Online version

link

Summary

Yahoo Answers (YA) is a question-answer forum used not only for sharing technical knowledge but also for seeking advice, gathering opinions, etc. This paper analyzes knowledge sharing through questions and answers on YA. It examines the forum categories and clusters them according to content characteristics and patterns of interaction among users. Some forum categories resemble expertise-sharing forums; others incorporate discussion, everyday advice, and support. Some users participate in many kinds of topics, while others focus on very narrow topics. The entropy of user interests is also measured, and its relation to answer quality is analyzed. Finally, a simple method is proposed to predict the best answer to a question.

Description of the method

  • Basic Characteristics

The type of each category is inferred indirectly from attributes such as average thread length (the number of replies per post) and average post length (how verbose the answers are).

The authors observed that factual questions received fewer replies, but those replies were relatively lengthy. Discussion categories, in contrast, attract many replies of moderate length.

Another characteristic of a category is its asker/replier overlap. Categories centered on technical expertise, in which a few experts answer the questions and novices ask them, have little overlap. Categories related to discussion and advice, on the other hand, have higher overlap.

The most active categories (> 1000 posted questions) were clustered with K-Means using three features: thread length, content length, and asker/replier overlap. The three resulting clusters consist of discussion forums (a high proportion of users who both pose and answer questions), advice and common-sense expertise, and factual answers.
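As a rough illustration of these two category features, the sketch below computes an average thread length and an asker/replier overlap from toy threads. The overlap definition used here (users who both ask and reply, divided by all users in the category) is an assumption, since this summary does not give the paper's exact formula; the function names and the data are invented for the example.

```python
def average_thread_length(threads):
    """Mean number of replies per question."""
    if not threads:
        return 0.0
    return sum(len(replies) for _, replies in threads) / len(threads)


def asker_replier_overlap(threads):
    """Share of users who both ask and reply (assumed definition)."""
    askers = {asker for asker, _ in threads}
    repliers = {r for _, replies in threads for r in replies}
    users = askers | repliers
    if not users:
        return 0.0
    return len(askers & repliers) / len(users)


# Toy threads: (asker id, list of replier ids). Invented data.
programming = [("a1", ["e1"]), ("a2", ["e1", "e2"]), ("a3", ["e1"])]
wrestling = [
    ("u1", ["u2", "u3", "u4"]),
    ("u2", ["u1", "u3"]),
    ("u3", ["u1", "u2", "u4"]),
]

print(asker_replier_overlap(programming))  # 0.0
print(asker_replier_overlap(wrestling))    # 0.75
```

In this toy data the expertise-style category has no user who both asks and answers, while the discussion-style category has most users doing both.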
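The clustering step can be sketched with a minimal k-means in plain Python. This is not the authors' code: the feature values are invented and assumed to be already scaled to [0, 1], and the initial centroids are fixed by hand (`init_indices`) so the run is deterministic.

```python
import math


def kmeans(points, init_indices, max_iter=100):
    """Plain k-means; the initial centroids are the points at init_indices."""
    centroids = [list(points[i]) for i in init_indices]
    k = len(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Invented features per category: (thread length, content length, overlap).
points = [
    (0.9, 0.3, 0.8), (0.8, 0.4, 0.7),    # discussion-like
    (0.5, 0.5, 0.4), (0.6, 0.5, 0.3),    # advice-like
    (0.2, 0.9, 0.1), (0.1, 0.8, 0.05),   # factual-like
]
labels, centroids = kmeans(points, init_indices=[0, 2, 4])
print(labels)  # [0, 0, 1, 1, 2, 2]
```

With real data the features would need to be put on a comparable scale first, since thread length and content length are measured in different units.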

An asker-replier graph was created by connecting users who ask questions to users who answer them. This graph was analyzed as follows:

    • Degree Distribution

The indegree and outdegree distributions follow a power law. Users differ in their activity level: some answer many questions; some merely stop by to ask or answer a question or two. The authors also compared the distributions for different categories.
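A small sketch of how such degree distributions could be tabulated from an edge list follows. It assumes that edges point from asker to replier, so that indegree counts how many distinct users someone has answered; the edge direction, the function name, and the data are assumptions made for the example.

```python
from collections import Counter


def degree_histograms(edges):
    """Return (indegree histogram, outdegree histogram) over all users."""
    unique = set(edges)  # count each asker -> replier pair once
    out_deg = Counter(asker for asker, _ in unique)
    in_deg = Counter(replier for _, replier in unique)
    users = {u for edge in unique for u in edge}
    in_hist = Counter(in_deg.get(u, 0) for u in users)
    out_hist = Counter(out_deg.get(u, 0) for u in users)
    return in_hist, out_hist


# Invented edges: (asker, replier).
edges = [("a1", "e1"), ("a2", "e1"), ("a3", "e1"), ("a3", "e2"), ("e2", "e1")]
in_hist, out_hist = degree_histograms(edges)
print(sorted(in_hist.items()))   # [(0, 3), (1, 1), (4, 1)]
print(sorted(out_hist.items()))  # [(0, 1), (1, 3), (2, 1)]
```

On a large graph, a power law would show up as a roughly straight line when these histograms are plotted on log-log axes.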

    • Analysis of Ego Networks

In discussion categories, the neighbors of some highly active users are also highly connected among themselves. This suggests that these users are more likely to be “discussion persons”. This is not the case for the programming category.

    • Strongly Connected Component

The Wrestling category has a large strongly connected component, which shows that users do not have specific roles in the network. The Programming category has almost no strongly connected component, which shows a separation of user roles into “helpers” and “askers”.
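To make the contrast concrete, the following sketch finds the largest strongly connected component of a small directed graph by intersecting forward and backward reachability. This simple approach is chosen for readability and only suits small graphs; the graphs, names, and asker-to-replier edge direction are assumptions for illustration.

```python
from collections import defaultdict


def reachable(start, adj):
    """All nodes reachable from start by following adj."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def largest_scc(edges):
    """Largest set of users who can all reach each other."""
    fwd, rev = defaultdict(set), defaultdict(set)
    nodes = set()
    for asker, replier in edges:
        fwd[asker].add(replier)
        rev[replier].add(asker)
        nodes.update((asker, replier))
    best, assigned = set(), set()
    for node in nodes:
        if node in assigned:
            continue
        # A node's component is what it reaches AND what reaches it.
        component = reachable(node, fwd) & reachable(node, rev)
        assigned |= component
        if len(component) > len(best):
            best = component
    return best


# Invented toy graphs.
wrestling = [("u1", "u2"), ("u2", "u3"), ("u3", "u1"), ("u4", "u1")]
programming = [("a1", "e1"), ("a2", "e1"), ("a3", "e2")]
print(sorted(largest_scc(wrestling)))  # ['u1', 'u2', 'u3']
print(len(largest_scc(programming)))   # 1
```

In the toy programming graph, askers never answer and helpers never ask, so no component is larger than a single user.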

    • Motif Analysis

This analysis measures how many interactions are reciprocal (the asker becomes a replier for another question) and how often triads are complete (three users who have all replied to each other). The Wrestling and Marriage categories have a high number of reciprocal links and triads.
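The two quantities can be counted as in the sketch below. It assumes a complete triad means that all three pairs of users are linked in both directions, which is one reading of "all replied to each other"; the paper's exact motif definitions are not given here, and the data is invented.

```python
from itertools import combinations


def reciprocity_and_triads(edges):
    """Return (share of connected pairs that are reciprocal, complete triads)."""
    directed = {(a, b) for a, b in edges if a != b}
    connected = {frozenset(e) for e in directed}
    reciprocal = {frozenset((a, b)) for a, b in directed if (b, a) in directed}
    nodes = sorted({u for e in directed for u in e})
    # Assumed definition: every pair in the trio is a reciprocal pair.
    triads = sum(
        1
        for trio in combinations(nodes, 3)
        if all(frozenset(pair) in reciprocal for pair in combinations(trio, 2))
    )
    share = len(reciprocal) / len(connected) if connected else 0.0
    return share, triads


# Invented edges: u1, u2 and u3 all answer each other; u4 only asks.
edges = [
    ("u1", "u2"), ("u2", "u1"),
    ("u2", "u3"), ("u3", "u2"),
    ("u1", "u3"), ("u3", "u1"),
    ("u4", "u1"),
]
print(reciprocity_and_triads(edges))  # (0.75, 1)
```

Checking every trio of users is fine for a toy graph but grows cubically with the number of users, so a real analysis would iterate over neighbors instead.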

  • Expertise Depth

A random sample of 100 questions was manually classified according to the level of expertise, on a five-level scale, needed to answer each question. The authors found that in the Programming category only 1% of questions required expertise above level 3. The questions on YA are therefore very shallow.

  • Relationship between Categories

The relationship between two categories is analyzed by estimating the number of users who participate in both. Computer-centered categories clustered together, as did the Politics and Government categories.
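A minimal sketch of a user-overlap measure between categories is shown below. The raw count of shared users follows the description above; the Jaccard normalization is an added assumption to make categories of different sizes comparable, and the category memberships are invented.

```python
from itertools import combinations


def shared_users(participants, cat_a, cat_b):
    """Number of users active in both categories."""
    return len(participants[cat_a] & participants[cat_b])


def jaccard(participants, cat_a, cat_b):
    """Shared users divided by all users of either category (assumed measure)."""
    union = participants[cat_a] | participants[cat_b]
    if not union:
        return 0.0
    return shared_users(participants, cat_a, cat_b) / len(union)


# Invented category memberships.
participants = {
    "Programming": {"u1", "u2", "u3"},
    "Software": {"u2", "u3", "u4"},
    "Politics": {"u5", "u6"},
    "Government": {"u5", "u6", "u7"},
}

for a, b in combinations(sorted(participants), 2):
    print(a, b, shared_users(participants, a, b))
```

Pairs with a high score, such as the two computer-related categories here, would end up close together when the categories are clustered.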

  • User Entropy

The entropy of a user is calculated from the different categories in which he or she participates. It is calculated such that the hierarchical organization of the categories is preserved. The distribution of users by entropy is flat.

The distribution of users by best answers turns out to be skewed.
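For intuition, the sketch below computes the ordinary (flat) Shannon entropy of a user's answers over categories. It deliberately ignores the category hierarchy, so it is a simplification and not the hierarchical measure used in the paper; the function name and the data are invented.

```python
import math
from collections import Counter


def category_entropy(answer_categories):
    """Flat Shannon entropy (in bits) of a user's answers over categories."""
    counts = Counter(answer_categories)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


focused = ["Programming"] * 8
broad = ["Programming", "Marriage", "Wrestling", "Politics"] * 2

print(category_entropy(focused))  # 0.0
print(category_entropy(broad))    # 2.0
```

A user who answers in a single category has entropy 0, while a user who spreads answers evenly over four categories has entropy of 2 bits.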

  • Correlation between user focus and best answer

Users with low entropy would be expected to receive more best answers because of their focus on, and expertise in, a field. Contrary to this expectation, the authors did not find any significant correlation between the two.

A correlation did appear, however, for factual categories. The authors calculated the second-level user entropy for technical categories and found a significant correlation between a user’s entropy within those categories and the user’s best answers.

The authors also found a correlation between a user’s answers in a category and the user’s proportion of best answers in that category, across all of YA.

  • Predicting Best Answer

To predict the best answer, logistic regression was used with reply length, thread length, the number of the user’s best answers, and the number of the user’s replies as features.
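A minimal sketch of such a classifier, written as plain-Python logistic regression trained by gradient descent, is given below. It is not the authors' implementation: the training rows are invented, the features are assumed to be pre-scaled to [0, 1], and the learning rate and epoch count are arbitrary choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Estimated probability that the reply is chosen as best answer."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


# Invented rows: [reply length, thread length, user's best answers, user's replies].
rows = [
    [0.9, 0.2, 0.8, 0.7], [0.8, 0.3, 0.6, 0.5], [0.7, 0.1, 0.7, 0.6],
    [0.2, 0.8, 0.1, 0.3], [0.3, 0.9, 0.2, 0.4], [0.1, 0.7, 0.1, 0.2],
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = chosen as best answer

weights, bias = train_logistic(rows, labels)
predictions = [int(predict_proba(weights, bias, x) > 0.5) for x in rows]
print(predictions)
```

In this toy data, long replies in short threads are the ones marked as best, so the fitted weight on reply length comes out positive and the weight on thread length negative.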

Datasets used

The paper used one month of YA activity. The dataset includes 8,452,337 answers to 1,178,983 questions, with 433,402 unique repliers and 495,414 unique askers. Of those users, 211,372 both asked and replied. The questions were posted in 25 top-level and 1002 (continually expanding) lower-level categories.

Experimental Results

For all categories, the length of the reply and the number of other answers the asker had to choose from were the two most significant features. Best-answer prediction accuracy was 72.9%, 69.3%, and 69.2% for the Programming, Marriage, and Wrestling categories, respectively.

Discussion

This paper proposes some new features that could be used in analyzing social media content. The authors contrasted content properties and social network interactions across different YA categories (or topics). They found that the categories can be clustered according to thread length and the overlap between the set of users who asked and the set of users who replied. They identified related categories by asking whether a user who answers questions in one category is also likely to answer in another. They also attempted to predict best answers based on attributes of the question and the replier. The results showed that the very basic metric of reply length, together with the number of competing answers and the track record of the user, was most predictive of whether an answer would be selected.