Agichtein et al Finding High-Quality Content in Social Media WSDM’08

From Cohen Courses

Citation

Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, and Gilad Mishne. 2008. Finding high-quality content in social media. In Proceedings of the international conference on Web search and web data mining (WSDM '08). ACM, New York, NY, USA, 183-194.

Online version

PDF

Summary

The quality of user-generated content varies drastically, from excellent to abusive spam. This paper discusses different kinds of features that can be exploited to automatically identify high-quality content on social media sites. In addition to the content itself, the paper also proposes using non-content information such as links between items and explicit quality ratings from members of the community. The authors experimented on the Yahoo! Answers service and focused on two tasks: identifying high-quality questions and identifying high-quality answers.

Description of the method

The authors propose three main kinds of features.

  • Intrinsic Content Quality
    • All word n-grams up to length 5 that appear in the collection more than 3 times
    • Punctuation and Typos
      • Punctuation
      • Capitalization
      • Spacing Density (% of all characters)
      • Character level entropy of text
      • Number of spelling mistakes
      • Number of out-of-vocabulary words
    • Syntactic and Semantic Complexity
      • Average number of syllables per word
      • Entropy of word lengths
    • Grammaticality
      • POS tag n-grams
      • Formality Score
      • Distance between trigram language model and Wikipedia / Yahoo Answers language models
  • User relationships
    • User-Item (Questions and Answers) Graph
      • Edge represents type of relationship like “User U answers question Q”
    • User-User Graph
      • Edges represent implicit relationships like “User U answered a question from user V”
  • Usage Statistics
    • Number of clicks on the item
    • Dwell time
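Several of the intrinsic content-quality signals listed above are simple surface statistics of the text. The following sketch computes a few of them; the exact definitions (and the feature names) are illustrative assumptions, not the authors' implementation.

```python
import math
import string

def intrinsic_features(text):
    """Surface-quality signals of a piece of text: punctuation density,
    capitalization fraction, spacing density, and character-level entropy.
    Illustrative definitions only, not the paper's exact feature set."""
    n = len(text) or 1
    # Punctuation density: fraction of characters that are punctuation.
    punct = sum(ch in string.punctuation for ch in text) / n
    # Capitalization: fraction of characters that are upper-case.
    caps = sum(ch.isupper() for ch in text) / n
    # Spacing density: fraction of characters that are whitespace.
    space = sum(ch.isspace() for ch in text) / n
    # Character-level entropy of the text (in bits).
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"punct": punct, "caps": caps, "space": space, "entropy": entropy}
```

Low entropy or an unusually high punctuation density (e.g., "!!!!!") is the kind of signal that separates spammy items from well-edited text.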

A point to consider when using usage statistics is that an item in a popular category is expected to receive more clicks than items in other categories. So instead of using the raw number of clicks, the difference between the number of clicks and the expected number of clicks for the item's category is used.
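The category normalization above can be sketched as follows; the function name and the use of the category mean as the "expected" click count are assumptions for illustration.

```python
def normalized_clicks(item_clicks, category_clicks):
    """Deviation of an item's click count from the expected click count
    of its category, here taken to be the category mean. This corrects
    for popular categories inflating raw click counts."""
    expected = sum(category_clicks) / len(category_clicks)
    return item_clicks - expected
```

An item with 10 clicks in a category averaging 6 clicks scores +4, while the same 10 clicks in a category averaging 50 would score -40.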

Datasets used

The dataset, from Yahoo! Answers, consists of 6,665 questions and 8,366 question/answer pairs. Each question and answer was labeled for quality by human editors, who graded items for well-formedness, readability, utility, and interestingness. For answers, correctness was also measured. In addition, a high-level type (informational, advice, poll, etc.) was assigned to each question.

Experimental Results

  • Question Quality

The most significant features for question quality classification, according to a chi-squared test, are as follows:

      • Average number of "stars" given to questions by the same asker.
      • The punctuation density in the question's subject.
      • The question's category (assigned by the asker).
      • Normalized Clickthrough: The number of clicks on the question thread, normalized by the average number of clicks for all questions in its category.
      • Average number of "Thumbs up" received by answers written by the asker of the current question.
      • Number of words per sentence.
      • Average number of answers with references (URLs) given by the asker of the current question.
      • Fraction of questions asked by the asker in which he opens the question's answers to voting (instead of picking the best answer by hand).
      • Average length of the questions by the asker.
      • The number of “best answers” authored by the user.
      • The number of days the user was active in the system.
      • “Thumbs up” received by the answers written by the asker of the current question, minus “thumbs down”, divided by the total number of “thumbs” received.
      • Clicks over Views: The number of clicks on a question thread divided by the number of times the question thread was retrieved as a search result.
      • The KL-divergence between the question's language model and a model estimated from a collection of question answered by the Yahoo editorial team.
      • The fraction of words that are not in the list of the top-10 words in the collection, ranked by frequency.
      • The number of “capitalization errors” in the question (e.g., sentence not starting with a capitalized word).
      • The number of days that have passed since the asker wrote his/her first question or answer in the system.
      • The total number of answers of the asker that have been selected as the “best answer”.
      • The number of questions that the asker has asked in his/her most active category, over the total number of questions that the asker has asked.
      • The entropy of the part-of-speech tags of the question.
  • Answer Quality

The most significant features for answer quality classification, according to a chi-squared test, are as follows:

      • Answer length.
      • The number of words in the answer with a corpus frequency larger than c.
      • The number of “thumbs up” minus “thumbs down” received by the answerer, divided by the total number of “thumbs” s/he has received.
      • The entropy of the trigram character-level model of the answer.
      • The fraction of answers of the answerer that have been picked as best answers (either by the askers of such questions, or by a community voting).
      • The unique number of words in the answer.
      • Average number of abuse reports received by the answerer over all his/her questions and answers.
      • Average number of abuse reports received by the answerer over his/her answers.
      • The non-stopword word overlap between the question and the answer.
      • The Kincaid score of the answer.
      • The average number of answers received by the questions asked by the asker of this answer.
      • The ratio between the length of the question and the length of the answer.
      • The number of “thumbs up” minus “thumbs down” received by the answerer.
      • The average numbers of “thumbs” received by the answers to other questions asked by the asker of this answer.
      • The entropy of the unigram character-level model of the answer.
      • The KL-divergence between the answer's language model and a model estimated from the Wikipedia discussion pages.
      • Number of abuse reports received by the asker of the question being answered.
      • The sum of the lengths of all the answers received by the asker of the question being answered.
      • The sum of the “thumbs down” received by the answers received by the asker of the question being answered.
      • The average number of answers with votes in the questions asked by the asker of the question being answered.
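Both feature rankings above come from a chi-squared test of each feature against the quality label. For a binarized feature this reduces to a chi-squared statistic over a 2x2 contingency table; the sketch below is a toy version of that computation, not the authors' pipeline.

```python
def chi_squared(feature, label):
    """Chi-squared statistic for a binary feature against a binary
    quality label. Higher values mean the feature deviates more from
    independence with the label, so it ranks higher for selection."""
    # Build the 2x2 contingency table of observed counts.
    obs = [[0, 0], [0, 0]]
    for f, y in zip(feature, label):
        obs[f][y] += 1
    n = len(feature)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            row = obs[i][0] + obs[i][1]   # row marginal
            col = obs[0][j] + obs[1][j]   # column marginal
            exp = row * col / n           # expected count under independence
            if exp:
                stat += (obs[i][j] - exp) ** 2 / exp
    return stat
```

A feature perfectly aligned with the label attains the maximum statistic (n for a balanced 2x2 table), while an independent feature scores near 0; ranking features by this score yields lists like the ones above.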

Discussion

Using the features proposed in the paper, the authors achieved 76.1% accuracy on the question quality prediction task and 87.3% accuracy on the answer quality prediction task. These results are comparatively good and point to useful features for this kind of task. The paper also demonstrates that the quality of social media content depends not only on the content itself but also on user relationships and usage statistics.