Castillo 2011

From Cohen Courses
Revision as of 22:05, 1 October 2012 by Zhouyu (talk | contribs)

Castillo http://www.ra.ethz.ch/cdstore/www2011/proceedings/p675.pdf


Citation

@inproceedings{conf/www/CastilloMP11,
 author    = {Carlos Castillo and
              Marcelo Mendoza and
              Barbara Poblete},
 title     = {Information credibility on twitter},
 booktitle = {WWW},
 year      = {2011},
 pages     = {675-684},
 ee        = {http://doi.acm.org/10.1145/1963405.1963500},
}





Abstract from the paper

Online version

Summary

Data Collection

Automatic Credibility Analysis

The paper groups features into four types according to their scope: message-based features, user-based features, topic-based features, and propagation-based features.

  • Message-based features consider characteristics of messages; these features can be Twitter-independent or Twitter-dependent. Twitter-independent features include the length of a message, whether the text contains exclamation or question marks, and the number of positive/negative sentiment words in the message. Twitter-dependent features include whether the tweet contains a hashtag and whether the message is a re-tweet.

  • User-based features consider characteristics of the users who post messages, such as registration age, number of followers, number of followees (“friends” in Twitter), and the number of tweets the user has authored in the past.

  • Topic-based features are aggregates computed from the previous two feature sets; for example, the fraction of tweets that contain URLs, the fraction of tweets with hashtags, and the fractions of positive and negative sentiment in a set.

  • Propagation-based features consider characteristics related to the propagation tree that can be built from the re-tweets of a message. These include features such as the depth of the re-tweet tree and the number of initial tweets of a topic.
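To make the feature scopes concrete, here is a minimal sketch of how a few of them might be computed. It is not the authors' implementation: the `Tweet` record, the function names, and the tiny sentiment word lists are all hypothetical, and the per-message `has_url` flag is added only so that the topic-level URL fraction can be aggregated from it. User-based features are omitted because they come from account metadata rather than from the text.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical toy lexicons, for illustration only.
POSITIVE_WORDS = {"good", "great", "safe"}
NEGATIVE_WORDS = {"bad", "fake", "dead"}


@dataclass
class Tweet:
    """Hypothetical minimal tweet record."""
    tweet_id: str
    text: str
    retweet_of: Optional[str] = None  # id of the re-tweeted tweet, if any


def message_features(tweet: Tweet) -> Dict[str, float]:
    """Message-based features: some Twitter-independent, some Twitter-dependent."""
    words = re.findall(r"[a-z']+", tweet.text.lower())
    return {
        # Twitter-independent
        "length": float(len(tweet.text)),
        "has_exclamation": float("!" in tweet.text),
        "has_question": float("?" in tweet.text),
        "positive_words": float(sum(w in POSITIVE_WORDS for w in words)),
        "negative_words": float(sum(w in NEGATIVE_WORDS for w in words)),
        "has_url": float(re.search(r"https?://\S+", tweet.text) is not None),
        # Twitter-dependent
        "has_hashtag": float(re.search(r"#\w+", tweet.text) is not None),
        "is_retweet": float(tweet.retweet_of is not None),
    }


def topic_features(tweets: List[Tweet]) -> Dict[str, float]:
    """Topic-based features: aggregates of message features over a set of tweets."""
    if not tweets:
        raise ValueError("a topic needs at least one tweet")
    rows = [message_features(t) for t in tweets]
    n = len(rows)
    return {
        "fraction_with_url": sum(r["has_url"] for r in rows) / n,
        "fraction_with_hashtag": sum(r["has_hashtag"] for r in rows) / n,
        "fraction_negative": sum(r["negative_words"] > 0 for r in rows) / n,
    }


def retweet_depth(tweets: List[Tweet]) -> int:
    """Propagation-based feature: depth of the deepest re-tweet chain.

    Assumes the re-tweet links form a forest (no cycles).
    """
    parent = {t.tweet_id: t.retweet_of for t in tweets}

    def depth(tweet_id: str) -> int:
        steps = 0
        # Walk up until we reach an original tweet or an unknown parent.
        while parent.get(tweet_id) is not None:
            tweet_id = parent[tweet_id]
            steps += 1
        return steps

    return max((depth(t.tweet_id) for t in tweets), default=0)


if __name__ == "__main__":
    topic = [
        Tweet("1", "Quake hits city! http://example.com #quake"),
        Tweet("2", "RT Quake hits city! http://example.com #quake", retweet_of="1"),
        Tweet("3", "Is this fake?", retweet_of="2"),
        Tweet("4", "All good here"),
    ]
    print(topic_features(topic))
    print(retweet_depth(topic))
```

Keeping the topic-level function as a pure aggregate over `message_features` mirrors the description above, where topic-based features are derived from the lower-level feature sets rather than measured separately.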

Automatically Assessing Credibility

The authors apply standard machine learning techniques; the best results they report were obtained with a J48 decision tree.

Results for the credibility classification:

Class          TP Rate   FP Rate   Prec.   Recall   F1
A (“true”)     0.825     0.108     0.874   0.825    0.849
B (“false”)    0.892     0.175     0.849   0.892    0.87
W. Avg.        0.860     0.143     0.861   0.860    0.86
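As a quick consistency check on the table, the per-class F1 values can be recomputed from the reported precision and recall using the standard definition F1 = 2PR / (P + R). The snippet below is only an illustrative check; the function name and the `reported` dictionary are not from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per class, copied from the table above.
reported = {
    "A (true)": (0.874, 0.825, 0.849),
    "B (false)": (0.849, 0.892, 0.87),
}

for label, (precision, recall, reported_f1) in reported.items():
    recomputed = round(f1_score(precision, recall), 3)
    print(f"{label}: recomputed={recomputed}, reported={reported_f1}")
```

Both recomputed values agree with the table to three decimals. Because this is a two-class problem, the FP rate of each class is also one minus the TP rate of the other class (0.108 = 1 − 0.892 and 0.175 = 1 − 0.825).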


Feature Level Analysis

Top features that contribute most to deciding credibility:

  • Whether tweets contain a URL is the feature at the root of the decision tree.
  • Sentiment-based features, such as the fraction of negative sentiment.
  • Low-credibility news is mostly propagated by users who have not written many messages in the past.
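For intuition on why a feature ends up at the root of the tree: J48 is Weka's implementation of C4.5, which chooses each split by gain ratio (information gain normalized by the entropy of the split itself). The sketch below computes gain ratio for a binary feature on made-up labels; the data and function names are hypothetical and do not come from the paper.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain of splitting on `feature`, divided by the split entropy."""
    n = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy after the split, weighted by group size.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


if __name__ == "__main__":
    # Made-up toy topics: four credible, four not credible.
    labels = ["credible"] * 4 + ["not"] * 4
    has_url = [True, True, True, False, False, False, False, True]
    has_question = [True, False, True, False, True, False, True, False]
    print("has_url:", gain_ratio(has_url, labels))
    print("has_question:", gain_ratio(has_question, labels))
```

On this toy data the URL feature separates the classes better than the question-mark feature, so it would be picked first; in the paper the corresponding ranking comes from the real labeled topics, not from data like this.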

Interesting Aspect

I like the coding scheme of this paper. It is reasonable and comprehensive. Some of the conclusions drawn in the paper are interesting to look at. For example:

  • Among several other features, newsworthy topics tend to include URLs and to have deep propagation trees
  • Among several other features, credible news is propagated through authors who have previously written a large number of messages, originates at a single or a few users in the network, and has many re-posts.

Related Papers

  • T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th international conference on World wide web, WWW ’10, pages 851–860, New York, NY, USA, April 2010. ACM.

  • J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling. TwitterStand: news in tweets. In GIS ’09: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 42–51, New York, NY, USA, November 2009. ACM Press.