Castillo 2011
Latest revision as of 23:39, 8 October 2012
Castillo http://www.ra.ethz.ch/cdstore/www2011/proceedings/p675.pdf
Citation
@inproceedings{conf/www/CastilloMP11,
  author = {Carlos Castillo and Marcelo Mendoza and Barbara Poblete},
  title = {Information credibility on twitter},
  booktitle = {WWW},
  year = {2011},
  pages = {675-684},
  ee = {http://doi.acm.org/10.1145/1963405.1963500},
}
Abstract from the paper
We analyze the information credibility of news propagated through Twitter, a popular microblogging service. Previous research has shown that most of the messages posted on Twitter are truthful, but the service is also used to spread misinformation and false rumors, often unintentionally. On this paper we focus on automatic methods for assessing the credibility of a given set of tweets. Specifically, we analyze microblog postings related to trending topics, and classify them as credible or not credible, based on features extracted from them. We use features from users posting and re-posting behavior, from the text of the posts, and from citations to external sources. We evaluate our methods using a significant number of human assessments about the credibility of items on a recent sample of Twitter postings. Our results shows that there are measurable differences in the way messages propagate, that can be used to classify them automatically as credible or not credible.
Online version
pdf link to the paper: http://www.ra.ethz.ch/cdstore/www2011/proceedings/p675.pdf
Summary
Data Collection
Automatic Credibility Analysis
The paper uses four types of features, grouped by scope: message-based, user-based, topic-based, and propagation-based features.
- Message-based features consider characteristics of messages and can be Twitter-independent or Twitter-dependent. Twitter-independent features include the length of a message, whether the text contains exclamation or question marks, and the number of positive/negative sentiment words in a message. Twitter-dependent features include whether the tweet contains a hashtag and whether the message is a re-tweet.
- User-based features consider characteristics of the users who post messages, such as registration age, number of followers, number of followees (“friends” in Twitter), and the number of tweets the user has authored in the past.
- Topic-based features are aggregates computed from the previous two feature sets; for example, the fraction of tweets that contain URLs, the fraction of tweets with hashtags, and the fraction of positive and negative sentiment in a set.
- Propagation-based features consider characteristics of the propagation tree that can be built from the re-tweets of a message, such as the depth of the re-tweet tree or the number of initial tweets of a topic.
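To make the feature groups more concrete, here is a minimal Python sketch of how message-based features could be computed per tweet and then aggregated into topic-based features. It is only an illustration, not the authors' implementation: the `Tweet` class, the function names, the tiny sentiment lexicons, and the example tweets are all assumptions made up for this sketch.

```python
from dataclasses import dataclass

# Hypothetical toy lexicons; the sentiment word lists used in the paper are not reproduced here.
POSITIVE_WORDS = {"good", "great", "safe"}
NEGATIVE_WORDS = {"bad", "fake", "terrible"}


@dataclass
class Tweet:
    text: str
    is_retweet: bool = False


def message_features(tweet: Tweet) -> dict:
    """Message-based features for a single tweet."""
    # Strip common punctuation so that "fake!" matches the lexicon entry "fake".
    words = [w.strip(".,!?#").lower() for w in tweet.text.split()]
    return {
        "length": len(tweet.text),
        "has_exclamation": "!" in tweet.text,
        "has_question": "?" in tweet.text,
        "positive_words": sum(w in POSITIVE_WORDS for w in words),
        "negative_words": sum(w in NEGATIVE_WORDS for w in words),
        "has_hashtag": "#" in tweet.text,
        "has_url": "http://" in tweet.text or "https://" in tweet.text,
        "is_retweet": tweet.is_retweet,
    }


def topic_features(tweets: list) -> dict:
    """Topic-based features: fractions aggregated over all tweets of one topic."""
    if not tweets:
        raise ValueError("a topic needs at least one tweet")
    feats = [message_features(t) for t in tweets]
    n = len(feats)
    return {
        "fraction_with_url": sum(f["has_url"] for f in feats) / n,
        "fraction_with_hashtag": sum(f["has_hashtag"] for f in feats) / n,
        "fraction_positive": sum(f["positive_words"] > 0 for f in feats) / n,
        "fraction_negative": sum(f["negative_words"] > 0 for f in feats) / n,
    }


# Invented example tweets for one topic.
topic = [
    Tweet("Bridge closed, details at http://example.com #traffic"),
    Tweet("This is fake!", is_retweet=True),
    Tweet("Is it safe?"),
    Tweet("Terrible news http://example.com"),
]
summary = topic_features(topic)
```

User-based features (registration age, follower counts, and so on) would come from account metadata rather than from the text, so they are left out of this sketch.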
Automatically Assessing Credibility
The authors apply standard machine learning techniques; the best results they report use a J48 decision tree.
Results for the credibility classification:
Class         TP Rate   FP Rate   Prec.   Recall   F1
A (“true”)    0.825     0.108     0.874   0.825    0.849
B (“false”)   0.892     0.175     0.849   0.892    0.870
W. Avg.       0.860     0.143     0.861   0.860    0.860
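The columns of this table are related to each other, which can be checked with a few lines of Python. The sketch below recomputes F1 as the harmonic mean of precision and recall, and uses the fact that with only two classes the false-positive rate of one class equals one minus the recall of the other. The function and variable names are assumptions for illustration; the numbers are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall per class, copied from the results table.
table = {
    "A": {"precision": 0.874, "recall": 0.825},
    "B": {"precision": 0.849, "recall": 0.892},
}

f1 = {
    cls: round(f1_score(row["precision"], row["recall"]), 3)
    for cls, row in table.items()
}

# In a two-class problem, every false positive for class A is a missed instance of class B,
# so FP rate(A) = 1 - recall(B), and vice versa.
fp_rate = {
    "A": round(1 - table["B"]["recall"], 3),
    "B": round(1 - table["A"]["recall"], 3),
}
```

Both derived columns agree with the reported values (F1 of 0.849 and 0.87, FP rates of 0.108 and 0.175).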
Feature Level Analysis
The features that contribute most to deciding credibility:
- Whether tweets contain a URL, which is the root of the tree.
- Sentiment-based features, such as the fraction of negative sentiment.
- Low-credibility news is mostly propagated by users who have not written many messages in the past.
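A feature ends up at the root of a decision tree because it gives the best first split. J48 is Weka's implementation of C4.5, which ranks candidate splits by gain ratio (information gain divided by the entropy of the split itself). The sketch below computes that criterion in plain Python. The rows, labels, and feature names are invented toy data, chosen so that the URL feature wins; they are not taken from the paper.

```python
import math
from collections import Counter


def entropy(values) -> float:
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(rows, labels, feature) -> float:
    """Information gain of splitting on `feature`, normalised by the split entropy."""
    total = len(rows)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    # Expected label entropy after the split, weighted by group size.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[feature] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Invented toy data: one row per topic, with two hypothetical boolean features.
rows = [
    {"has_url": True, "many_negative": False},
    {"has_url": True, "many_negative": False},
    {"has_url": True, "many_negative": True},
    {"has_url": False, "many_negative": True},
    {"has_url": False, "many_negative": False},
    {"has_url": False, "many_negative": True},
]
labels = ["credible", "credible", "credible",
          "not credible", "not credible", "not credible"]

# The feature with the highest gain ratio becomes the root of the tree.
root = max(rows[0], key=lambda feature: gain_ratio(rows, labels, feature))
```

On this toy data `has_url` separates the two classes perfectly, so it is picked as the root. The real C4.5 algorithm also handles numeric thresholds, missing values, and pruning, none of which is shown here.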
Interesting Aspect
I like the coding scheme of this paper; it is reasonable and comprehensive. Some of the conclusions drawn in the paper are interesting to look at. For example:
- Among several other features, newsworthy topics tend to include URLs and to have deep propagation trees.
- Among several other features, credible news is propagated through authors that have previously written a large number of messages, originates at a single or a few users in the network, and has many re-posts.
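The propagation properties mentioned in these conclusions (tree depth, number of originating users, number of re-posts) can be derived from a simple parent map. The following Python sketch assumes a hypothetical `retweet_of` dictionary that maps each tweet id to the id of the tweet it re-posts, with `None` for an initial tweet. The function name and the example cascade are made up for illustration, and the sketch assumes every parent id is present in the map and that there are no cycles.

```python
from collections import defaultdict


def propagation_stats(retweet_of: dict) -> dict:
    """Summarise a re-tweet cascade given a map of tweet id -> re-posted tweet id."""
    children = defaultdict(list)
    roots = []
    for tweet_id, parent in retweet_of.items():
        if parent is None:
            roots.append(tweet_id)  # an initial tweet starts its own tree
        else:
            children[parent].append(tweet_id)

    def depth(node) -> int:
        # A tweet that nobody re-posted has depth 0.
        return max((1 + depth(child) for child in children[node]), default=0)

    return {
        "initial_tweets": len(roots),
        "reposts": len(retweet_of) - len(roots),
        "max_depth": max((depth(r) for r in roots), default=0),
    }


# Invented cascade: t1 is re-posted by t2 and t3, t2 by t4, t4 by t5; t6 is never re-posted.
cascade = {"t1": None, "t2": "t1", "t3": "t1", "t4": "t2", "t5": "t4", "t6": None}
stats = propagation_stats(cascade)
```

For this cascade there are 2 initial tweets, 4 re-posts, and a maximum depth of 3. The recursive `depth` helper is fine for small examples; a very deep cascade would call for an iterative traversal instead.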
Related Papers
- A Statistical Model for Popular Events Tracking in Social Communities. Lin et al, KDD 2011 This paper presents a method to observe and track popular events or topics that evolve over time in communities.
- A study on retrospective and online event detection. Yang et al, SIGIR 98 This paper addresses the problems of detecting events in news stories.
- Temporal and information flow based event detection from social text streams. Zhao et al, AAAI 07 This paper addresses the problems of detecting events in news stories.
- Automatic Detection and Classification of Social Events. Agarwal and Rambow, ACL 10 This paper aims at detecting and classifying social events using Tree kernels.
- Detecting controversial events from Twitter. Popescu and Pennacchiotti, CIKM 10 This paper addresses the task of identifying controversial events using Twitter as a starting point.