Difference between revisions of "Miller et al ICWSM 2011"
Latest revision as of 07:49, 27 September 2012
This is a Paper reviewed for Social Media Analysis 10-802 in Fall 2012.
Citation
author = {Mahalia Miller and Conal Sathi and Daniel Wiesenthal and Jure Leskovec and Christopher Potts}, title = {Sentiment Flow Through Hyperlink Networks}, booktitle = {ICWSM}, year = {2011}, ee = {http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2883}, crossref = {DBLP:conf/icwsm/2011}, bibsource = {DBLP, http://dblp.uni-trier.de}
Online Version
Sentiment Flow Through Hyperlink Networks
Main Idea
This paper combines sentiment analysis of text with graph analysis of networks to study how sentiment flows through a network of blog posts connected by hyperlinks. The work analyzes a large hyperlinked network of blog posts to explore how the sentiment features of a post affect the posts connected to it and the structure of the network. It investigates questions that lie at the intersection of sentiment analysis and graph analysis. The sentiment of a blog post is affected not only by the sentiment of its immediate parent, but also by the placement of the post within a cascade and by the properties of that cascade.
Dataset
The data was obtained from the MemeTracker Project for the month of August 2010. The dataset consists of roughly 1 million blog posts per day. Each post consists of a URL, a time stamp, the full text of the post, and the list of URLs of the posts it cites. The data has been pruned to remove singleton posts (posts which do not link to any other posts). Self links and links to posts outside the dataset have been removed in order to focus on the flow of sentiment within the network. The resulting dataset has approximately 8 million blog posts and 15 million hyperlink edges.
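The pruning step can be sketched roughly as follows. This is only an illustration, not the authors' code: the input layout (a dict from post URL to the list of cited URLs) and the function name `prune_posts` are hypothetical, a "self link" is assumed to mean a post citing its own URL, and a singleton is assumed to be a post left with no incoming or outgoing links after filtering.

```python
def prune_posts(posts):
    """Return (connected_posts, edges) after pruning.

    posts: dict mapping a post URL to the list of URLs it cites
           (a hypothetical layout chosen for this sketch).
    """
    known = set(posts)
    edges = set()
    for url, cited in posts.items():
        for target in cited:
            # Drop self links and links that leave the dataset.
            if target != url and target in known:
                edges.add((url, target))
    # Assumption: a singleton is a post with no remaining links at all.
    connected = {u for u, _ in edges} | {v for _, v in edges}
    return connected, edges
```

For example, a post `"a"` that cites `"b"`, itself, and an unknown URL keeps only the edge `("a", "b")`.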
Methodology
- Sentiment Extraction
Each web document is treated as a bag of words. Harvard Inquirer and SentiWordNet are used to obtain sentiment scores for the individual words in a post. The sentiment score of the entire post is taken to be the average of the sentiment scores of its words. The sentiment attributes of a post are positivity, negativity, and objectivity.
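A minimal sketch of the bag-of-words averaging, under stated assumptions: the three-word lexicon below is made up for illustration (real scores would come from a resource such as SentiWordNet), only words found in the lexicon are averaged, objectivity is taken as 1 minus positivity minus negativity (the SentiWordNet convention), and a post with no lexicon words is treated as fully objective. The paper may handle these details differently.

```python
import re

# Toy lexicon with made-up (positivity, negativity) scores.
TOY_LEXICON = {
    "great": (0.75, 0.0),
    "bad": (0.0, 0.75),
    "table": (0.0, 0.0),
}

def post_sentiment(text, lexicon=TOY_LEXICON):
    """Average word-level scores over the words of a post."""
    words = re.findall(r"[a-z']+", text.lower())
    scored = [lexicon[w] for w in words if w in lexicon]
    if not scored:
        # Assumption: no known words means a fully objective post.
        return {"positivity": 0.0, "negativity": 0.0, "objectivity": 1.0}
    pos = sum(p for p, _ in scored) / len(scored)
    neg = sum(n for _, n in scored) / len(scored)
    return {"positivity": pos, "negativity": neg, "objectivity": 1.0 - pos - neg}
```

Word order is ignored by construction, which is what makes this a bag-of-words model.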
The paper also proposes extracting sentiment from emoticons. The sentiment of an emoticon is assumed to be binary (+1/-1), and emoticon sentiment is assigned to the post directly from the frequency of the emoticons in the post. A simple technique is to treat [:), :D, :P, :p, ;)] as positive and [:(, D:] as negative.
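The emoticon heuristic might look like the sketch below. The emoticon lists are the ones given above; combining them into a single net count (positive occurrences minus negative occurrences) is an assumption of this sketch, since the exact aggregation is not spelled out here. Plain substring counting is also a simplification: it would, for instance, count the "D:" inside "ID: 5" as a negative emoticon.

```python
POSITIVE = (":)", ":D", ":P", ":p", ";)")
NEGATIVE = (":(", "D:")

def emoticon_score(text):
    """Net emoticon count: +1 per positive emoticon, -1 per negative one."""
    pos = sum(text.count(e) for e in POSITIVE)
    neg = sum(text.count(e) for e in NEGATIVE)
    return pos - neg
```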
The authors define the average sentiment of a user as the baseline and then compute the deviation of each individual post from that baseline as the polarity of the post. Each web domain is treated as an author, and the baseline sentiment for a domain is obtained by averaging the sentiment of its individual posts.
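The baseline-and-deviation idea can be illustrated as follows. This is a sketch under assumptions: each post is reduced to a single numeric score, the domain is taken to be the network location of the post URL, and the function name is hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def deviations_from_domain_baseline(post_scores):
    """post_scores: dict mapping post URL -> sentiment score.

    Returns a dict mapping each URL to its score minus the mean
    score of all posts from the same domain.
    """
    by_domain = defaultdict(list)
    for url, score in post_scores.items():
        by_domain[urlparse(url).netloc].append(score)
    # Baseline of a domain = average score of its posts.
    baseline = {d: sum(s) / len(s) for d, s in by_domain.items()}
    return {
        url: score - baseline[urlparse(url).netloc]
        for url, score in post_scores.items()
    }
```

A domain with a single post always gets a deviation of zero under this definition, so the measure is only informative for domains with several posts.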
- Identification of Cascades and its Topology
The data is modeled as a graph. Each node represents a blog post and carries its sentiment score as an attribute. A directed edge from u to v means that post u contains a hyperlink citing v. Nodes with zero out-degree represent posts that start the flow of sentiment and are referred to as cascade initiators. The topology of a cascade is obtained by applying breadth-first search (BFS) from the cascade initiators.
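A rough sketch of cascade extraction, assuming the graph is given as a list of (u, v) pairs meaning "u cites v". Since initiators are the nodes with zero out-degree, the BFS here walks edges in reverse, from a cited post to the posts citing it; that traversal direction and the function name `cascade_depths` are assumptions of this sketch rather than details confirmed by the paper.

```python
from collections import defaultdict, deque

def cascade_depths(edges):
    """Map each cascade initiator to {post: BFS depth from the initiator}."""
    cited_by = defaultdict(list)  # v -> posts that cite v
    nodes, has_outlink = set(), set()
    for u, v in edges:
        cited_by[v].append(u)
        nodes.update((u, v))
        has_outlink.add(u)
    # Initiators cite nothing, i.e. they have zero out-degree.
    initiators = nodes - has_outlink
    cascades = {}
    for root in initiators:
        depth = {root: 0}
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for citing in cited_by[node]:
                if citing not in depth:
                    depth[citing] = depth[node] + 1
                    queue.append(citing)
        cascades[root] = depth
    return cascades
```

The maximum value in each depth map gives the depth of that cascade, which is the quantity the cascade-level analysis distinguishes on (deep versus shallow).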
Findings
The paper explores the flow of sentiment across hyperlink networks. The main findings are as follows.
- Post Level Analysis
- Nodes are strongly influenced by their immediate neighbors.
  - Given an edge from u to v, u is referred to as the parent of v, and v is referred to as the child of u. The analysis in the paper shows that the subjectivity of a child is associated with the subjectivity of its parent: the use of subjective language in the parent post leads to a higher sentiment score in the child post.
- Emoticon tagging provides a rough heuristic for sentiment analysis, but the bag-of-words model gives a much richer understanding of the sentiment.
- Cascade Level Analysis
- Sentiment in deeper cascades exhibits four distinct phases over time.
  - At the cascade initiator, the language is close to the baseline.
  - Positivity and negativity heat up quickly.
  - The sentiment cools off fairly quickly.
  - The sentiment returns to the mild baseline.
- Shallow cascades exhibit mild and short-lived sentiment.
  - Shallow cascades start off with slight sentiment and then die out quickly. One possible explanation is that these posts tend to be relatively tame, and so do not attract the attention of more posters.
To conclude, the position of a post in the cascade topology and the overall depth of the cascade play an important role in determining the sentiment of the post.
Related Work
- Adamic, L. A., and Glance, N. 2005. The political blogosphere and the 2004 U.S. election. In Proceedings of the 3rd international workshop on Link discovery - LinkKDD ’05, 36–43.
- Leskovec, J.; Backstrom, L.; and Kleinberg, J. 2009. Memetracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’09, 497.
- Zafarani, R.; Cole, W.; and Liu, H. 2010. Sentiment propagation in social networks: A case study in LiveJournal. In Advances in Social Computing, volume 6007 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg. 413–420.
- Leskovec, J.; McGlohon, M.; Faloutsos, C.; and Glance, N. 2007. Cascading behavior in large blog graphs: Patterns and a model. In SIAM International Conference on Data Mining.
Study Plan
- Harvard Inquirer and SentiWordNet
  - Software resources for extracting sentiment from words.
- C++ SNAP library
  - Library for the analysis and manipulation of large graphs.
- Complementary Cumulative Distribution Function
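As a study aid for the last item, the empirical complementary cumulative distribution function of a sample can be computed as below. The sketch uses the definition P(X > x); the function name and the example input (which could stand for cascade sizes) are illustrative and not taken from the paper.

```python
from bisect import bisect_right

def empirical_ccdf(values):
    """Return [(x, fraction of values strictly greater than x)] for each distinct x."""
    ordered = sorted(values)
    n = len(ordered)
    result = []
    for x in sorted(set(ordered)):
        # bisect_right gives the number of values <= x.
        greater = n - bisect_right(ordered, x)
        result.append((x, greater / n))
    return result
```

Plotting such pairs on log-log axes is a common way to inspect heavy-tailed quantities in large graphs.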