Difference between revisions of "Project 2nd draft Derry Reyyan"

From Cohen Courses
(Created page with 'Social Media Analysis Project Ideas == Team Members == Derry Wijaya [dwijaya@cs.cmu.edu] Reyyan Yeniterzi [reyyan@cs.cmu.edu] == Project Idea…')
 
 
(34 intermediate revisions by the same user not shown)
Line 11: Line 11:
 
'''Understanding change'''
 
'''Understanding change'''
  
Given an entity of interest, we would like to model and analyze its change (in terms of words and phrases that co-occur with it) over time.  
+
Given an entity of interest, we would like to model and analyze its change in terms of words and phrases that co-occur with it over time. In other words, we would like to understand the change in [[AddressesProblem::Social Network Attribute]] over time, where the social network is defined over words.  
  
We propose to construct a social graph, but instead of people, we put words as nodes and edges are weighted based on number of co-occurrence between the words. Using this social graph of words, we propose to analyze:  
+
We propose to construct a social network, but instead of people, we put words as nodes and edges are weighted based on the number of co-occurrences between the words. Using this social network of words, we propose to analyze:  
  
How co-occurrence with other words influences the meaning or the sentiment associated with the word. For example, the word 'BP' frequently co-occurred with negatively associated words during and after the Gulf-spill event.
+
(1) how co-occurrence with other words change over time
 +
 
 +
(2) how the change influences the state (semantic or sentiment) associated with the entity
 +
 
 +
(3) how the change may correspond to events that occur during the same period of time
 +
 
 +
For example, the entity 'BP' frequently co-occurred with negatively associated words during and after the Gulf-spill event.
  
 
== Dataset ==
 
== Dataset ==
Line 23: Line 29:
 
== Motivation ==
 
== Motivation ==
  
For each of the ideas above, our motivations are (in order of the ideas):
+
The co-occurrence of words changes over time
 +
 
 +
[[File:Obama.png]]
 +
 
 +
It will be interesting to model this change and analyze:
 +
 
 +
(1) How the state (semantic or sentiment) of a given entity changes over time depending on its neighbors (i.e. co-occurring words/phrases)
 +
 
 +
(2) How such changes relate to events that occur in the same period of time
 +
 
 +
(3) Whether we can find a natural sequence of events that define a change of state (semantic or sentiment) of a given entity
 +
 
 +
(4) Whether we can use (3) to predict the change of state of a given entity
 +
 
 +
== Techniques and Related Works ==
  
• It will be interesting to find out how an event reported in a news article can change a blogger's opinion on the related topic. How often bloggers start writing about a topic for the first time after reading about a related event in the news?
+
(1) [[UsesMethod::Linear regression]] analysis to measure tendency of a word to become negative/positive in meaning over time, when co-occurred with negative/positive words
  
• It will be interesting to find out whether centrality and betweenness apply to a graph of opinions. A graph can be constructed where each node is a piece of opinion and the edges are similarities between the opinions. Can we then find in the graph, which opinion(s) is(are) the ringleaders? Are there neutral or indecisive opinions that act as go-between between different groups of opinions? How cohesive are the groups of opinions? How does the graph change overtime? Are there spatial segregation in the graph (where minority opinions) are pushed to the periphery of the graph?
+
Related paper: [[RelatedPaper::Nicholas A. Christakis, M.D., Ph.D., M.P.H., and James H. Fowler, Ph.D. (2007) The Spread of Obesity in a Large Social Network over 32 Years]]: [http://www.nejm.org/doi/pdf/10.1056/NEJMsa066082 External Link] - techniques to analyze the spread of obesity in a network of people
  
• It will be interesting to find out whether homophily occurs in words. If a word starts to 'hang out' (tend to co-occur) with negatively associated words, will its semantic and usage become negative? (social contagion) Do negative words tend to co-occur together? (associative sorting). How does the semantic of a word change depending on its neighbor (i.e. co-occurring words)?
+
(2) Identify break points in the states of an entity (based on its co-occurrence changes) and find events that correspond to the break points
  
• It will be interesting to do opinion mining on Twitter data, to find out whether follower/following links have an influence in the spread of opinions in Twitter; or if people from the same Geo-location will tend to have the same opinions. Another interesting thing is to find out whether we can predict whether a person will become a follower of/be followed by another person based on similarity of their follower/following links, similarity of opinions, temporal-coincidence of the opinions, and geographic coincidence: i.e. whether two persons with '''''a''''' similar followers, who follow '''''b''''' similar people, who has '''''c''''' degree of opinion similarity, who voice their opinions within '''''d''''' days of each other, and who are located in '''''e''''' geographical distance apart are likely to follow/be-followed by one another?
+
Related paper: [[RelatedPaper::Akcora et.al. (2010) Identifying Breakpoints in Public Opinion]]: [http://snap.stanford.edu/soma2010/papers/soma2010_9.pdf External Link] - technique to identify break points in sentiments found in tweets (Twitter), using a set of manually constructed emotion words ([[UsesMethod::Vector space models]])
  
== Techniques ==
+
Related paper: [[RelatedPaper::Michel et.al. (2010) Quantitative Analysis of Culture Using Millions of Digitized Books]]: [http://www.sciencemag.org/content/early/2010/12/15/science.1199644 External Link] - measure usage frequency over time of a given n-gram (such as "slavery", "great war", etc) that represents an entity of interest
  
For each of the ideas above, proposed techniques or related papers are (in order of the ideas):
+
(3) Graph modeling of co-occurrences (where nodes are words and edges are weighted by the number of co-occurrences between words). Use of techniques from dynamic network evolution, link analysis or [[UsesMethod::clustering]] to model and analyze changes in this graph over time
  
Clustering of opinions. Finding when a group of opinions break into two in time (to detect the time '''''t''''' where a change in opinion occurs, followed by the grow of another group of opinion cluster). Topic modeling of news document to pinpoint the particular event at that time '''''t''''' that may cause the change. Related recent paper: [http://upinion.cse.buffalo.edu/beta/SOMApaper.pdf Identifying Breakpoints in Public Opinion].
+
Related paper: [[RelatedPaper::Cemal Cagatay Bilgin and Bülent Yener (2010) Dynamic Network Evolution: Models, Clustering, Anomaly Detection]]: [http://www.cs.rpi.edu/research/pdf/08-08.pdf External Link]
  
• Using centrality and betweenness measures in social network analysis, but applied to a network of opinions (Related paper: [http://onlinelibrary.wiley.com/doi/10.1002/asi.20614/pdf Betweenness Centrality as an Indicator of the Interdisciplinarity of Scientific Journals]). Random walk on the graph to find ring leaders and clusters of opinions. Schelling segregation to measure spatial segregation (we first need to define what 'space' means in the graph of opinions). A related paper to segregation in graph is [http://www.nejm.org/doi/pdf/10.1056/NEJMsa0706154 The Collective Dynamics of Smoking in a Large Social Network].
+
(4) For baseline, we plan to use [[UsesMethod::Bayes' Law]] to measure probability that a word will co-occur with an entity, given the entity and other words that have co-occurred with the entity:
  
• Regression analysis to measure tendency of a word to become negative in meaning over time, when co-occurred with negative words (Related paper: [http://www.nejm.org/doi/pdf/10.1056/NEJMsa066082 The Spread of Obesity in a Large Social Network over 32 Years] - applied to measuring the spread of negativity in a network of words).  
+
:<math>p(word_i \vert entity, word_1,\dots,word_{i-1}) = \frac{p(word_i) \ p(entity,word_1,\dots,word_{i-1}\vert word_i)}{p(entity,word_1,\dots,word_{i-1})}. \,</math>
  
Using Bayes rule to measure probability of two people having a link in Twitter based on their friends links and opinions and spatial-temporal overlap. An interesting relation to a recent paper [http://www.pnas.org/content/early/2010/12/02/1006155107.full.pdf Inferring social ties from geographic coincidences].
+
Using [[UsesMethod::Naive Bayes]] conditional independence assumption,
  
== Evaluation ==
+
:<math>p(word_i \vert entity,word_1,\dots,word_{i-1}) = \frac{1}{Z}  p(word_i) p(entity|word_i) \prod_{j=1}^{i-1} p(word_j \vert word_i)</math>
  
A combination of manual evaluation and cross validation (splitting the data into training and testing and evaluate) may be done.  
+
where <math>Z</math> (the evidence) is the normalization factor. We then need to model how this probability changes over time.
  
== Superpowers ==
+
== Evaluation ==
  
• Nothing really at the moment, except for a bag full of ideas and a lot of keenness in pursuing at least one of them well.
+
Our project will be mainly quantitative analysis in nature.

Latest revision as of 20:25, 14 February 2011

Social Media Analysis Project Ideas

Team Members

Derry Wijaya [dwijaya@cs.cmu.edu]

Reyyan Yeniterzi [reyyan@cs.cmu.edu]

Project Idea

Understanding change

Given an entity of interest, we would like to model and analyze its change in terms of words and phrases that co-occur with it over time. In other words, we would like to understand the change in Social Network Attribute over time, where the social network is defined over words.

We propose to construct a social network in which the nodes are words rather than people, and the edges are weighted by the number of co-occurrences between the words. Using this social network of words, we propose to analyze:

(1) how co-occurrence with other words changes over time

(2) how the change influences the state (semantic or sentiment) associated with the entity

(3) how the change may correspond to events that occur during the same period of time

For example, the entity 'BP' frequently co-occurred with negatively associated words during and after the Gulf-spill event.

Dataset

Google Books Ngram Data.

Motivation

The co-occurrence of words changes over time

Obama.png

It will be interesting to model this change and analyze:

(1) How the state (semantic or sentiment) of a given entity changes over time depending on its neighbors (i.e. co-occurring words/phrases)

(2) How such changes relate to events that occur in the same period of time

(3) Whether we can find a natural sequence of events that defines a change of state (semantic or sentiment) of a given entity

(4) Whether we can use (3) to predict the change of state of a given entity

Techniques and Related Works

(1) Linear regression analysis to measure the tendency of a word to become negative/positive in meaning over time when it co-occurs with negative/positive words

Related paper: Nicholas A. Christakis, M.D., Ph.D., M.P.H., and James H. Fowler, Ph.D. (2007) The Spread of Obesity in a Large Social Network over 32 Years: http://www.nejm.org/doi/pdf/10.1056/NEJMsa066082 - techniques to analyze the spread of obesity in a network of people
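
A minimal sketch of the regression idea in (1), assuming a hypothetical per-year series that gives the share of an entity's co-occurring words that are negative. The numbers are invented for illustration and do not come from the dataset; the function name is also an assumption. The sign of the least-squares slope indicates whether the entity is drifting toward negative company.

```python
def ols_slope_intercept(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: share of negative words among an entity's neighbors.
years = [2006, 2007, 2008, 2009, 2010]
negative_share = [0.10, 0.12, 0.11, 0.15, 0.40]

slope, intercept = ols_slope_intercept(years, negative_share)
print(f"change in negative share per year: {slope:+.3f}")
```

A positive slope suggests the entity increasingly co-occurs with negative words; a real analysis would also need a significance test on the slope.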

(2) Identify break points in the states of an entity (based on its co-occurrence changes) and find events that correspond to the break points

Related paper: Akcora et al. (2010) Identifying Breakpoints in Public Opinion: http://snap.stanford.edu/soma2010/papers/soma2010_9.pdf - technique to identify break points in sentiments found in tweets (Twitter), using a set of manually constructed emotion words (Vector space models)

Related paper: Michel et al. (2010) Quantitative Analysis of Culture Using Millions of Digitized Books: http://www.sciencemag.org/content/early/2010/12/15/science.1199644 - measures usage frequency over time of a given n-gram (such as "slavery", "great war", etc.) that represents an entity of interest
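
One simple way to locate a break point as in (2) is sketched below. It is not the method of the related papers: it assumes a hypothetical one-dimensional series of state scores for an entity and picks the single split that minimizes the squared error when each side is summarized by its mean. The scores and the function name are invented for illustration.

```python
def best_break_point(series):
    """Return (index, cost) for the best single split of the series.

    series[:index] and series[index:] are each summarized by their mean;
    cost is the total squared error of that two-segment fit.
    """
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_index, best_cost = None, float("inf")
    for index in range(1, len(series)):  # both segments stay non-empty
        cost = sse(series[:index]) + sse(series[index:])
        if cost < best_cost:
            best_index, best_cost = index, cost
    return best_index, best_cost


# Hypothetical monthly negativity scores for one entity.
scores = [0.1, 0.2, 0.1, 0.2, 0.8, 0.9, 0.8, 0.9]
index, cost = best_break_point(scores)
print("state changes at position", index)
```

The detected position can then be compared against the dates of known events in the same period.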

(3) Graph modeling of co-occurrences (where nodes are words and edges are weighted by the number of co-occurrences between words). Use of techniques from dynamic network evolution, link analysis or clustering to model and analyze changes in this graph over time

Related paper: Cemal Cagatay Bilgin and Bülent Yener (2010) Dynamic Network Evolution: Models, Clustering, Anomaly Detection: http://www.cs.rpi.edu/research/pdf/08-08.pdf
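
The graph model in (3) can be sketched as follows. The sketch assumes co-occurrence means "appears in the same short text" and uses two invented toy snapshots; with the n-gram data, the counts would instead come from the n-gram frequencies. Change over time is measured here, as one possible choice, by the cosine similarity between an entity's neighbor weights in two snapshots. All function names are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_graph(documents):
    """Edge (a, b) with a < b -> number of documents containing both words."""
    edges = Counter()
    for text in documents:
        words = sorted(set(text.lower().split()))
        for a, b in combinations(words, 2):
            edges[(a, b)] += 1
    return edges


def neighbor_weights(edges, entity):
    """Weights of the edges that connect `entity` to each of its neighbors."""
    weights = Counter()
    for (a, b), count in edges.items():
        if a == entity:
            weights[b] += count
        elif b == entity:
            weights[a] += count
    return weights


def cosine_similarity(u, v):
    """Cosine similarity of two sparse weight vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical snapshots of text mentioning the entity in two periods.
before = ["bp oil profit", "bp energy profit"]
after = ["bp oil spill", "bp spill disaster"]

g_before = cooccurrence_graph(before)
g_after = cooccurrence_graph(after)
sim = cosine_similarity(neighbor_weights(g_before, "bp"),
                        neighbor_weights(g_after, "bp"))
print(f"neighborhood similarity across periods: {sim:.3f}")
```

A low similarity between consecutive snapshots flags a period in which the entity's neighborhood changed sharply.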

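(4) As a baseline, we plan to use Bayes' Law to estimate the probability that a word will co-occur with an entity, given the entity and the other words that have co-occurred with it:

<math>p(word_i \vert entity, word_1,\dots,word_{i-1}) = \frac{p(word_i) \ p(entity,word_1,\dots,word_{i-1}\vert word_i)}{p(entity,word_1,\dots,word_{i-1})}</math>

Using the Naive Bayes conditional independence assumption,

<math>p(word_i \vert entity,word_1,\dots,word_{i-1}) = \frac{1}{Z} \ p(word_i) \ p(entity \vert word_i) \prod_{j=1}^{i-1} p(word_j \vert word_i)</math>

where <math>Z</math> (the evidence) is the normalization factor. We then need to model how this probability changes over time.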
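
A minimal sketch of the Naive Bayes baseline in (4), which scores each candidate word by p(word) times p(entity | word) times the product of p(context word | word). Several choices here are assumptions rather than part of the proposal: probabilities are estimated from document-level presence counts on an invented toy corpus, add-one smoothing (`alpha`) is used, and the normalization runs over the supplied candidate set only rather than the full vocabulary.

```python
from collections import Counter
from itertools import permutations


def train(documents):
    """Count word presence and ordered pair co-presence per document."""
    word_count, pair_count = Counter(), Counter()
    doc_sets = [set(text.lower().split()) for text in documents]
    for words in doc_sets:
        word_count.update(words)
        pair_count.update(permutations(words, 2))  # ordered pairs (x, w)
    return word_count, pair_count, len(doc_sets)


def posterior(candidates, entity, context, model, alpha=1.0):
    """Normalized Naive Bayes score of each candidate word."""
    word_count, pair_count, n_docs = model

    def p_given(x, w):
        # Smoothed estimate of p(x present | w present).
        return (pair_count[(x, w)] + alpha) / (word_count[w] + 2 * alpha)

    scores = {}
    for w in candidates:
        score = word_count[w] / n_docs       # p(word_i)
        score *= p_given(entity, w)          # p(entity | word_i)
        for c in context:
            score *= p_given(c, w)           # p(word_j | word_i)
        scores[w] = score
    z = sum(scores.values())                 # normalization over candidates
    if z == 0:
        raise ValueError("no candidate word occurs in the training data")
    return {w: s / z for w, s in scores.items()}


# Hypothetical toy corpus.
docs = ["bp oil spill disaster", "bp spill cleanup",
        "bp profit growth", "shell profit growth"]
model = train(docs)
result = posterior(["disaster", "growth"], "bp", ["spill"], model)
print(result)
```

To model change over time, the same estimate could be computed separately on the data of each time period and the resulting probabilities compared across periods.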

Evaluation

Our project will be mainly quantitative in nature.