Difference between revisions of "Project Ideas - Derry, Reyyan"

From Cohen Courses

Latest revision as of 23:50, 31 January 2011

Social Media Analysis Project Ideas

Team Members

Derry Wijaya [dwijaya@cs.cmu.edu]

Reyyan Yeniterzi [reyyan@cs.cmu.edu]

Project Ideas

We have several possible ideas for the project:

• We propose to map events to opinions. An event, social or political in nature, may bring about a change in opinion, or a change in opinion may bring about an event.

• We propose to analyze opinions from the perspective of associative sorting and social contagion, for example to answer the question of when an opinion gets pushed aside, i.e. the centrality and periphery of opinions in the opinion graph.

• We propose to construct a social graph with words, rather than people, as nodes. Using this social graph of words, we propose to analyze:

(1) how co-occurrence with other words (associativity with other words) can influence the meaning of a word (for example, the word 'BP' frequently co-occurred with negative words during and after the Gulf spill),

(2) how new words emerge in the graph (like ‘Google’), or a new part of speech (like 'googling'),

(3) how the meaning and usage of words like “LOL” change over time, from “laughing out loud” to “whatever”.

• We propose to automatically create a social graph of opinions from tweets, where nodes are people, links are follower/following relations, and colors are attributes (positive or negative towards the entity we are interested in, such as ‘Toyota’ or ‘Ford’).
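
As a rough illustration of the word-graph idea, the sketch below builds a co-occurrence graph from a handful of toy sentences and measures how much of a word's co-occurrence weight falls on negative words. The function names, the toy documents, and the negative-word list are all assumptions made for illustration; a real study would need proper tokenization and a sentiment lexicon.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(documents):
    """Return a Counter mapping unordered word pairs to co-occurrence counts."""
    edges = Counter()
    for doc in documents:
        # Naive whitespace tokenization; each word counts once per document.
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            edges[(a, b)] += 1
    return edges


def negative_association(edges, word, negative_words):
    """Fraction of a word's co-occurrence weight that falls on negative words."""
    total = negative = 0
    for (a, b), count in edges.items():
        if word in (a, b):
            other = b if a == word else a
            total += count
            if other in negative_words:
                negative += count
    return negative / total if total else 0.0


# Toy data, invented for this sketch only.
docs = ["bp spill disaster", "bp oil spill", "bp profit"]
edges = cooccurrence_graph(docs)
score = negative_association(edges, "bp", {"spill", "disaster"})
print(score)
```

Computing this score separately for documents from successive time windows would give one simple way to track whether a word drifts towards negative company.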

Dataset

For each of the ideas above, we propose to use (in order of the ideas):

• The TREC Blog Track dataset: the Blog08 corpus and the TRC2 (news) corpus, which cover the same time span.

• Yano & Smith blog dataset, politics.com dataset, or U.S. Floor debates dataset.

• News data such as the TRC2 (news) corpus.

• Twitter data (perhaps GeoText data [http://www.ark.cs.cmu.edu/GeoText/]).

Motivation

For each of the ideas above, our motivations are (in order of the ideas):

• It will be interesting to find out how an event reported in a news article can change a blogger's opinion on the related topic. How often do bloggers start writing about a topic for the first time after reading about a related event in the news?

• It will be interesting to find out whether centrality and betweenness apply to a graph of opinions. A graph can be constructed where each node is a piece of opinion and the edges are similarities between the opinions. Can we then find which opinions in the graph are the ringleaders? Are there neutral or indecisive opinions that act as go-betweens for different groups of opinions? How cohesive are the groups of opinions? How does the graph change over time? Is there spatial segregation in the graph, where minority opinions are pushed to the periphery?

• It will be interesting to find out whether homophily occurs in words. If a word starts to 'hang out' (tend to co-occur) with negatively associated words, will its semantics and usage become negative (social contagion)? Do negative words tend to co-occur (associative sorting)? How do the semantics of a word change depending on its neighbors (i.e. co-occurring words)?

• It will be interesting to do opinion mining on Twitter data, to find out whether follower/following links influence the spread of opinions on Twitter, or whether people from the same geographic location tend to hold the same opinions. It will also be interesting to find out whether we can predict that one person will follow, or be followed by, another based on the similarity of their follower/following links, the similarity of their opinions, the temporal coincidence of those opinions, and geographic coincidence: i.e. whether two persons who share a number a of followers, follow b of the same people, have degree c of opinion similarity, voice their opinions within d days of each other, and are located within geographical distance e of each other are likely to follow or be followed by one another.
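
To make the opinion-graph questions more concrete, here is a minimal sketch of betweenness centrality (Brandes' algorithm) on a small unweighted graph. The node names and edges are invented for illustration, and the step of turning opinion similarities into edges (for example by thresholding) is an assumption left outside the sketch.

```python
from collections import deque


def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph.

    `graph` maps each node to the set of its neighbors.
    """
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in score.items()}


# Hypothetical opinion graph: two camps joined only by a neutral opinion.
opinions = {
    "pro1": {"pro2", "neutral"},
    "pro2": {"pro1", "neutral"},
    "neutral": {"pro1", "pro2", "con1", "con2"},
    "con1": {"neutral", "con2"},
    "con2": {"neutral", "con1"},
}
scores = betweenness(opinions)
print(scores)
```

In this toy graph the neutral opinion lies on every shortest path between the two camps, which is the kind of go-between role the questions above ask about.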

Techniques

For each of the ideas above, proposed techniques or related papers are (in order of the ideas):

• Clustering of opinions. Finding when a group of opinions breaks into two over time (to detect the time t at which a change in opinion occurs, followed by the growth of another opinion cluster). Topic modeling of news documents to pinpoint the particular event at time t that may have caused the change. Related recent paper: Identifying Breakpoints in Public Opinion [http://upinion.cse.buffalo.edu/beta/SOMApaper.pdf].

• Using centrality and betweenness measures from social network analysis, but applied to a network of opinions (related paper: Betweenness Centrality as an Indicator of the Interdisciplinarity of Scientific Journals [http://onlinelibrary.wiley.com/doi/10.1002/asi.20614/pdf]). Random walks on the graph to find ringleaders and clusters of opinions. Schelling segregation to measure spatial segregation (we first need to define what 'space' means in the graph of opinions). A related paper on segregation in graphs is The Collective Dynamics of Smoking in a Large Social Network [http://www.nejm.org/doi/pdf/10.1056/NEJMsa0706154].

• Regression analysis to measure the tendency of a word to become negative in meaning over time when it co-occurs with negative words (related paper: The Spread of Obesity in a Large Social Network over 32 Years [http://www.nejm.org/doi/pdf/10.1056/NEJMsa066082], applied here to measuring the spread of negativity in a network of words).

• Using Bayes' rule to estimate the probability that two people have a link on Twitter, based on their friends' links, their opinions, and their spatial-temporal overlap. This relates to a recent paper, Inferring social ties from geographic coincidences [http://www.pnas.org/content/early/2010/12/02/1006155107.full.pdf].
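
The last technique could be sketched roughly as below: Bayes' rule applied to binary features of a pair of users, with the strong simplifying assumption that the features are conditionally independent given whether a link exists (a naive Bayes model). The feature names, the toy training pairs, and the Laplace smoothing are assumptions for illustration, not part of the proposal.

```python
def estimate(pairs, feature, linked):
    """Laplace-smoothed estimate of P(feature is True | link == linked)."""
    subset = [p for p in pairs if p["link"] == linked]
    hits = sum(1 for p in subset if p[feature])
    return (hits + 1) / (len(subset) + 2)


def link_probability(pairs, observed):
    """P(link | observed features) by Bayes' rule, assuming independent features."""
    n_linked = sum(1 for p in pairs if p["link"])
    prior_link = (n_linked + 1) / (len(pairs) + 2)
    score_link, score_no_link = prior_link, 1 - prior_link
    for feature, value in observed.items():
        p_yes = estimate(pairs, feature, True)
        p_no = estimate(pairs, feature, False)
        score_link *= p_yes if value else 1 - p_yes
        score_no_link *= p_no if value else 1 - p_no
    # Normalize over the two hypotheses: link and no link.
    return score_link / (score_link + score_no_link)


# Hypothetical labeled user pairs, invented for this sketch.
pairs = [
    {"shared_followers": True, "similar_opinion": True, "nearby": True, "link": True},
    {"shared_followers": True, "similar_opinion": False, "nearby": True, "link": True},
    {"shared_followers": False, "similar_opinion": False, "nearby": False, "link": False},
    {"shared_followers": False, "similar_opinion": True, "nearby": False, "link": False},
]
p_close = link_probability(
    pairs, {"shared_followers": True, "similar_opinion": True, "nearby": True}
)
p_far = link_probability(
    pairs, {"shared_followers": False, "similar_opinion": False, "nearby": False}
)
print(p_close, p_far)
```

Real-valued quantities such as the number of shared followers or the distance between two users would need to be binned or modeled with a continuous distribution before they fit this form.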

Evaluation

We may use a combination of manual evaluation and cross-validation (splitting the data into training and test sets, then evaluating on the held-out portion).
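
A minimal sketch of k-fold cross-validation follows, assuming k is no larger than the number of examples. The majority-class baseline, the labels, and the function names are placeholders invented for illustration; any model with a train and a predict step could be plugged in.

```python
from collections import Counter


def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of k folds (requires k <= n_items)."""
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]  # every k-th item, starting at `fold`
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


def cross_validate(examples, labels, k, train_fn, predict_fn):
    """Return the accuracy averaged over k folds."""
    accuracies = []
    for train, test in k_fold_splits(len(examples), k):
        model = train_fn([examples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict_fn(model, examples[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Placeholder model: always predict the most common training label.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]


def predict_majority(model, x):
    return model


examples = list(range(6))
labels = ["pos", "pos", "pos", "pos", "neg", "neg"]
accuracy = cross_validate(examples, labels, 3, train_majority, predict_majority)
print(accuracy)
```

For data ordered by time or by author, shuffling or grouping before splitting may be needed so that the test folds are not trivially similar to the training folds.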

Superpowers

• Nothing really at the moment, except for a bag full of ideas and a lot of keenness to pursue at least one of them well.