= Controversial event detection =
== Comments ==

This is a neat idea. The main difficulty I see here is formalizing the task precisely. What does it mean for an event to be controversial, exactly? Part of the problem is that it's not perfectly clear what an "event" is.

One suggestion would be to look at a topic-modeling approach, eg [http://dl.acm.org/citation.cfm?id=1150450 topics over time], to find topics with a short temporal span in social-media data. You might be able to combine this with sentiment around those topics in two different communities - eg using something like my [http://www.cs.cmu.edu/~wcohen/postscript/icwsm-2012.pdf MCR-LDA model]. So one way to flesh out this idea would be to start with two topic models:

* MCR-LDA, to measure 'controversy' - you might be able to get predictions from Ramnath on his blog data, if the code's not ready to distribute yet. I would not completely commit to using twitter data exclusively, btw.
* TOT, to detect short-lived 'events' vs long-term topics.

Then write some inference code to combine the predictions and pick out "controversial events". The next stage would be working out a joint model (which you might not choose to do for the project). It's not obvious how you'd evaluate all this, however...maybe do some user labeling of final predictions like "this topic corresponds to a controversial event."

These are just ideas - you might try and flesh out some other concrete idea instead. Good luck! --[[User:Wcohen|Wcohen]] 14:33, 10 October 2012 (UTC)

PS. There is also a one-person team working on a similar topic, you all should talk - it's [[User:Yuchen Tian]] --[[User:Wcohen|Wcohen]] 18:40, 10 October 2012 (UTC)
  
 
== Team members ==

* [[User:Ysim|Yanchuan Sim]]
* [[User:Zhouyu|Zhou Yu]]
* [[User:Tinghuiz|Tinghui Zhou]]
  
 
== Project idea ==

In our project, we propose to jointly detect events and the controversy surrounding them in the context of social media. For example, Christmas Day is an event that receives the most attention around December 25th, while the Presidential debates receive attention only once every four years. Controversy-wise, Christmas Day is relatively one-sided, with most of the text mentioning it being fairly homogeneous. In contrast, the Presidential debates will have obvious sides (supporting the different candidates).

Our goal is not only to detect controversial events, but also to discover what the different sides are - both grouping the individuals associated with each faction and describing how each faction talks about the event differently.

We propose to use a probabilistic graphical model to learn these latent structures from the data without labeled training data.

== Formalizing the task ==

''Event'' - In the context of social media, an event is a period of time in which there is a "surge" in the amount of interest (e.g. blog posts, tweets, comments, etc.) surrounding an occurrence.

We call an event ''controversial'' if, given the text surrounding the event, the discussions are highly non-homogeneous (i.e. they exhibit high entropy). The different sides of the event can be grouped into a small number of distinct ''factions''.

Thus, in our task, given a collection of social media documents over time, we seek to jointly infer the events that have occurred, as well as the controversy associated with them.
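To make these definitions concrete, here is a minimal sketch (our own illustration, not part of the proposed model) of how a surge in interest and the entropy of a faction distribution could be measured, assuming documents have already been bucketed by day and words have already been assigned to factions:
<pre>
import numpy as np

def surge_days(doc_counts_per_day, z=2.0):
    """Flag days whose document count is well above the mean: a crude notion of a 'surge' in interest."""
    counts = np.asarray(doc_counts_per_day, dtype=float)
    return np.where(counts > counts.mean() + z * counts.std())[0]

def faction_entropy(faction_word_counts):
    """Entropy (in nats) of the faction distribution of words around an event.
    Higher entropy means the discussion is split across factions, i.e. more controversial."""
    p = np.asarray(faction_word_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

print(surge_days([5, 4, 6, 5, 7, 5, 30, 6, 5, 4]))  # the spike on day 6 stands out -> [6]
print(faction_entropy([40, 38]))                    # two evenly matched factions: ~log(2) = 0.69
print(faction_entropy([75, 3]))                     # one-sided discussion: much lower (~0.16)
</pre>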
== A probabilistic model ==

Here is a sketch of the model we are considering for our task.
It is a variant of a topic model, in which each word is assumed to be jointly generated by an ''event'' and a ''faction''.
It is also similar to the topics over time model, in that we also generate the timestamps for each document.

''A graphical plate diagram of our model will be up soon.''
=== Notation ===

* <math>E</math> - the fixed number of events
* <math>\theta_d</math> - multinomial distribution over events, specific to document <math>d</math>
* <math>\phi_{e_{di}}</math> - multinomial distribution over factions, specific to event <math>e_{di}</math>
* <math>\psi_{e_{di}}</math> - Beta distribution over timestamps, specific to event <math>e_{di}</math>
* <math>w_{di}</math> - the <math>i</math>th token in document <math>d</math>
* <math>t_{di}</math> - the timestamp associated with the <math>i</math>th token in document <math>d</math>
* <math>\eta^e, \eta^{e,f}, \eta^m</math> - SAGE vectors, which are log-additive weights for each word in the vocabulary; we have one for each event, one for each combination of event and faction, and one background word distribution.
=== Generative story ===

# Draw <math>E</math> multinomials <math>\phi_e</math> from a Dirichlet prior, one for each event <math>e</math>. ''This is the distribution over factions for each event.''
# For each document <math>d</math>, draw a multinomial <math>\theta_d</math> from a prior <math>\alpha</math> (this prior could be Dirichlet or logistic normal); then, for each word <math>w_{di}</math> in document <math>d</math>:
## Draw an event <math>e_{di}</math> from the multinomial <math>\theta_d</math>;
## Draw a faction <math>f_{di}</math> from the multinomial <math>\phi_{e_{di}}</math>;
## Draw a word <math>w_{di}</math> from the SAGE language model <math>p(w_{di} \mid e_{di}, f_{di}, \mathbf{\eta} ) \propto \exp(\eta^{e_{di}}_w + \eta^{e_{di},f_{di}}_w + \eta^m_w)</math>;
## Draw a timestamp <math>t_{di}</math> from the Beta distribution <math>\psi_{e_{di}}</math>.
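As a rough illustration only (this is our own sketch, not released code; all sizes, hyperparameters, and the Poisson document length are made up), ancestral sampling from this generative story, with timestamps normalized to [0, 1], could look like the following:
<pre>
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not values from the project).
V, E, F, D = 1000, 5, 3, 20          # vocabulary, events, factions, documents
alpha, beta = 0.1, 0.1               # symmetric Dirichlet hyperparameters

# SAGE vectors: log-additive deviations plus a background log-frequency vector.
eta_m  = rng.normal(0, 1, V)                 # background word distribution
eta_e  = rng.normal(0, 0.5, (E, V))          # per-event deviations
eta_ef = rng.normal(0, 0.5, (E, F, V))       # per-(event, faction) deviations

phi = rng.dirichlet([beta] * F, size=E)      # faction distribution for each event
psi = rng.uniform(1, 5, (E, 2))              # Beta shape parameters over normalized time, one pair per event

def word_dist(e, f):
    """p(w | e, f, eta) proportional to exp(eta^e_w + eta^{e,f}_w + eta^m_w)."""
    logits = eta_e[e] + eta_ef[e, f] + eta_m
    p = np.exp(logits - logits.max())
    return p / p.sum()

docs = []
for d in range(D):
    theta_d = rng.dirichlet([alpha] * E)     # event distribution for document d
    n_d = rng.poisson(30) + 1                # document length (illustrative)
    tokens = []
    for i in range(n_d):
        e = rng.choice(E, p=theta_d)             # event e_{di}
        f = rng.choice(F, p=phi[e])              # faction f_{di}
        w = rng.choice(V, p=word_dist(e, f))     # word w_{di}
        t = rng.beta(psi[e, 0], psi[e, 1])       # timestamp t_{di} in [0, 1]
        tokens.append((w, t, e, f))
    docs.append(tokens)
</pre>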
=== SAGE language model ===

To model the different effects of events and factions, we use a [[Sparse_Additive_Generative_Models_of_Text|sparse additive generative (SAGE)]] model.
In contrast to the popular Dirichlet-multinomial parameterization for topic modeling, which directly models the lexical probabilities associated with each (latent) topic, SAGE models deviations in log frequency from a background lexical distribution.
Applying a sparsity-inducing prior on the topic term vectors limits the number of terms whose frequencies diverge from the background lexical frequencies, thereby increasing robustness to limited training data.
In the case of our model, it also eliminates the need for a switching variable to choose between event words and faction words.
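As a toy illustration of this parameterization (our own example, with made-up numbers), a sparse deviation vector shifts only a few words away from their background rates:
<pre>
import numpy as np

V = 10
m = np.log(np.full(V, 1.0 / V))   # background log-frequencies (uniform here, for illustration)

eta = np.zeros(V)                 # sparse deviation vector for one event (or event-faction pair)
eta[2] = 2.0                      # word 2 is strongly associated with this event
eta[7] = -1.5                     # word 7 is suppressed

def sage_dist(m, eta):
    """p(w) proportional to exp(m_w + eta_w)."""
    logits = m + eta
    p = np.exp(logits - logits.max())
    return p / p.sum()

background = sage_dist(m, np.zeros(V))
event_dist = sage_dist(m, eta)
# Relative to the background, only words 2 and 7 change their odds; the zero
# entries of eta keep their background relative frequencies.
</pre>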
  
=== Logistic normal prior for events ===

Using a logistic normal prior for events will allow us to incorporate features (such as Twitter hashtags, blog post titles, comment counts, etc.) in a principled manner. Logistic normal priors have been used, for example, [http://www.cs.princeton.edu/~mimno/papers/sampledlgstnorm.pdf here] and [http://delivery.acm.org/10.1145/1630000/1620766/p74-cohen.pdf here].
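A small sketch of what we have in mind (illustrative only; the features, weights, and dimensions here are invented): the document-specific event proportions come from a softmax-transformed Gaussian whose mean is a linear function of document features.
<pre>
import numpy as np

rng = np.random.default_rng(0)

E = 5                                   # number of events
x_d = np.array([1.0, 0.0, 2.3])         # document features, e.g. has-hashtag, has-title, log(comment count)
W = rng.normal(0, 0.1, (x_d.size, E))   # feature-to-event weights (learned in the real model)
Sigma = 0.5 * np.eye(E)                 # covariance of the logistic normal

z = rng.multivariate_normal(x_d @ W, Sigma)   # Gaussian draw in R^E
theta_d = np.exp(z - z.max())
theta_d /= theta_d.sum()                      # softmax: a point on the simplex, p(event | document d)
</pre>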
== Data and evaluation ==

We intend to experiment with two different sets of data:
# A set of tweets collected over 12 weekends (Sep-Dec 2011)
# Posts and comments from political blogs (relating to the presidential elections) in 2012

Over the 12 weekends from Sep-Dec 2011, football games are played every Sunday evening.
Football games present an obvious way for us to evaluate the performance of our model.
Each of these games qualifies as an event with a known time of occurrence.
Additionally, we know that there are at least two factions associated with each game (one set of fans for each team).
One way of identifying factions would be to manually inspect the word vectors associated with the factions, identifying the teams that they support.
Another option is to leverage the location metadata associated with each tweet.
To identify factions with fan bases, we will compute the mean location (expressed as latitude and longitude) for each faction as the weighted average over words that draw from that faction, and then associate it with the geographically closest NFL market (in terms of great-circle distance).
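As a rough sketch of this geolocation heuristic (the market coordinates below are only two hypothetical entries; a real list would cover all NFL markets):
<pre>
import numpy as np

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points, via the haversine formula."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Hypothetical NFL market coordinates (only two shown, for illustration).
markets = {
    "Pittsburgh": (40.44, -79.99),
    "Baltimore": (39.29, -76.61),
}

def faction_location(tweet_coords, faction_word_counts):
    """Mean location of a faction: tweets weighted by how many words they draw from it."""
    weights = np.asarray(faction_word_counts, dtype=float)
    coords = np.asarray(tweet_coords, dtype=float)   # shape (n_tweets, 2) as (lat, lon)
    return (coords * weights[:, None]).sum(0) / weights.sum()

def nearest_market(lat, lon):
    return min(markets, key=lambda m: great_circle_km(lat, lon, *markets[m]))

# Example: two geotagged tweets drawing 3 and 1 words from the faction, respectively.
lat, lon = faction_location([(40.5, -80.0), (39.3, -76.6)], [3, 1])
print(nearest_market(lat, lon))   # -> Pittsburgh
</pre>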
Also, significant events that occurred during this period include the 9/11 anniversary, Halloween, Thanksgiving, and Christmas.
These events should have low entropy in the faction distribution of words within a document, which will serve as a reference for evaluating our model's ability to identify factions.

Blog posts provide substantially more content per document.
Since this is an election year, we hope to use data scraped from political blogs to qualitatively evaluate our model's ability to pick up key election-year events (such as debates, primaries, conventions, and Todd Akin-like controversial remarks).
Also, politics is one of the most contentious subjects, with much discussion and debate, and we hope our model will be able to learn the factions involved from this data.
  
 
== Related work ==

* [[RelatedPaper::Yang et al, SIGIR 98|A study on retrospective and online event detection. Yang et al, SIGIR 98]] This paper addresses the problem of detecting events in news stories. The authors use clustering with a vector space model to group temporally close events together.
* [[RelatedPaper::Zhao et al, AAAI 07|Temporal and information flow based event detection from social text streams. Zhao et al, AAAI 07]] The authors propose a method for detecting events from social text streams that exploits not only the textual content but also the temporal and social dimensions of the data.
* [[RelatedPaper::Automatic_Detection_and_Classification_of_Social_Events|Automatic Detection and Classification of Social Events. Agarwal and Rambow, ACL 10]] This is one of the few works we found relating to controversial events in social media. The authors aim at detecting and classifying social events using tree kernels.
* [[RelatedPaper::Rodriguez et al. KDD 2010|Gomez Rodriguez, M., J. Leskovec, and A. Krause. 2010. Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 1019–1028]]. This paper addresses the problem of inferring the underlying networks of the diffusion process in social networks, which is related to the faction discovery problem we study in this project.
* [[RelatedPaper::Cosley et al 2010|Cosley, D., D. Huttenlocher, J. Kleinberg, X. Lan, and S. Suri. 2010. Sequential Influence Models in Social Networks. In Proc. 4th International Conference on Weblogs and Social Media]]. In this paper the authors study the temporal dynamics of information diffusion in social networks. Their results could give us some insight into the design of our model.
* [[RelatedPaper::Castillo_2011|Information credibility on twitter. Castillo et al, WWW 11]] The authors identify general features of Twitter content that are useful for credibility assessment.
* [[RelatedPaper::Guralnik_99|Event Detection from Time Series Data. Guralnik et al, KDD 99]] The authors develop a general approach to change-point detection that generalizes across a wide range of applications.
* [[RelatedPaper::Allan_1988|On-Line New Event Detection and Tracking. Allan et al, SIGIR 98]] Their approach to detection uses a single-pass clustering algorithm and a novel thresholding model that incorporates the properties of events as a major component.
  
 
== Related materials ==
 
