Comparison Das et al WSDM 2011 and Zhao et al AAAI 2007

 
This is a comparison of two related papers in [[event detection]] and [[temporal information extraction]].
 
== Papers ==
  
The papers are
* Anish Das Sarma, Alpa Jain, and Cong Yu. [[Das_Sarma_et._al.,_Dynamic_Relationship_and_Event_Discovery,_WSDM_2011|Dynamic relationship and event discovery]]. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM), 2011. [http://web.eecs.umich.edu/~congy/work/wsdm11.pdf]
* Qiankun Zhao, Prasenjit Mitra, and Bi Chen. [[Zhao_et_al,_AAAI_07|Temporal and information flow based event detection from social text streams]]. In Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI), volume 2, pages 1501–1506. AAAI Press, 2007. [http://www.purdue.edu/discoverypark/vaccine/assets/pdfs/publications/pdf/Temporal%20and%20Information%20Flow%20Based.pdf]
  
== Summary of Zhao et al ==
 
 
 
Zhao et al present a method for detecting events from social text streams that exploits not only the textual content but also the temporal and social dimensions of the data.
 
Social text streams are represented as multigraphs, where each node denotes an "actor" and each edge represents the flow of information between two actors.
 
First, the authors perform content-based [[UsesMethod::clustering]] using a vector space model (tf-idf weights and cosine similarity) together with a graph-cut-based clustering algorithm. This clustering segments the data into topics.
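To make this step concrete, here is a minimal sketch of content-based topic clustering, assuming scikit-learn is available. The paper's graph-cut algorithm is approximated here by spectral clustering (normalized cuts) over a cosine-similarity graph; the toy messages and the number of clusters are invented for illustration.

<pre>
# Sketch: tf-idf vectors + spectral (normalized-cut-style) clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import SpectralClustering

messages = [
    "merger talks between the two companies",
    "the companies confirmed the merger after talks",
    "the game went into overtime again",
    "fans cheered as the game hit overtime",
]

# Represent each message as a tf-idf vector.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(messages)

# Similarity graph: nodes are messages, edge weights are cosine similarities.
similarity = cosine_similarity(tfidf)

# Cut the graph into topical clusters (two topics assumed for this toy data).
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(similarity)
print(labels)  # e.g. [0 0 1 1]: a business topic and a sports topic
</pre>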
 
 
 
For a given topic, they measure its "intensity" over time using a sliding time window, and segment the resulting time series into intervals using an adaptive time series model.
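A toy sketch of the sliding-window intensity measurement follows; the window size, step, and day offsets are invented, and the paper's adaptive time series segmentation is not reproduced here.

<pre>
def topic_intensity(days, window=7, step=1):
    """Number of messages about one topic in each sliding window of `window` days."""
    start, end = min(days), max(days)
    series = []
    t = start
    while t <= end:
        series.append(sum(1 for d in days if t <= d < t + window))
        t += step
    return series

# Day offsets of messages about one topic: quiet at first, bursty around day 11.
days = [0, 1, 1, 2, 10, 11, 11, 11, 12]
print(topic_intensity(days))  # intensity climbs sharply once the burst begins
</pre>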
 
With the temporal segmentation, each topic is represented as a sequence of social network graphs over time.
 
The weights of the edges between actors in these graphs denote communication intensity, and one can measure the "information flow" between actors for a given topic over time.
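A small sketch of how one interval's social graph could be built, assuming networkx and a list of (sender, recipient) messages for a single topic; the actors and message counts are made up.

<pre>
import networkx as nx

# Messages exchanged within one time interval about one topic (invented).
msgs = [("alice", "bob"), ("alice", "bob"), ("bob", "carol"), ("alice", "bob")]

G = nx.Graph()
for sender, recipient in msgs:
    if G.has_edge(sender, recipient):
        # Repeated communication raises the pair's intensity weight.
        G[sender][recipient]["weight"] += 1
    else:
        G.add_edge(sender, recipient, weight=1)

print(G["alice"]["bob"]["weight"])  # 3: most of this topic's flow is alice-bob
</pre>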
 
  
With the above content, temporal, and information flow data, they extract events by selecting text segments subject to constraints on this information: for instance, an event should come from the same time interval, be about the same topic, and take place mainly among a certain subgroup of social actors.
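As an illustrative sketch of this constraint-based grouping, suppose each message has already been tagged with a topic, a time interval, and its pair of actors; the record format and data below are invented, not the paper's representation.

<pre>
from collections import defaultdict

# Each message carries a topic id, a time-interval id, and its two actors.
messages = [
    {"topic": 0, "interval": 2, "actors": ("alice", "bob")},
    {"topic": 0, "interval": 2, "actors": ("bob", "carol")},
    {"topic": 1, "interval": 5, "actors": ("dave", "erin")},
]

# Candidate events: groups of messages sharing both a topic and an interval.
events = defaultdict(list)
for m in messages:
    events[(m["topic"], m["interval"])].append(m)

for (topic, interval), group in events.items():
    actors = sorted({a for m in group for a in m["actors"]})
    print(f"topic {topic}, interval {interval}: actors {actors}")
</pre>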
== Comparative analysis of both papers ==

At a high level, both papers are interested in discovering events from large amounts of temporal text data. Both leverage user-generated content: Das et al use Wikipedia as their dataset, while Zhao et al use the [[UsesDataset::Enron email corpus]] and [[UsesDataset::Dailykos blogs]].
  
In Das et al, the task is first to discover pairs of entities that are co-bursting in the same time period (a window of one week); co-bursting means that both entities are mentioned significantly more often than during other time periods. The next step is to discover the relationships between such entities. This forms the foundation of an event: an n-ary relationship between entities that are bursty in the same time period. Likewise, Zhao et al's task is to discover events, exploiting both the temporal burstiness of entities and text and the "social" aspect, in which an event is talked about more than usual by "social actors". Zhao et al evaluate on the [[UsesDataset::Enron email corpus]] and [[UsesDataset::Dailykos blogs]] [http://www.dailykos.com/]; 30 events were manually labeled as ground truth in the data by looking for correspondence with real-world news.
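A toy sketch of co-burst detection in this spirit, assuming weekly mention counts per entity; the z-score test and its threshold are illustrative choices, not the papers' exact burstiness definitions.

<pre>
from statistics import mean, stdev

def bursty_weeks(counts, z=1.5):
    """Weeks where an entity's mention count spikes well above its average."""
    mu, sigma = mean(counts), stdev(counts)
    return {i for i, c in enumerate(counts) if sigma > 0 and (c - mu) / sigma > z}

# Weekly mention counts for two entities (invented).
alice = [3, 2, 4, 30, 3, 2, 28, 3]
bob   = [5, 4, 5, 40, 6, 5, 4, 5]

# Co-bursting entities share at least one bursty week.
print(bursty_weeks(alice) & bursty_weeks(bob))  # {3}: both spike in week 3
</pre>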
Method-wise, both papers frame the problem of identifying relationships in terms of graphs. In Das et al, vertices are entities, and an edge describes how much overlap two entities have in the time periods in which they are bursty, so two entities that are mentioned more at the same times have a stronger edge between them. In Zhao et al, vertices are social actors. Unlike in Das et al, social actors are not entities directly involved in an event; they are simply actors who converse (through text) about the event taking place. Edges between social actors are therefore weighted by how intensely a pair of social actors communicates during the time period.
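A sketch of how such edge weights might be computed in Das et al's setting, reducing each entity to its set of bursty weeks (as in the previous sketch); the Jaccard coefficient used here is an illustrative overlap measure, not necessarily the paper's.

<pre>
def overlap_weight(bursty_a, bursty_b):
    """Jaccard overlap between two entities' sets of bursty weeks."""
    if not bursty_a or not bursty_b:
        return 0.0
    return len(bursty_a & bursty_b) / len(bursty_a | bursty_b)

print(overlap_weight({3, 6}, {3}))   # 0.5: the entities burst together in week 3
print(overlap_weight({3, 6}, {10}))  # 0.0: they never burst together
</pre>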
  
Performance is measured using the precision, recall, and F-score of how well events are recovered by the model.
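A worked example of these metrics, with invented counts: suppose a model recovers 20 events, 15 of which match the 30 gold-standard events.

<pre>
tp, predicted, gold = 15, 20, 30

precision = tp / predicted                        # 0.75
recall = tp / gold                                # 0.5
f_score = 2 * precision * recall / (precision + recall)
print(round(f_score, 2))                          # 0.6
</pre>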
In Das et al's approach, events are thus assumed to be associated with two or more public entities, while Zhao et al's events are associated more with the topical nature of the ongoing discussions. The advantage of Das et al's approach is that events are easily interpretable, especially in the context of public news (entertainment news, political news, etc.), which is often about specific public figures or organizations. However, it cannot capture abstract events that have no specific associated entity, say a natural disaster. Zhao et al's approach, on the other hand, can identify such abstract events, but its event topics may not be as easily interpretable.
  
== Discussion ==
Both papers make use of algorithms from time series modeling and graph clustering to solve their respective problems. Zhao et al found that taking the temporal and social dimensions into account increased their F-score significantly, and that integrating these diverse features in a step-wise manner performed better than simply adding them as features in a standard machine learning framework.
 
  
 
== Related papers ==
 
There has been a lot of work on event detection.
* [[RelatedPaper::Lin_et_al_KDD_2011|A Statistical Model for Popular Events Tracking in Social Communities. Lin et al, KDD 2011]] This paper presents a method to observe and track popular events or topics that evolve over time in social communities.
* [[RelatedPaper::Popescu and Pennacchiotti, CIKM 10|Detecting controversial events from Twitter. Popescu and Pennacchiotti, CIKM 10]] This paper addresses the task of identifying controversial events using Twitter as a starting point.
* [[RelatedPaper::Yang et al, SIGIR 98|A study on retrospective and online event detection. Yang et al, SIGIR 98]] This paper addresses the problem of detecting events in news stories.
* [[RelatedPaper::Automatic_Detection_and_Classification_of_Social_Events|Automatic Detection and Classification of Social Events]] This paper aims at detecting and classifying social events using tree kernels.
* [[RelatedPaper::Q. Zhao, P. Mitra, and B. Chen. Temporal and information flow based event detection from social text streams. In AAAI, 2007]]
* [[RelatedPaper::Banko_2007_Open_Information_Extraction_from_the_Web]]
* [[RelatedPaper::Chambers, N. and Jurafsky, D. Template-based information extraction without the templates, ACL 2011]]
  
== Study plan ==

* Article: Adaptive time series model [http://www.siam.org/proceedings/datamining/2007/dm07_059Lemire.pdf]
* Article: Graph cut based clustering (normalized cuts) [http://www.cs.berkeley.edu/~malik/papers/SM-ncut.pdf]

== Questions ==

# How much time did you spend reading the (new, non-wikified) paper you summarized? ''About 35 minutes.''
# How much time did you spend reading the old wikified paper? ''About 35 minutes.''
# How much time did you spend reading the summary of the old paper? ''About 15 minutes.''
# How much time did you spend reading background material? ''About 30 minutes.''
# Was there a study plan for the old paper? ''There wasn't an explicit study plan, but the article did provide a good background of the related papers that would be useful.''
## If so, did you read any of the items suggested by the study plan, and how much time did you spend reading them? ''Yes. I did a quick read of the [[Chambers, N. and Jurafsky, D. Template-based information extraction without the templates, ACL 2011]] paper. It took me about 10 minutes.''
# Give us any additional feedback you might have about this assignment. ''The paper pairings were well chosen (at least for the papers I read). Doing a comparative analysis of two papers enabled me to think more deeply about the different approaches to the same or similar problem and to identify the pros, cons, and assumptions of each.''
