Difference between revisions of "Das Sarma et. al., Dynamic Relationship and Event Discovery, WSDM 2011"

From Cohen Courses
Jump to navigationJump to search
Line 37: Line 37:
 
== Experimental Result ==
 
== Experimental Result ==
  
The experiment focuses on events about people as entities. A list of entities are obtained by examining the page category for entities involving people in [[Wikipedia:http://www.wikipedia.org/
+
The experiment focuses on events about people as entities. A list of entities are obtained by examining the page category for entities involving people in [[Wikipedia:http://www.wikipedia.org/]].
  
 
Gold-set of events are constructed by examining various web-based sources (Wikipedia, official home pages, news search and web search) to identify all events involving an entity in their sample. For each discovered event, the (1) time period for the event and (2) all other entities involved in the event are recorded.  
 
Gold-set of events are constructed by examining various web-based sources (Wikipedia, official home pages, news search and web search) to identify all events involving an entity in their sample. For each discovered event, the (1) time period for the event and (2) all other entities involved in the event are recorded.  

Revision as of 21:25, 2 November 2011

Reviews of this paper

Citation

Dynamic relationship and event discovery, by A. Das Sarma, A. Jain, C. Yu. In Proceedings of the fourth ACM international conference on Web search and data mining, 2011.

Online version

This paper is available online [1].

Summary

This paper presents a different take on the problem of temporal information extraction. Unlike other works that focus on assigning time to categories or relations that are extracted according to a schema, this paper aims to extract relations that are not predefined by an existing schema, rather relations that are dynamic and temporally defined (i.e. co-bursting relationships).

The paper also identifies temporally constrained events underlying these relationships by consolidating binary dynamic relations between entities into an n-ary relation involving a set of entities connected at a given time period. Two formal notions are introduced for detecting these dynamic events: global temporal constraint cluster and local temporal constraint cluster.

The motivation behind this paper is to extract dynamic events (n-ary relations) that cannot be predefined - for example, the majority of relations associated with news events: boating accident, plane crash landing due to landing gear failure, etc.

Brief description of the method

The challenges specific to the work of extracting dynamic relations and associated events are multi folds:

  • the relations change dynamically thus cannot be predefined with a schema
  • the duration of each relation differs from one relation to another, thus cannot be defined a priori
  • the number of entities involved and their degrees of involvement varies from one relation to another

The paper introduces a pipeline system called DROP that given a set of entities and a time period:

  • discovers pairs of entities that are co-bursting around the same time period: indicating there are dynamic relationships between the entities
The definition of co-bursting depends on the data source. In a query log, two entities are co-bursting if they appear in more than usual number of queries around the same time. In a news data, two entities are co-bursting if they appear close together in a large number of news documents in the given time period.
Peak Detection strategies are used to detect if an entity is peaking (bursting) in a time window. Two entities are co-bursting if they share a consecutive sequence of peaking time windows. The strength of their co-bursting is measured using co-occurrence based strategy: the number of documents in which both entities appear divided by the product of the numbers of documents each entity appears in (i.e. the Pointwise mutual information (PMI)).
  • verifies and clusters these binary dynamic relations into a set of events (n-ary relations) taking into account global and local temporal constraints
Once binary dynamic relations between entities are found, a Pairwise Temporal Graph (PTG) is constructed in which vertices are entities and edges are created between vertices if they co-peak in at least one week. Each edge is annotated with the set of weeks that a dynamic relationship (co-peaking) exists for the two vertices. A dynamic event is then considered as a cluster of vertices and edges in this PTG. Clustering of vertices and edges in this PTG is done while respecting Global Temporal Constraint (GTC) and Local Temporal Constraint (LTC):
  • GTC allows for formation of events (clusters of vertices and edges in the PTG) as long as the weeks in the edges are not separated by more than K weeks. This constraint is a way to represent the prior knowledge that an event may span up to a specific number of K weeks.
  • LTC allows for formation of events (clusters or vertices and edges in the PTG) as long as any two edges sharing the same vertex share a common week. Furthermore, each of the consecutive week within the time period of the cluster must have some edges. This constraint ensures that events are continuous: each consecutive week is supported by some dynamic relationships; and that entities involved have overlapping co-peaking periods.
  • enriches each extracted event with attributes about (1) its entities involvement (which entities are primary participants and which are secondary participants), (2) event confidence (the likelihood of the set of all participating entities being dynamically connected during the particular time interval), (3) description (based on bag of words surrounding mentions of its entities in documents) and (4) popularity (based on the number of documents mentioning keywords in the event description and participating entities; weighted by the relevance of each keyword and the involvement score of each entity).

Experimental Result

The experiment focuses on events about people as entities. A list of entities are obtained by examining the page category for entities involving people in Wikipedia:http://www.wikipedia.org/.

Gold-set of events are constructed by examining various web-based sources (Wikipedia, official home pages, news search and web search) to identify all events involving an entity in their sample. For each discovered event, the (1) time period for the event and (2) all other entities involved in the event are recorded.

on the task of classifying temporal relations between events into before, after, or vague (unknown) relations on the TimeBank Corpus. These are the core relations in the TempEval-07 temporal ordering challenge.

Training and testing are done using 10-fold cross validation and micro-averaged accuracies. The method shows 3.6% absolute increase in the accuracy of before/after classification over the local, pairwise classification model. This improvement is statistically significant (p < 0.000001, Mc-Nemar's test, 2-tailed).

Using time expression normalization to create new relations between time expressions, transitive closure over the original set of temporal relations in the corpus, and additional event-event temporal information from Berthard et al. (2007), the method shows an 81% increase in the number of relations in the corpus to train on.

Discussion

Both the increased connectivity of the corpus and the global inference contributed to the improved performance. Global inference alone on the original set of temporal relations in the corpus shows no improvement over pairwise classification model. This is due to the sparseness of the corpus (since tagging is done manually, the vast majority of possible relations are untagged). Global constraints cannot assist local decisions if the graph is not connected. This highlights the importance of time expression normalization and transitive closure to make the corpus more well connected prior to conducting global inference.

Unfortunately, this paper does not differentiate between improvement in performance contributed by the increased connectivity and the improvement contributed by the global inference. Could the increased performance be contributed solely by the increased connectivity of the graph? There is a need for a separate evaluation on the accuracy of relations inferred using only pairwise classification and transitive closure (without global inference).

Related papers

An interesting follow-up paper is Yoshikawa 2009 jointly identifying temporal relations with markov logic which uses Markov Logic instead of Integer Linear Programming to do a softer (non-deterministic) joint inference.

Another related paper is Denis and Muller, Predicting Globally-Coherent Temporal Structures from Texts via Endpoint Inference and Graph Decomposition, IJCAI 2011 which attempts to classify all types of temporal relations (not just before/after) in TimeBank Corpus by first translating these 13 temporal interval relations to their end points, making the set of constraints much smaller for the Integer Linear Programming to deal with, while preserving the same temporal information.