Das Sarma et. al., Dynamic Relationship and Event Discovery, WSDM 2011

From Cohen Courses
Jump to navigationJump to search

Reviews of this paper

Citation

Dynamic relationship and event discovery, by A. Das Sarma, A. Jain, C. Yu. In Proceedings of the fourth ACM international conference on Web search and data mining, 2011.

Online version

This paper is available online [1].

Summary

This paper presents a different take on the problem of temporal information extraction. Unlike other works that focus on assigning time to categories or relations that are extracted according to a schema, this paper aims to extract relations that are not predefined by an existing schema, rather relations that are dynamic and temporally defined (i.e. co-bursting relationships).

The paper also identifies temporally constrained events underlying these relationships by consolidating binary dynamic relations between entities into an n-ary relation involving a set of entities connected at a given time period. Two formal notions are introduced for detecting these dynamic events: global temporal constraint cluster and local temporal constraint cluster.

The motivation behind this paper is to extract dynamic events (n-ary relations) that cannot be predefined - for example, the majority of relations associated with news events: boating accident, plane crash landing due to landing gear failure, etc.

Brief description of the method

The challenges specific to the work of extracting dynamic relations and associated events are multi folds:

  • the relations change dynamically thus cannot be predefined with a schema
  • the duration of each relation differs from one relation to another, thus cannot be defined a priori
  • the number of entities involved and their degrees of involvement varies from one relation to another


The paper introduces a pipeline system called DROP that given a set of entities and a time period:

  • discovers pairs of entities that are co-bursting around the same time period: indicating there are dynamic relationships between the entities
The definition of co-bursting depends on the data source. In a query log, two entities are co-bursting if they appear in more than usual number of queries around the same time. In a news data, two entities are co-bursting if they appear close together in a large number of news documents in the given time period.
Peak Detection strategies are used to detect if an entity is peaking (bursting) in a time window. Two entities are co-bursting if they share a consecutive sequence of peaking time windows. The strength of their co-bursting is measured using co-occurrence based strategy: the number of documents in which both entities appear divided by the product of the numbers of documents each entity appears in (i.e. the Pointwise mutual information (PMI)).
  • verifies and clusters these binary dynamic relations into a set of events (n-ary relations) taking into account global and local temporal constraints
Once binary dynamic relations between entities are found, a Pairwise Temporal Graph (PTG) is constructed in which vertices are entities and edges are created between vertices if they co-peak in at least one week. Each edge is annotated with the set of weeks that a dynamic relationship (co-peaking) exists for the two vertices. A dynamic event is then considered as a cluster of vertices and edges in this PTG. Clustering of vertices and edges in this PTG is done while respecting Global Temporal Constraint (GTC) and Local Temporal Constraint (LTC):
  • GTC allows for formation of events (clusters of vertices and edges in the PTG) as long as the weeks in the edges are not separated by more than K weeks. This constraint is a way to represent the prior knowledge that an event may span up to a specific number of K weeks.
  • LTC allows for formation of events (clusters or vertices and edges in the PTG) as long as any two edges sharing the same vertex share a common week. Furthermore, each of the consecutive week within the time period of the cluster must have some edges. This constraint ensures that events are continuous: each consecutive week is supported by some dynamic relationships; and that entities involved have overlapping co-peaking periods.
  • enriches each extracted event with attributes about (1) its entities involvement (which entities are primary participants and which are secondary participants), (2) event confidence (the likelihood of the set of all participating entities being dynamically connected during the particular time interval), (3) description (based on bag of words surrounding mentions of its entities in documents) and (4) popularity (based on the number of documents mentioning keywords in the event description and participating entities; weighted by the relevance of each keyword and the involvement score of each entity).

Experimental Result

The experiment focuses on events about people as entities. A list of entities involving people are obtained from Wikipedia. Two evaluation approach are done: (1) entity-based and (2) list-based.

In the entity-based evaluation, a sample of 30 entities is taken from the list of entities. Gold-set of events about these entities are constructed by examining various web-based sources (Wikipedia, official home pages, news search and web search) to identify all events involving each entity in the sample. For each discovered event, the (1) time period for the event and (2) all other entities involved in the event are recorded. Precision and recall values are measured for matching DROP and Gold-set events: i.e. events that occur in the same time period and have at least one entities in common.

Precision are measured by the fraction of entities in DROP events that are participants of the gold events. Recall are measured by the fraction of entities in gold events that are found in the DROP events.

In the list-based evaluation, a sample of 35 events produced by each method (LTC, GTC, TRJ: simple, time-agnostic clustering of connected vertices in the PTG graph) is taken. For each sample event and its participating entities, a real-world event that best explains the co-occurrence of these entities for the specified time period is identified. A gold event is a subset of entities within the sample event that actually participated in the real-world event. Precision and recall is again measured like in the entity-based evaluation.

In the entity-based evaluation, TRJ has good precision and recall because it produces large, time-agnostic clusters (events) that basically involves most of the entities. Therefore, its recall is high because its large events contain almost all the entities in the gold events. Its precision is also high because it produces large events and some very small ones. Although the large events have low precision, the small ones have high precision; taking the average results in an overall high precision. However, the events produced by TRJ are not meaningful as basically it simply dumps most of the entities which may or may not be temporally related in a cluster.

In the list-based evaluation, LTC and GTC outperform TRJ in both precision and recall. Both LTC and GTC exhibit around 21% higher precision and 47% and 24% higher recall respectively than that of TRJ that has lower precision due to its large clusters: hence the fraction of participants that are actually involved in real-world events are low. TRJ has also lower recall because it produces fewer events.

Discussion

This paper presents an interesting take on the problem of temporal information extraction by finding dynamic events that are not predefined by any schema, rather is a result of entities that co-bursting together in a time period. However, a reader cannot help but question if the paper's assumption is valid that entities which co-burst together in time indeed have a relationship. In the case of celebrities for example, as they mention in the paper, some celebrities appear to be co-occurring a lot together in documents even though they are not involved in any relationships. Buzzy entities (entities that appear a lot in documents no matter when) may also cause an issue because they will appear to co-burst with many other unrelated entities. A better evaluation of the approach may be necessary. The fact that TRJ, which produces less than useful events, results in higher precision and recall in their entity-based evaluation also highlights this need to have a better evaluation approach and measure.

Another drawback to this paper is the lack of standardized data set for training and testing the approach. The choice of entities to list and events to sample seem arbitrary: take a random 30 entities and a random 35 entities (why 30 and 35?). The novelty of the task that this paper is addressing may cause finding standard data set difficult. In future a better and more standardized data set could be created to better train, test, and compare approaches in this area.

Related papers

Due to the novelty of the problem that this paper tries to address, there are still few works that are closely related to it. Perhaps the closest work is by Q. Zhao, P. Mitra, and B. Chen. Temporal and information flow based event detection from social text streams. In AAAI, 2007 where events are detected using keywords and a variety of signals including temporal and social constraints.

Although the paper is novel in terms of its take on temporal information extraction, the idea of using co-occurrence to identify relationships among entities are widespread in other applications. This idea of using distributional clustering has been used in Banko_2007_Open_Information_Extraction_from_the_Web to mine co-occurring entities from large collection of documents while being agnostic to the real world relations causing the co-occurrence. In this paper, instead of document co-occurrence, temporal co-occurrence is used to mine related entities.

Other related work that uses distributional clustering to discover structure in documents is Chambers, N. and Jurafsky, D. Template-based information extraction without the templates, ACL 2011. Although the relationship is not immediately clear (due to the non-temporal nature of the paper), this paper uses very similar ideas of co-occurrence and clustering to discover event templates in documents. It may be interesting to draw the similarities and differences between the two papers.