Allan 1998
Allan James http://dl.acm.org/citation.cfm?id=290954&dl=ACM&coll=DL&CFID=119212228&CFTOKEN=52277574
Citation
@inproceedings{conf/sigir/AllanPL98,
author = {James Allan and Ron Papka and Victor Lavrenko},
title = {On-Line New Event Detection and Tracking},
booktitle = {SIGIR},
year = {1998},
pages = {37-45},
ee = {http://doi.acm.org/10.1145/290941.290954},
crossref = {conf/sigir/98},
}
Abstract
We define and describe the related problems of new event detection and event tracking within a stream of broadcast news stories. We focus on a strict on-line setting, i.e., the system must make decisions about one story before looking at any subsequent stories. Our approach to detection uses a single pass clustering algorithm and a novel thresholding model that incorporates the properties of events as a major component. Our approach to tracking is similar to typical information filtering methods. We discuss the value of surprising features that have unusual occurrence characteristics, and briefly explore on-line adaptive filtering to handle evolving events in the news. New event detection and event tracking are part of the Topic Detection and Tracking (TDT) initiative.
Online version
Data Collection
An important task of the TDT pilot study was the creation of an appropriate test corpus and a useful approach to evaluation of the problem. The goals of creating the corpus and evaluation methodology were two-fold: (1) to make strides toward a solid definition of event as outlined in Section 2.1, and (2) to evaluate how well state-of-the-art approaches could address the TDT tasks. The resulting TDT corpus includes 15,863 news stories spanning July 1, 1994, through June 30, 1995. Half of the stories are randomly chosen Reuters news articles from that period; the other half are transcripts of several CNN broadcast news shows during the same period. The stories are assigned an ordering that represents the order in which they appeared in the news. The average story contains 460 (210 unique) single-word features after stemming and removing stopwords.
Evaluation Metric
In the TDT setting, we have chosen to measure a system's effectiveness primarily by the miss (false negative) and false alarm (false positive or fallout) rates. We skirt the threshold issue by using a Detection Error Tradeoff curve to show how false alarm and miss rates vary with respect to each other at various threshold values.
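The two error rates, and the way they trade off as the threshold moves, can be illustrated with a minimal Python sketch. It assumes each story has a hypothetical novelty score (higher means more likely to contain a new event) and a ground-truth label. The names `error_rates` and `det_points` are illustrative and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def error_rates(scores: Sequence[float], labels: Sequence[bool],
                threshold: float) -> Tuple[float, float]:
    """Return (miss_rate, false_alarm_rate) at one threshold.

    scores[i]: assumed novelty score of story i (higher = more likely new).
    labels[i]: True if story i really contains a new event.
    A story is flagged as new when its score is >= threshold.
    """
    misses = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    false_alarms = sum(1 for s, y in zip(scores, labels)
                       if not y and s >= threshold)
    n_new = sum(1 for y in labels if y)
    n_old = len(labels) - n_new
    miss_rate = misses / n_new if n_new else 0.0
    false_alarm_rate = false_alarms / n_old if n_old else 0.0
    return miss_rate, false_alarm_rate


def det_points(scores: Sequence[float], labels: Sequence[bool],
               thresholds: Sequence[float]) -> List[Tuple[float, float, float]]:
    """One (threshold, miss_rate, false_alarm_rate) triple per threshold."""
    return [(t, *error_rates(scores, labels, t)) for t in thresholds]


if __name__ == "__main__":
    scores = [0.9, 0.3, 0.6, 0.5, 0.1, 0.8]
    labels = [True, False, True, False, False, False]
    for t, miss, fa in det_points(scores, labels, [0.2, 0.55, 0.85]):
        print(f"threshold={t:.2f} miss={miss:.2f} false_alarm={fa:.2f}")
```

Plotting the miss rate against the false alarm rate over many thresholds gives the kind of tradeoff curve described above; raising the threshold lowers false alarms at the cost of more misses.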
Event Detection Method
New event detection operates in a strict on-line setting, processing stories from a news stream one at a time as they arrive. Our approach to the problem is a modification of the well-known single pass clustering algorithm. Our algorithm processes each new story on the stream sequentially, as follows:
- Use feature extraction and selection techniques to build a query representation for the story content.
- Determine the query's initial threshold by evaluating the new story with the query.
- Compare the new story against earlier queries in memory.
- If the story does not trigger any previous query by exceeding its threshold, flag the story as containing a new event.
- If the story triggers an existing query, flag the story as not containing a new event.
- (Optional) Add the story to the agglomeration list of queries it triggered.
- (Optional) Rebuild existing queries using the story.
- Add the new query to memory.
We represent the content of each story (which we assume discusses some event) as a query. If a new story triggers an existing query, the story is assumed to discuss the event represented in the query, otherwise it contains a new event.
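The single pass procedure above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's implementation: queries are raw term-frequency weights over the most frequent terms, a story's score against a query is the sum of the weights of query terms it contains, and the initial threshold is a fixed fraction (`ratio`, a hypothetical parameter) of the story's score against its own query. The optional agglomeration and query-rebuilding steps are omitted.

```python
from collections import Counter
from typing import Dict, List, Tuple

Query = Dict[str, float]


def build_query(tokens: List[str], n_features: int = 5) -> Query:
    """Keep the n most frequent terms as weighted features (assumed: raw tf)."""
    return {t: float(c) for t, c in Counter(tokens).most_common(n_features)}


def evaluate(query: Query, tokens: List[str]) -> float:
    """Score a story against a query: sum of weights of query terms present."""
    present = set(tokens)
    return sum(w for t, w in query.items() if t in present)


def detect_new_events(stories: List[List[str]], ratio: float = 0.5) -> List[bool]:
    """Process stories in arrival order; True means 'flagged as a new event'."""
    memory: List[Tuple[Query, float]] = []  # earlier queries with thresholds
    flags: List[bool] = []
    for tokens in stories:
        query = build_query(tokens)
        # Initial threshold: evaluate the story with its own query (assumed
        # here to be scaled by a fixed ratio).
        threshold = ratio * evaluate(query, tokens)
        # A query is triggered when the story's score exceeds its threshold.
        triggered = [q for q, th in memory if evaluate(q, tokens) > th]
        flags.append(not triggered)
        memory.append((query, threshold))
    return flags


if __name__ == "__main__":
    stream = [
        ["quake", "kobe", "japan", "quake", "damage"],
        ["kobe", "quake", "rescue", "japan"],
        ["bombing", "oklahoma", "federal", "building"],
        ["oklahoma", "bombing", "federal", "suspect"],
    ]
    print(detect_new_events(stream))
```

Each decision uses only the current story and queries already in memory, which is what keeps the sketch consistent with the strict on-line setting.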
Results Analysis
Misses occur when stories containing new events are labeled as not new. When the representation used a small number of features, misses were mostly the result of failing to weight specific event features more heavily than features descriptive of a class of events. The system was unable to detect certain events that are discussed in the news at different levels of granularity. However, we hypothesize that several of the problems revealed in the failure analysis could be resolved with a different weight assignment strategy for query features.
Event Tracking Method
The goal of the system is to begin tracking immediately. Unfortunately, events occur at different times, meaning that it is nearly impossible to use the same training and test set for each event.
Surprising Feature
It is a characteristic of news reporting that stories about the same event often occur in clumps. This effect is particularly true for unexpected events (e.g., disasters or major crimes) where the news media exhibit strong interest in a story and report in nearly endless detail about it. As the triggering event fades into the past, the stories discussing the event similarly fade.
A second characteristic of news coverage is that the people, places, and other items of interest in a story are likely not to have been mentioned very often in the past. This supposition is obviously not true for all features (e.g., the name of the President of the U.S. is likely to reoccur), but there must be something about the story that makes its appearance worthwhile.
Interesting Points
New event detection is an abstract document classification task that we have shown has reasonable solutions using a single pass clustering approach. We have presented an evaluation methodology based on miss and false alarm rates, measures that are more closely related to the task than recall and precision. System misses and false alarms were used to measure detection error in a cross-validation approach that found stable system parameters for our implementation. We described overall system performance using a bootstrap method that produced performance distributions for the TDT corpus.
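As a rough illustration of how a bootstrap can turn a single set of decisions into a performance distribution, the sketch below resamples per-story decisions with replacement and recomputes the miss rate each time. The resampling unit (individual stories), the function name `bootstrap_miss_rates`, and its parameters are assumptions made for this example; the summary above does not specify how the paper's bootstrap was set up.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_miss_rates(decisions: Sequence[Tuple[bool, bool]],
                         n_resamples: int = 1000,
                         seed: int = 0) -> List[float]:
    """Return one miss rate per bootstrap resample.

    decisions: (is_new, flagged_new) pairs, one per story.
    Resamples containing no new-event story are skipped, because the
    miss rate is undefined for them.
    """
    if not decisions:
        return []
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    rates: List[float] = []
    for _ in range(n_resamples):
        sample = [rng.choice(decisions) for _ in decisions]
        flagged_for_new = [flagged for is_new, flagged in sample if is_new]
        if not flagged_for_new:
            continue
        misses = sum(1 for flagged in flagged_for_new if not flagged)
        rates.append(misses / len(flagged_for_new))
    return rates


if __name__ == "__main__":
    decisions = [(True, True), (True, False), (False, False), (False, True)]
    rates = bootstrap_miss_rates(decisions, n_resamples=200, seed=1)
    print(len(rates), min(rates), max(rates))
```

The spread of the returned rates gives a sense of how much the measured miss rate depends on the particular stories in the corpus; the same pattern applies to the false alarm rate.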
Related Work
- A Statistical Model for Popular Events Tracking in Social Communities. Lin et al., KDD 2011. This paper presents a method to observe and track popular events or topics that evolve over time in communities.
- A study on retrospective and online event detection. Yang et al., SIGIR 98. This paper addresses the problems of detecting events in news stories.
- Temporal and information flow based event detection from social text streams. Zhao et al., AAAI 07. This paper addresses the problems of detecting events in news stories.
- Automatic Detection and Classification of Social Events. Agarwal and Rambow, ACL 10. This paper aims at detecting and classifying social events using tree kernels.
- Detecting controversial events from Twitter. Popescu and Pennacchiotti, CIKM 10. This paper addresses the task of identifying controversial events using Twitter as a starting point.
- Information credibility on twitter. Castillo et al., WWW 11. The authors develop a general approach to change-point detection that generalizes across a wide range of applications.