Allan 1998

From Cohen Courses
Revision as of 00:34, 2 October 2012 by Zhouyu (talk | contribs) (→‎Feature)

Allan James http://dl.acm.org/citation.cfm?id=290954&dl=ACM&coll=DL&CFID=119212228&CFTOKEN=52277574

Citation

@inproceedings{conf/sigir/AllanPL98,
 author    = {James Allan and
              Ron Papka and
              Victor Lavrenko},
 title     = {On-Line New Event Detection and Tracking},
 booktitle = {SIGIR},
 year      = {1998},
 pages     = {37-45},
 ee        = {http://doi.acm.org/10.1145/290941.290954},
 crossref  = {conf/sigir/98},
}

Abstract

We define and describe the related problems of new event detection and event tracking within a stream of broadcast news stories. We focus on a strict on-line setting, i.e., the system must make decisions about one story before looking at any subsequent stories. Our approach to detection uses a single pass clustering algorithm and a novel thresholding model that incorporates the properties of events as a major component. Our approach to tracking is similar to typical information filtering methods. We discuss the value of surprising features that have unusual occurrence characteristics, and briefly explore on-line adaptive filtering to handle evolving events in the news. New event detection and event tracking are part of the Topic Detection and Tracking (TDT) initiative.

Online version

pdf link to the paper

Data Collection

An important task of the TDT pilot study was the creation of an appropriate test corpus and a useful approach to evaluation of the problem. The goals of creating the corpus and evaluation methodology were two-fold: (1) to make strides toward a solid definition of event as outlined in Section 2.1, and (2) to evaluate how well state of the art approaches could address the TDT tasks. The resulting TDT corpus includes 15,863 news stories spanning July 1, 1994, through June 30, 1995. Half of the stories are randomly chosen Reuters news articles from that period; the other half are transcripts of several CNN broadcast news shows during the same period. The stories are assigned an ordering that represents the order in which they appeared in the news. The average story contains 460 (210 unique) single-word features after stemming and removing stopwords.

Evaluation Metric

In the TDT setting, we have chosen to measure a system's effectiveness primarily by the miss (false negative) and false alarm (false positive or fallout) rates. We skirt the threshold issue by using a Detection Error Tradeoff curve to show how false alarm and miss rates vary with respect to each other at various threshold values.
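As a rough illustration of these two error rates, the sketch below computes the miss and false alarm rates at a given threshold and sweeps the threshold to get the points of a Detection Error Tradeoff curve. The function names, the convention that a higher score means "more likely a new event", and the toy data are assumptions made for this example; they are not taken from the paper.

```python
def miss_and_false_alarm(scores, is_new, threshold):
    """Return (miss_rate, false_alarm_rate) at one threshold.

    scores: per-story novelty scores; higher means "more likely new" (assumed).
    is_new: ground-truth flags, True if the story really contains a new event.
    A story is flagged as new when its score is >= threshold.
    """
    n_new = sum(1 for t in is_new if t)
    n_old = len(is_new) - n_new
    # Miss: a truly new story that the system did not flag.
    misses = sum(1 for s, t in zip(scores, is_new) if t and s < threshold)
    # False alarm: an old story that the system flagged as new.
    false_alarms = sum(1 for s, t in zip(scores, is_new) if not t and s >= threshold)
    miss_rate = misses / n_new if n_new else 0.0
    fa_rate = false_alarms / n_old if n_old else 0.0
    return miss_rate, fa_rate


def det_points(scores, is_new):
    """Return (threshold, miss_rate, false_alarm_rate) for each distinct score."""
    return [(t, *miss_and_false_alarm(scores, is_new, t))
            for t in sorted(set(scores))]


# Toy data (hypothetical): two new-event stories and two old ones.
scores = [0.9, 0.2, 0.6, 0.4]
is_new = [True, False, True, False]
for point in det_points(scores, is_new):
    print(point)
```

Plotting the miss rate against the false alarm rate over these points shows the tradeoff without committing to one threshold.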


Event Detection Method

New event detection operates in a strict on-line setting, processing stories from a news stream one at a time as they arrive. Our approach to the problem is a modification of the well-known single pass clustering algorithm. Our algorithm processes each new story on the stream sequentially, as follows:

  • Use feature extraction and selection techniques to build a query representation for the story content.
  • Determine the query initial threshold by evaluating the new story with the query.
  • Compare the new story against earlier queries in memory.
  • If the story does not trigger any previous query by exceeding its threshold, flag the story as containing a new event.
  • If the story triggers an existing query, flag the story as not containing a new event.
  • (Optional) Add the story to the agglomeration list of queries it triggered.
  • (Optional) Rebuild existing queries using the story.
  • Add new query to memory.

We represent the content of each story (which we assume discusses some event) as a query. If a new story triggers an existing query, the story is assumed to discuss the event represented in the query, otherwise it contains a new event.
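The required steps of the algorithm above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the whitespace tokenizer, the raw-count feature weights, the cosine similarity, and the `threshold_fraction` parameter are all stand-ins chosen for this example, and the optional agglomeration and query-rebuilding steps are left out.

```python
import math
from collections import Counter


def build_query(text, max_features=50):
    """Build a query from the story's most frequent tokens (assumed weighting)."""
    counts = Counter(text.lower().split())
    return dict(counts.most_common(max_features))


def similarity(query, story_counts):
    """Cosine similarity between a query and a story (assumed scoring function)."""
    dot = sum(w * story_counts.get(f, 0) for f, w in query.items())
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    s_norm = math.sqrt(sum(c * c for c in story_counts.values()))
    return dot / (q_norm * s_norm) if q_norm and s_norm else 0.0


def detect_new_events(stories, threshold_fraction=0.5):
    """Process stories strictly in order; return True for each story flagged as new."""
    memory = []  # (query, threshold) pairs for every story seen so far
    flags = []
    for text in stories:
        counts = Counter(text.lower().split())
        query = build_query(text)
        # Initial threshold: evaluate the story against its own query,
        # then scale it (the scaling factor is a hypothetical parameter).
        threshold = threshold_fraction * similarity(query, counts)
        # The story triggers an earlier query if it exceeds that query's threshold.
        triggered = any(similarity(q, counts) > th for q, th in memory)
        flags.append(not triggered)
        # Add the new query to memory only after the decision is made.
        memory.append((query, threshold))
    return flags


stories = [
    "earthquake hits city center",
    "earthquake hits city again",
    "election results announced today",
]
print(detect_new_events(stories))  # [True, False, True]
```

Each decision uses only the queries built from earlier stories, which is what keeps the procedure strictly on-line.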

Results Analysis

Misses occur when stories containing new events are labeled as not new. When the representation used a small number of features, misses were mostly the result of failing to weight specific event features more heavily than features descriptive of a class of events. The system was also unable to detect certain events that are discussed in the news at different levels of granularity. However, we hypothesize that several of the problems revealed in the failure analysis could be resolved with a different weight assignment strategy for query features.

Event Tracking Method

The goal of the system is to begin tracking immediately. Unfortunately, events occur at different times, meaning that it is nearly impossible to use the same training and test set for each event.

Surprising Feature

It is a characteristic of news reporting that stories about the same event often occur in clumps. This effect is particularly true for unexpected events (e.g., disasters or major crimes) where the news media exhibit strong interest in a story and report in nearly endless detail about it. As the triggering event fades into the past, the stories discussing the event similarly fade.

A second characteristic of news coverage is that the people, places, and other items of interest in a story are likely not to have been mentioned very often in the past. This supposition is obviously not true for all features (e.g., the name of the President of the U.S. is likely to reoccur), but there must be something about the story that makes its appearance worthwhile.
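One simple way to make the second characteristic concrete is to keep only the features of a new story that appeared in a small fraction of earlier stories. The sketch below is an illustration of that idea only, not the paper's method; the function name, the `max_past_df` cutoff, and the toy data are assumptions for this example.

```python
def surprising_features(story_tokens, past_stories, max_past_df=0.01):
    """Return the story's features that were rare in earlier stories.

    story_tokens: tokens of the new story.
    past_stories: list of token sets, one per earlier story.
    max_past_df: largest allowed fraction of past stories containing the
                 feature (hypothetical cutoff).
    """
    n_past = len(past_stories)
    surprising = []
    for feature in sorted(set(story_tokens)):
        past_df = sum(1 for tokens in past_stories if feature in tokens)
        # With no history at all, every feature counts as surprising.
        if n_past == 0 or past_df / n_past <= max_past_df:
            surprising.append(feature)
    return surprising


# Toy history (hypothetical): "president" recurs, the other terms do not.
past = [
    {"president", "speech"},
    {"president", "budget"},
    {"president", "visit"},
    {"weather", "storm"},
]
story = ["president", "volcano", "eruption"]
print(surprising_features(story, past, max_past_df=0.25))  # ['eruption', 'volcano']
```

A frequently recurring term such as "president" is filtered out, while terms with no history are kept as candidates for identifying the event.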

Interesting Points

New event detection is an abstract document classification task that we have shown has reasonable solutions using a single pass clustering approach. We have presented an evaluation methodology based on miss and false alarm rates, measures that are more closely related to the task than recall and precision. System misses and false alarms were used to measure detection error in a cross-validation approach that found stable system parameters for our implementation. We described overall system performance using a bootstrap method that produced performance distributions for the TDT corpus.
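The idea of a bootstrap performance distribution can be sketched as follows: resample the per-story outcomes with replacement many times and recompute the miss rate on each resample. This is a generic bootstrap written for illustration; the resampling unit, the number of resamples, and the names used here are assumptions, and the paper's exact procedure may differ.

```python
import random


def bootstrap_miss_rates(outcomes, n_resamples=1000, seed=0):
    """Return a list of miss rates, one per bootstrap resample.

    outcomes: list of (is_new, flagged_new) pairs, one per story.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    rates = []
    for _ in range(n_resamples):
        # Draw as many stories as the original set, with replacement.
        sample = [rng.choice(outcomes) for _ in outcomes]
        flagged_for_new = [flagged for is_new, flagged in sample if is_new]
        if not flagged_for_new:
            continue  # no new-event stories in this resample; miss rate undefined
        misses = sum(1 for flagged in flagged_for_new if not flagged)
        rates.append(misses / len(flagged_for_new))
    return rates


# Toy outcomes (hypothetical): one hit, one miss, one correct reject, one false alarm.
outcomes = [(True, True), (True, False), (False, False), (False, True)]
rates = bootstrap_miss_rates(outcomes, n_resamples=200, seed=1)
print(min(rates), max(rates), sum(rates) / len(rates))
```

The spread of the resulting values gives a sense of how much the measured miss rate depends on the particular stories in the corpus.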

Related Work