Learning Domain-Specific Information Extraction Patterns from the Web
Citation
Siddharth Patwardhan and Ellen Riloff, "Learning Domain-Specific Information Extraction Patterns from the Web", IEBeyondDoc '06 Proceedings of the Workshop on Information Extraction Beyond The Document
Online version
Introduction
This paper addresses the automatic learning of extraction patterns from the Web for domain-specific information extraction (IE). The domain under consideration is terrorist events. The authors start with seed patterns extracted from the MUC-4 terrorism corpus, then search the Web for additional, similar patterns that have semantic affinity. Pointwise mutual information (PMI) is the similarity metric. The patterns retrieved from the Web are then combined with the seed patterns, and the full set is used to extract the required information from the MUC-4 terrorism corpus.
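As a rough illustration of how pointwise mutual information can rank candidate patterns against seed patterns, the sketch below estimates PMI(x, y) = log2(P(x, y) / (P(x) P(y))) from document counts. It is not the authors' exact formulation: the document-level co-occurrence assumption, the pattern strings, the `pmi` helper, and all counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def pmi(joint_count, count_x, count_y, total):
    """Estimate PMI(x, y) = log2(P(x, y) / (P(x) * P(y))) from raw counts.

    The probabilities are count / total, so the ratio simplifies to
    (joint_count * total) / (count_x * count_y).
    """
    if joint_count == 0 or count_x == 0 or count_y == 0:
        # No co-occurrence observed: treat as no evidence of association.
        return float("-inf")
    return math.log2((joint_count * total) / (count_x * count_y))


# Hypothetical counts over a collection of 1000 web documents.
total_docs = 1000
seed_docs = 100  # documents in which some seed pattern fires

# candidate pattern -> (documents containing it, documents containing it AND a seed)
candidates = {
    "<subj> was kidnapped": (50, 40),
    "<subj> said": (400, 40),
}

scores = {
    pattern: pmi(joint, count, seed_docs, total_docs)
    for pattern, (count, joint) in candidates.items()
}

# Rank candidates by their association with the seed patterns.
ranked = sorted(scores, key=scores.get, reverse=True)

for pattern in ranked:
    print(f"{pattern}: {scores[pattern]:.2f}")
```

Under these made-up counts, the domain-specific pattern scores higher than the generic one even though both co-occur with seed patterns in the same number of documents, because PMI discounts patterns that are frequent everywhere.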
Dataset
The dataset is the MUC-4 terrorism corpus, which contains 1700 terrorism stories, most of them news stories about Latin American terrorism. Each story comes with answer key templates that contain the information that should be extracted from it. According to the authors' analysis, the dataset is difficult for an IE task because all of the text is in upper case and nearly half of the stories do not pertain to a terrorist event. Even among the remaining stories, which do pertain to terrorist events, many describe multiple terrorist events.