Learning Domain-Specific Information Extraction Patterns from the Web

Revision as of 17:08, 3 October 2011

Citation

Siddharth Patwardhan and Ellen Riloff, "Learning Domain-Specific Information Extraction Patterns from the Web", IEBeyondDoc '06 Proceedings of the Workshop on Information Extraction Beyond The Document

Online version

Click here to download

Introduction

This paper addresses automatic pattern extraction from the web for domain-specific information extraction (IE). The domain under consideration is "terrorist events". The authors start with seed patterns extracted from the MUC-4 terrorism corpus, and then search the web for more similar patterns that have semantic affinity. The similarity metric used is pointwise mutual information (PMI). The patterns retrieved from the web are then used together with the seed patterns to extract the required information from the MUC-4 terrorism corpus.
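As a rough illustration of the similarity metric named above, the sketch below computes the textbook form of pointwise mutual information from co-occurrence counts. It is not necessarily the exact formulation used in the paper; the function name `pmi`, its parameters, and the counts are hypothetical and chosen only for illustration.

```python
import math


def pmi(joint_count: int, count_x: int, count_y: int, total: int) -> float:
    """Textbook PMI: log2( P(x, y) / (P(x) * P(y)) ), estimated from counts.

    joint_count: number of observations where x and y occur together
    count_x, count_y: number of observations containing x (resp. y)
    total: total number of observations
    """
    if joint_count == 0:
        # Assumption of this sketch: items that never co-occur get -inf.
        return float("-inf")
    # P(x,y) / (P(x) P(y)) simplifies to joint_count * total / (count_x * count_y).
    return math.log2(joint_count * total / (count_x * count_y))


# Hypothetical counts: x and y co-occur ten times more often than chance.
score = pmi(joint_count=10, count_x=20, count_y=50, total=1000)
```

A positive score means the two items co-occur more often than they would if they were independent, a score of zero means they look independent, and a negative score means they co-occur less often than chance.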

Dataset

The dataset is the MUC-4 terrorism corpus, which contains 1700 terrorism stories, most of them news stories about Latin American terrorism. Each story also has answer key templates that contain the information that is supposed to be extracted from that story. According to the authors' analysis, the dataset is difficult for an IE task because all of the text is in upper case and nearly half of the stories do not pertain to a terrorist event. Even in the remaining half, many stories describe multiple terrorist events.

Extracting Seed Patterns

The authors used the AutoSlog-TS system [1] to extract the seed patterns from the MUC-4 corpus. AutoSlog-TS works by extracting syntactic patterns for all the noun phrases present in a text. Patterns are extracted both from text that is relevant to the domain and from text that is irrelevant to it, and a ranked list of the patterns is prepared based on a relevance score. The relevance score that the authors used for this task is the RlogF score, which is defined as:


$$\mathrm{RlogF}(pattern_i) = \frac{relfreq_i}{totalfreq_i} \cdot \log_2(relfreq_i)$$

where $relfreq_i$ is the frequency of the i-th pattern in the text that is relevant to the domain, and $totalfreq_i$ is the frequency of the i-th pattern in the whole corpus.
The noun phrases are identified by a heuristic algorithm, and the typical patterns extracted are of the type "died in <np>", "<group> claimed responsibility", etc.
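The ranking step can be sketched as follows, assuming the usual form of the score, RlogF = (relfreq / totalfreq) * log2(relfreq). The function names, the convention of scoring unseen patterns as 0, and the counts in the example are assumptions made for illustration; they are not taken from the paper or from AutoSlog-TS.

```python
import math


def rlogf(relfreq: int, totalfreq: int) -> float:
    """RlogF score: (relfreq / totalfreq) * log2(relfreq).

    relfreq: occurrences of the pattern in domain-relevant text
    totalfreq: occurrences of the pattern in the whole corpus
    """
    if relfreq <= 0 or totalfreq <= 0:
        # Assumption of this sketch: a pattern never seen in relevant
        # text scores 0, since log2(0) is undefined.
        return 0.0
    return (relfreq / totalfreq) * math.log2(relfreq)


def rank_patterns(counts: dict[str, tuple[int, int]]) -> list[tuple[str, float]]:
    """Return (pattern, score) pairs sorted from highest to lowest RlogF."""
    scored = [(pattern, rlogf(rel, tot)) for pattern, (rel, tot) in counts.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical (relfreq, totalfreq) counts, invented for illustration only.
example_counts = {
    "died in <np>": (8, 10),
    "<group> claimed responsibility": (4, 4),
    "said <np>": (16, 200),
}
ranking = rank_patterns(example_counts)
```

With these made-up counts, "said <np>" is the most frequent pattern overall but ranks last, because only a small fraction of its occurrences fall in relevant text. The score rewards patterns that are both frequent and strongly tied to the domain.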

Extracting text of Terrorism Domain from the web

To extract more patterns relevant to the terrorism domain, the first step was to retrieve data from the web that actually pertained to this domain. The authors didn't bother themselves with finding

Extracting similar patterns from the web-text

Applying the extracted patterns for IE task

Results

References