Learning Domain-Specific Information Extraction Patterns from the Web

== Citation ==

Siddharth Patwardhan and Ellen Riloff, "Learning Domain-Specific Information Extraction Patterns from the Web", IEBeyondDoc '06: Proceedings of the Workshop on Information Extraction Beyond The Document.

== Online version ==

Click here to download
== Introduction ==
 
This [[Category::paper]] aims at [[AddressesProblem::Automatic Pattern Extraction]] from the web for the task of domain-specific [[AddressesProblem::Information Extraction]]. The domain under consideration was "terrorist events". The authors started with seed patterns extracted from the MUC-4 terrorism corpus, and then searched the web to extract additional, similar patterns that had the required [[UsesMethod::Semantic Affinity]] for the semantic classes identified for the terrorism domain. The similarity metric used was [[UsesMethod::Pointwise mutual information]]. After retrieving these additional patterns from the web, the complete set of identified patterns was used to extract the required information from the MUC-4 terrorism corpus.
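As a rough illustration of how these pieces fit together (this is not the authors' code; the pattern strings, the co-occurrence set and the affinity scores below are hypothetical placeholders for the components described in the sections that follow), the final pattern set can be thought of as the seed patterns plus the top-ranked web-learned patterns:

<pre>
def combine_pattern_sets(seed_patterns, web_patterns, cooccurring, affinity, top_n):
    """Return the final pattern set: seed patterns plus the top-n web-learned patterns.

    seed_patterns : patterns learned from the MUC-4 corpus (via AutoSlog-TS).
    web_patterns  : candidate patterns harvested from the terrorism-related web articles.
    cooccurring   : set of web patterns that shared a sentence with some seed pattern.
    affinity      : dict mapping each web pattern to its semantic affinity score.
    """
    candidates = [p for p in web_patterns if p in cooccurring]
    ranked = sorted(candidates, key=lambda p: affinity.get(p, 0.0), reverse=True)
    return list(seed_patterns) + ranked[:top_n]

# Hypothetical toy example (pattern strings are made up):
seeds = ["was kidnapped by <np>", "murder of <np>"]
web = ["exploded in <np>", "<np> was ambushed", "traveled to <np>"]
print(combine_pattern_sets(seeds, web,
                           cooccurring={"exploded in <np>", "<np> was ambushed"},
                           affinity={"exploded in <np>": 3.2, "<np> was ambushed": 2.1},
                           top_n=1))
</pre>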
 
==Dataset==
 
The dataset used was the [[UsesDataset::MUC]]-4 terrorism corpus, which contains 1700 terrorism stories, most of them news stories related to Latin American terrorism. Each story also has answer key templates containing the information that is supposed to be extracted from that story. Per the authors' analysis, the dataset is difficult for an IE task because all of the text is in upper case and nearly half of the stories do not pertain to a terrorist event; even among the stories that do, many describe multiple terrorist events. In addition, the authors downloaded 6182 news articles related to terrorism from the CNN News website (cnn.com) and used them for the task of extracting more patterns.
  
==Extracting Seed Patterns==
The authors used the AutoSlog-TS system [1] to extract the seed patterns from the MUC-4 corpus. AutoSlog-TS works by extracting syntactic patterns for all the noun phrases present in a text. These patterns are extracted both from text that is relevant to the domain and from text that is irrelevant to it, and a ranked list of patterns is prepared based on a relevance score. The relevance score the authors used for this task was the RlogF score, which is defined as:<br>
<math>RlogF(pattern_i) = \log_2(relfreq_i) \cdot P(relevant \mid pattern_i)</math><br>
where <math>relfreq_i</math> is the frequency of the i-th pattern in the text that is relevant to the domain, and <math>P(relevant \mid pattern_i)</math> is estimated as <math>relfreq_i</math> divided by the frequency of the i-th pattern in the whole corpus.<br>
The noun phrases are identified using a heuristic algorithm, and the typical patterns extracted are of the form "died in <np>", "<group> claimed responsibility", etc.
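A minimal sketch of the RlogF ranking, assuming the patterns have already been extracted and counted in the domain-relevant texts and in the whole corpus (the pattern strings and counts below are made up for illustration; AutoSlog-TS itself is not reimplemented here):

<pre>
import math
from collections import Counter

def rlogf_scores(relevant_counts, total_counts):
    """Rank patterns by RlogF = log2(relfreq_i) * P(relevant | pattern_i).

    relevant_counts : Counter mapping pattern -> frequency in domain-relevant texts.
    total_counts    : Counter mapping pattern -> frequency in the whole corpus
                      (assumed to contain every pattern seen in the relevant texts).
    """
    scores = {}
    for pattern, rel_freq in relevant_counts.items():
        if rel_freq == 0:
            continue
        p_relevant = rel_freq / total_counts[pattern]   # P(relevant | pattern_i)
        scores[pattern] = math.log2(rel_freq) * p_relevant
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy counts:
relevant = Counter({"died in <np>": 40, "<group> claimed responsibility": 25, "went to <np>": 5})
total = Counter({"died in <np>": 50, "<group> claimed responsibility": 30, "went to <np>": 200})
print(rlogf_scores(relevant, total))
</pre>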
==Extracting Text of the Terrorism Domain from the Web==
To extract more patterns relevant to the terrorism domain, the first step was to retrieve data from the web that actually pertained to this domain. The authors did not attempt full web-page classification into text that is relevant or irrelevant to the domain. Instead, they queried the CNN news site (cnn.com) with specific search queries aimed at retrieving terrorism-related articles (using the Google search APIs), and collected 6182 news articles related to terrorism.
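Purely as an illustration of this retrieval strategy (the authors' actual query terms are not listed in this summary, so the keywords below are hypothetical, and the step of submitting the queries to the search API is omitted), site-restricted queries might be assembled as follows:

<pre>
# Hypothetical terrorism-related keywords; the authors' actual query terms are not given here.
QUERY_TERMS = ["bombing", "kidnapping", "assassination", "terrorist attack", "hostage"]

def build_queries(site="cnn.com", terms=QUERY_TERMS):
    """Build site-restricted search queries aimed at collecting domain-relevant articles.

    The resulting strings would be submitted to a web search API (the authors used the
    Google search APIs against cnn.com); that submission step is not shown here.
    """
    return ['site:{} "{}"'.format(site, term) for term in terms]

for query in build_queries():
    print(query)
</pre>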
==Extracting similar patterns from the web-text==
As was done with the text in the MUC-4 corpus, all possible patterns were extracted from the news stories downloaded from the web. The task was then to identify the patterns among these web news stories that were similar to, and as useful as, the seed patterns. For this task, a two-step approach was taken. First, similar patterns from the web text were selected using PMI as the metric: if a pattern in a news story co-occurred with a seed pattern in the same sentence, that pattern was selected in this first step. In the second step, the "semantic affinity" of each selected pattern was calculated. Semantic affinity measures how strongly a pattern relates to a particular semantic class; in other words, it judges the pattern's capability to extract information relevant to that class. In the context of the terrorism domain, the identified semantic classes were: target, victim, perpetrator, organization, weapon and other. Mathematically, the semantic affinity of a pattern is defined as:<br>
<math>affinity_{pattern} = \frac{f_{class}}{f_{all}} \cdot \log_2 f_{class}</math><br>
where <math>f_{class}</math> is the frequency of occurrence of the pattern with a noun phrase from the semantic class "class", and <math>f_{all}</math> is the total frequency of occurrence of that pattern in the corpus.
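A minimal sketch of the semantic affinity computation, assuming we already know, for a given pattern, how often its extractions fell into each semantic class (the counts and the example pattern below are hypothetical):

<pre>
import math

SEMANTIC_CLASSES = ["target", "victim", "perpetrator", "organization", "weapon", "other"]

def semantic_affinity(class_counts, semantic_class):
    """affinity = (f_class / f_all) * log2(f_class) for one pattern and one class.

    class_counts : dict mapping semantic class -> number of times the pattern
                   extracted a noun phrase of that class.
    """
    f_class = class_counts.get(semantic_class, 0)
    f_all = sum(class_counts.values())
    if f_class == 0 or f_all == 0:
        return 0.0
    return (f_class / f_all) * math.log2(f_class)

# Hypothetical counts for a single pattern, e.g. "attack on <np>":
counts = {"target": 30, "victim": 6, "perpetrator": 2, "other": 10}
for c in SEMANTIC_CLASSES:
    print(c, round(semantic_affinity(counts, c), 3))
</pre>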
==Experiments and Results==
 
The complete set of extracted patterns (the seed patterns and those learnt from the web data) was used to identify target and victim information in the MUC-4 test corpus. The average results in Precision, Recall and F-score are presented below. The baseline scores are for the experiment with just the seed patterns; the n+baseline scores are for information extraction with the seed patterns plus n patterns from the larger set learnt from the web data. Note that these learnt patterns were ranked according to their semantic affinity scores.<br>
[[File:riloff.jpg]]
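For reference, a minimal sketch of how precision, recall and F-score could be computed for extracted slot fills against the answer keys; this treats each slot as a flat set of strings, which is a simplification of the MUC-4 answer-key template scoring actually used (the example fills are hypothetical):

<pre>
def precision_recall_f1(extracted, answer_key):
    """Score a set of extracted strings against the answer-key strings for one slot."""
    extracted, answer_key = set(extracted), set(answer_key)
    correct = len(extracted & answer_key)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(answer_key) if answer_key else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical example for the "victim" slot of one story:
print(precision_recall_f1({"mayor", "two peasants"}, {"mayor", "two peasants", "a judge"}))
</pre>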
  
==References==
[1] [http://www.cs.utah.edu/~riloff/pdfs/official-sundance-tr.pdf Ellen Riloff and William Phillips, "An Introduction to the Sundance and AutoSlog Systems"]
 
