Difference between revisions of "Bellare 2009 generalized expectation criteria for bootstrapping extractors using record text alignment"

Latest revision as of 13:09, 2 November 2011

Citation

Generalized Expectation Criteria for Bootstrapping Extractors using Record-Text Alignment, by K. Bellare, A. McCallum. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009.

Online version

This Paper is available online [1].

Summary

This paper presents a novel approach using Generalized Expectation Criteria to train a Conditional Random Field model for an IE task. In a setting where there exists a database, the authors train a CRF model for alignment with the unlabeled text using generalized expectation. Also, based on the alignment model, they also propose a usual 1st order CRF model that can extract information without relying on a DB record.

Brief description of the method

The paper present two CRF models: AlignCRF and ExtrCRF. The first is a zero-order CRF model used to predict labels for a text sequence given a matching DB record. The other model is a first-order linear-chain CRF to extract when there is no DB record to match.

Features

Extraction features (in AlignCRF and ExtrCRF)
- regular expressions detecting tokens containing all characters, all digits, or all alphanumeric
- number of characters and digits in the token (ex. [NUMCHAR=3, NUMDIGITS=1])
- domain-specific patterns for 'date', and 'pages'
- token identity, prefix/suffix, character n-grams
- presence of a token in lexicons such as "last names", "publisher names", "cities
- lexicon features within a window of 10
- regular expression feature within a window of 10
- token identity features within a window of 3

Alignment features (in AlignCRF)
- exact token match
- approximate token match after binning Jaro-Winkler edit distance between tokens
- substring token match
- prefix/suffix token match (if the prefixes/suffixes match for lengths 1,2,3,4)
- exact and approximate token matches at offsets (-1,-1) and (+1,+1) around the alignment

AlignCRF

For a database record with token sequence and label sequence $(\mathbf {x} _{1},\mathbf {y} _{1})=(<{x}_{1}[1],{x}_{1}[2],...,{x}_{1}[m]>,<{y}_{1}[1],...,{y}_{1}[m]>)$ , a text sequence $\mathbf {x} _{2}=<{x}_{2}[1],{x}_{2}[2],...,{x}_{2}[m]>$ and an alignment sequence $\mathbf {a} =<a_{1},...,a_{n}>$ where $\quad a_{i}=j$ indicates $\quad (x_{1}[j],y_{1}[j])$ is assigned to the token $\quad x_{2}[i]$ , the alignment model defines a probability distribution of the alignment sequence conditioned on the database record and the text sequence $(\mathbf {x} _{1},\mathbf {y} _{1},\mathbf {x} _{2})$ as

$p_{\Theta }(\mathbf {a} \vert \mathbf {x} _{1},\mathbf {y} _{1},\mathbf {x} _{2};\Theta )={exp(\sum _{t=1}^{n}{\Theta ^{\mathrm {T} }{\vec {f}}(a',\mathbf {x} _{1},\mathbf {y} _{1},\mathbf {x} _{2},t)}) \over Z_{\Theta }(\mathbf {x} _{1},\mathbf {y} _{1},\mathbf {x} _{2})}$

$p_{\Theta }(a_{t}\vert \mathbf {x} _{1},\mathbf {y} _{1},\mathbf {x} _{2};\Theta )={exp(\Theta ^{\mathrm {T} }{\vec {f}}(a_{t},\mathbf {x} _{1},\mathbf {y} _{1},\mathbf {x} _{2},t)) \over exp(\sum _{a'}{\Theta ^{\mathrm {T} }{\vec {f}}(a_{t},\mathbf {x} _{1},\mathbf {y} _{1},\mathbf {x} _{2},t)})}$

${\partial p_{\Theta }(a_{t}\vert \mathbf {x} _{1},\mathbf {y} _{1},\mathbf {x} _{2};\Theta ) \over \partial \Theta }=p_{\Theta }(a_{t}\vert \mathbf {x} _{1},\mathbf {y} _{1},\mathbf {x} _{2};\Theta )[{\vec {f}}(a_{t},\mathbf {x} _{1},\mathbf {y} _{1},\mathbf {x} _{2},t)-E_{p_{\Theta }(a)}({\vec {f}}(a,\mathbf {x} _{1},\mathbf {y} _{1},\mathbf {x} _{2},t))]$

The expectation criteria is determined in the following manner: for expectation features, for each label, the top N extraction features are selected by mutual information with that label. Also, the top N alignment features that have highest mutual information with correct labeling are selected as alignment criteria. The target expectations of these criteria are binned into 11 bins [0.05, 0.1, 0.2, ..., 0.9, 0.95]. [[2]] is a complete list of expectation criteria.

Given expectation criteria ${\mathcal {C}}=<\mathbf {F} ,\mathbf {P} ,\mathbf {W} >$ where $\mathbf {F} =<f_{1},...,f_{l}>$ is a list of binary feature functions, $\mathbf {P} =<p_{1},...,p_{l}>$ is target expectations, $\mathbf {W} =<w_{1},...,w_{l}>$ is weights, the objective function to optimize is the following:

${\mathcal {O}}(\theta ;{\mathcal {D}},{\mathcal {C}})=\max _{\Theta }{-\sum _{i=1}^{l}{\Delta (f_{i},p_{i},w_{i},\Theta )}-{{||\Theta ||^{2}} \over {2}}}$

where $\Delta (f,p,w,\Theta )=w({E_{p_{\Theta }}(\mathbf {A} _{f}) \over |\mathbf {A} _{f}|}-p)^{2}$ is the squared divergence and $\mathbf {A} _{f}$ is the alignment latent variables to apply expectation criteria. The authors use the L-BFGS algorithm to maximize the objective function.

ExtrCRF

This model, whose probability is denoted by $p_{\Lambda }$ , is trained by minimizing the following objective function using L-BFGS.

$O(\Lambda ;{\mathcal {D}},\Theta )=\min _{\Lambda }{\sum _{i=1}^{K}{KL(p_{\Theta }(\mathbf {y} \vert \mathbf {X} _{2}^{(i)})\vert \vert p_{\Lambda }(\mathbf {y} \vert \mathbf {X} _{2}^{(i)}))}+||\Lambda ||^{2}/2}$

Experimental Result

The authors tested this method on DBLP bibliographic database on a pruned label set (author, title, date, venue, volume, number, pages, editor, publisher, series, O). AlignCRF clearly outperforms other models, and ExtrCRF also achieves an error reduction of 20~35% compared to other methods. An interesting thing to note is that there is no evident decrease in the performance of AlignCRF compared to ExtrCRF, although AlignCRF is not using DB records. This is due to the benefit of having higher order (first-order) model and using noisy DB records in the test set for alignment.

@@ Line 1: / Line 1: @@
-A summary is coming soon from [[User::Daegunw]]!
 == Citation ==
@@ Line 11: / Line 9: @@
 == Summary ==
-This [[Category::paper]] presents an [[UsesMethod::Active Learning]] approach that is not fully supervised. In this paper, the authors propose a semi-supervised approach where only some of the sequences are asked to be labeled. Assuming that there are subsequences that the model is confident about the labels even in a sequence that is uncertain as a whole, it only asks for labels for the subsequence the model is uncertain about and the rest is labeled using the current classifier. From their experiment this approach could save about 50~60% annotation labor over fully supervised active learning in the sequential learning settings.
+This [[Category::paper]] presents a novel approach using [[UsesMethod::Generalized Expectation Criteria]] to train a [[UsesMethod::Conditional Random Field]] model for an IE task. In a setting where there exists a database, the authors train a CRF model for alignment with the unlabeled text using generalized expectation. Also, based on the alignment model, they also propose a usual 1st order CRF model that can extract information without relying on a DB record.
 == Brief description of the method ==
-The method is a pretty simple extension of a standard active learning method. The following figure describes the general active learning framework.
+The paper present two CRF models: AlignCRF and ExtrCRF. The first is a zero-order CRF model used to predict labels for a text sequence given a matching DB record. The other model is a first-order linear-chain CRF to extract when there is no DB record to match.
-[[File:Tomanek ACL2009.png]]
+=== Features ===
-The authors refer the usual active learning mode as fully supervised active learning (FuSAL). The utility function used in FuSAL is
+* Extraction features (in AlignCRF and ExtrCRF)
+** regular expressions detecting tokens containing all characters, all digits, or all alphanumeric
+** number of characters and digits in the token (ex. [NUMCHAR=3, NUMDIGITS=1])
+** domain-specific patterns for 'date', and 'pages'
+** token identity, prefix/suffix, character n-grams
+** presence of a token in lexicons such as "last names", "publisher names", "cities
+** lexicon features within a window of 10
+** regular expression feature within a window of 10
+** token identity features within a window of 3
-<math>U_{\mathbf{\lambda}}(\mathbf{x}) = 1 - P_{\mathbf{\lambda}}(\mathbf{y}^{*}\vert\mathbf{x})</math>
+* Alignment features (in AlignCRF)
+** exact token match
+** approximate token match after binning Jaro-Winkler edit distance between tokens
+** substring token match
+** prefix/suffix token match (if the prefixes/suffixes match for lengths 1,2,3,4)
+** exact and approximate token matches at offsets (-1,-1) and (+1,+1) around the alignment
-which makes the sampling method as an uncertainty sampling method.
+=== AlignCRF ===
-The problem of FuSAL in the sequence labeling scenario is that an example that has a high utility can still have parts of it that the current model can label very well, thus not contribute much to the utility of the whole. Therefore, it means we can leave some of the labels that the current model labeled if the confidence on that particular token is high enough. The authors name this as semi-supervised active learning (SeSAL). It combines the benefits of [[UsesMethod::Active Learning]] and [[UsesMethod::Bootstrapping]], which are labeling only examples with high utility and minimizing annotation effort by partially labeling examples where the model is confident about the prediction. In pseudocode, the following shows the steps that are added to the FuSAL:
+For a database record with token sequence and label sequence <math>(\mathbf{x}_{1}, \mathbf{y}_{1})=(<{x}_{1}[1], {x}_{1}[2], ..., {x}_{1}[m]>, <{y}_{1}[1], ..., {y}_{1}[m]>)</math>, a text sequence <math>\mathbf{x}_{2}=<{x}_{2}[1], {x}_{2}[2], ..., {x}_{2}[m]></math> and an alignment sequence <math>\mathbf{a}=<a_1, ..., a_n></math> where <math> \quad a_{i}=j </math> indicates <math>\quad(x_1[j], y_1[j]) </math> is assigned to the token <math>\quad x_{2}[i]</math>, the alignment model defines a probability distribution of the alignment sequence conditioned on the database record and the text sequence <math>(\mathbf{x}_{1}, \mathbf{y}_{1}, \mathbf{x}_{2})</math> as
-.1 For each example <math>p_{i}\quad</math>
+<math>
+p_{\Theta}(\mathbf{a}\vert\mathbf{x}_{1}, \mathbf{y}_{1}, \mathbf{x}_{2};\Theta)
+={exp(\sum_{t=1}^{n}{\Theta^{\mathrm{T}}\vec{f}(a',\mathbf{x}_1,\mathbf{y}_1,\mathbf{x}_2,t)})\over Z_{\Theta}(\mathbf{x}_1,\mathbf{y}_1,\mathbf{x}_2)}
+</math>
-.1.1 For each token <math>x_{j}\quad</math> and the most likely label <math>y_{j}^{*}\quad</math>
+<math>
+p_{\Theta}(a_{t}\vert\mathbf{x}_{1}, \mathbf{y}_{1}, \mathbf{x}_{2};\Theta)
+={exp(\Theta^{\mathrm{T}}\vec{f}(a_t,\mathbf{x}_1,\mathbf{y}_1,\mathbf{x}_2,t))\over
+exp(\sum_{a'}{\Theta^{\mathrm{T}}\vec{f}(a_t,\mathbf{x}_1,\mathbf{y}_1,\mathbf{x}_2,t)})}
+</math>
-.1.1.1 Compute the model's confidence in the predicted label <math>C_{\mathbf{\lambda}}(y_{j}^{*})=P_{\mathbf{\lambda}}(y_{j}=y_{j}^{*}\vert\mathbf{x})</math>
+<math>
+{\partial p_{\Theta}(a_t\vert\mathbf{x}_{1}, \mathbf{y}_{1}, \mathbf{x}_{2};\Theta) \over \partial \Theta}
+=p_{\Theta}(a_{t}\vert\mathbf{x}_{1}, \mathbf{y}_{1}, \mathbf{x}_{2};\Theta)
+[
+\vec{f}(a_t,\mathbf{x}_1,\mathbf{y}_1,\mathbf{x}_2,t)
+-E_{p_{\Theta}(a)}(\vec{f}(a,\mathbf{x}_1,\mathbf{y}_1,\mathbf{x}_2,t))
+]
+</math>
-.1.2 Remove all labels whose confidence is lower than some threshold <math>t</math>
+[[File:Bellare 1.png]]
-Since there is a bootstrapping element in the method, the size of the seed set is also important. Therefore the authors suggest running FuSAL several iterations before switching to SeSAL.
+The expectation criteria is determined in the following manner: for expectation features, for each label, the top N extraction features are selected by mutual information with that label. Also, the top N alignment features that have highest mutual information with correct labeling are selected as alignment criteria. The target expectations of these criteria are binned into 11 bins [0.05, 0.1, 0.2, ..., 0.9, 0.95]. [[http://www.cs.umass.edu/~kedarb/dbie_expts.txt]] is a complete list of expectation criteria.
-== Experimental Result ==
+Given expectation criteria  <math>\mathcal{C} = <\mathbf{F}, \mathbf{P}, \mathbf{W}></math> where <math>\mathbf{F}=<f_1,...,f_l></math> is a list of binary feature functions, <math>\mathbf{P}=<p_1,...,p_l></math> is target expectations, <math>\mathbf{W}=<w_1,...,w_l></math> is weights, the objective function to optimize is the following:
+<math>
+\mathcal{O}(\theta;\mathcal{D},\mathcal{C})=
+\max_{\Theta}{
+  -\sum_{i=1}^{l}{\Delta(f_{i}, p_{i}, w_{i}, \Theta)} - {{||\Theta||^{2}}\over{2}}
+  }
+</math>
+where <math>
+\Delta(f, p, w, \Theta) = w( {E_{p_{\Theta}}(\mathbf{A}_{f}) \over |\mathbf{A}_{f}|} - p)^{2}
+</math> is the squared divergence and
+<math>
+\mathbf{A}_{f}
+</math> is the alignment latent variables to apply expectation criteria. The authors use the L-BFGS algorithm to maximize the objective function.
+=== ExtrCRF ===
+This model, whose probability is denoted by <math>p_{\Lambda}</math>, is trained by minimizing the following objective function using L-BFGS.
-The authors tested this method on [[UsesDataset::MUC]]-7 and the oncology part of [[UsesDataset::PennBioIE]] corpus. The base learner used for the experiment is a linear-chain [[UsesMethod::Conditional Random Fields]]. Features used are orthographical features (regexp patterns), lexical and morphological features (prefix, suffix, lemmatized tokens), and contextual features (features of neighbor tokens). In terms of the number of tokens that had to be labled to reach the maximal F-score, SeSAL could save about 60% over FuSAL, and 80% over random sampling. Having high confidence was also important because it could save the model from making errors in the early stages.
+<math>
+O(\Lambda;\mathcal{D},\Theta)=\min_{\Lambda}{\sum_{i=1}^{K}{KL(p_{\Theta}(\mathbf{y}\vert\mathbf{X}_{2}^{(i)})\vert\vert p_{\Lambda}(\mathbf{y}\vert\mathbf{X}_{2}^{(i)}))}+||\Lambda||^{2}/2}
+</math>
-== Related papers ==
+== Experimental Result ==
-* [[RelatedPaper::Muslea, Minton and Knoblock, ICML 2002]]
+The authors tested this method on [[UsesDataset::DBLP]] bibliographic database on a pruned label set (author, title, date, venue, volume, number, pages, editor, publisher, series, O). AlignCRF clearly outperforms other models, and ExtrCRF also achieves an error reduction of 20~35% compared to other methods. An interesting thing to note is that there is no evident decrease in the performance of AlignCRF compared to ExtrCRF, although AlignCRF is not using DB records. This is due to the benefit of having higher order (first-order) model and using noisy DB records in the test set for alignment.
-* [[RelatedPaper::McCallum and Ngiam, ICML 98]]
+[[File:Bellare 2.png]]
+[[File:Bellare 3.png]]
-== Comment ==
+== Related papers ==
-If you're further interested in active learning for NLP, you might want to see Burr Settles' review of active learning: http://active-learning.net/  --[[User:Brendan|Brendan]] 22:51, 13 October 2011 (UTC)
+* [[RelatedPaper::Sarawagi_and_Cohen_NIPS_2004]]
+* [[RelatedPaper::Mann and McCallum, ACL 08]]

Difference between revisions of "Bellare 2009 generalized expectation criteria for bootstrapping extractors using record text alignment"

Latest revision as of 13:09, 2 November 2011

Contents

Citation

Online version

Summary

Brief description of the method

Features

AlignCRF

ExtrCRF

Experimental Result

Related papers

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools