Bellare 2009 generalized expectation criteria for bootstrapping extractors using record text alignment

A summary is coming soon from [[User::Daegunw]]!

== Citation ==

{{MyCiteconference | booktitle = Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing| coauthors = A. McCallum| date = 2009| first = K.| last = Bellare| pages = 131-140| title = Generalized Expectation Criteria for Bootstrapping Extractors using Record-Text Alignment| url = http://www.cs.umass.edu/~kedarb/papers/dbie_ge_align.pdf }}

== Online version ==

This [[Category::Paper]] is available online [http://www.cs.umass.edu/~kedarb/papers/dbie_ge_align.pdf].
== Summary ==

This [[Category::paper]] presents an [[UsesMethod::Active Learning]] approach that is not fully supervised. The authors propose a semi-supervised approach in which annotators are asked to label only parts of the selected sequences. Assuming that even a sequence that is uncertain as a whole contains subsequences whose labels the model is confident about, the method requests labels only for the subsequences the model is uncertain about; the rest is labeled by the current classifier. In their experiments this approach saved about 50-60% of the annotation labor over fully supervised active learning in sequence labeling settings.
+ | |||
== Brief description of the method ==

The method is a fairly simple extension of standard active learning. The following figure describes the general active learning framework.

[[File:Tomanek ACL2009.png]]

The authors refer to the usual active learning mode as fully supervised active learning (FuSAL). The utility function used in FuSAL is
+ | |||
<math>U_{\mathbf{\lambda}}(\mathbf{x}) = 1 - P_{\mathbf{\lambda}}(\mathbf{y}^{*}\vert\mathbf{x})</math>
+ | |||
which makes the sampling method an uncertainty sampling method.
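As a concrete illustration (not code from the paper), uncertainty sampling with this utility just ranks unlabeled examples by <math>1 - P_{\mathbf{\lambda}}(\mathbf{y}^{*}\vert\mathbf{x})</math>; here the posterior scores are made-up stand-ins for a real model's probability of its best label sequence:

```python
# Minimal sketch of uncertainty sampling with the FuSAL utility
# U(x) = 1 - P(y* | x). The p_y_star values below are invented
# stand-ins for a trained model's sequence posterior.

def utility(p_best_sequence):
    """FuSAL utility: higher when the model is less certain."""
    return 1.0 - p_best_sequence

def select_most_useful(pool, k):
    """Pick the k unlabeled examples with the highest utility."""
    ranked = sorted(pool, key=lambda ex: utility(ex["p_y_star"]), reverse=True)
    return ranked[:k]

pool = [
    {"id": "s1", "p_y_star": 0.95},  # model is confident -> low utility
    {"id": "s2", "p_y_star": 0.40},  # model is uncertain -> high utility
    {"id": "s3", "p_y_star": 0.70},
]
chosen = select_most_useful(pool, 2)
print([ex["id"] for ex in chosen])  # -> ['s2', 's3']
```

The selected examples would then be sent to a human annotator, labeled in full, and added to the training set, as in the framework figure above.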
+ | |||
The problem with FuSAL in the sequence labeling scenario is that an example with high overall utility can still contain parts that the current model labels very well, and which therefore contribute little to the utility of the whole. This means we can keep the labels the current model assigned wherever its confidence on a particular token is high enough. The authors call this semi-supervised active learning (SeSAL). It combines the benefits of [[UsesMethod::Active Learning]] and [[UsesMethod::Bootstrapping]]: labeling only examples with high utility, and minimizing annotation effort by partially labeling examples where the model is confident about its predictions. In pseudocode, the following steps are added to FuSAL:
+ | |||
3.1 For each example <math>p_{i}\quad</math>

3.1.1 For each token <math>x_{j}\quad</math> and the most likely label <math>y_{j}^{*}\quad</math>

3.1.1.1 Compute the model's confidence in the predicted label <math>C_{\mathbf{\lambda}}(y_{j}^{*})=P_{\mathbf{\lambda}}(y_{j}=y_{j}^{*}\vert\mathbf{x})</math>

3.1.2 Remove all labels whose confidence is lower than some threshold <math>t</math>
+ | |||
Since there is a bootstrapping element in the method, the size of the seed set is also important. The authors therefore suggest running FuSAL for several iterations before switching to SeSAL.
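Steps 3.1-3.1.2 above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the per-token confidences are assumed to come from a trained sequence model's marginals <math>P_{\mathbf{\lambda}}(y_{j}=y_{j}^{*}\vert\mathbf{x})</math>, and the token/label values are invented:

```python
# Illustrative SeSAL labeling step: keep the model's label for tokens it is
# confident about, and mark the rest (None) for human annotation.

def sesal_partial_labels(tokens, predicted_labels, confidences, t=0.9):
    """For each token, keep the predicted label if its confidence is at
    least t; otherwise mark it as needing a manual label (None)."""
    partial = []
    for tok, label, conf in zip(tokens, predicted_labels, confidences):
        if conf >= t:
            partial.append((tok, label))   # trust the current model
        else:
            partial.append((tok, None))    # query the annotator
    return partial

tokens = ["John", "visited", "Boston", "yesterday"]
labels = ["B-PER", "O", "B-LOC", "O"]
confs  = [0.98, 0.99, 0.55, 0.97]
print(sesal_partial_labels(tokens, labels, confs, t=0.9))
# -> only "Boston" is left for the annotator
```

With a high threshold <math>t</math>, only a few low-confidence tokens per selected sequence need manual labels, which is where the reported annotation savings come from.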
+ | |||
== Experimental Result ==

The authors tested this method on [[UsesDataset::MUC]]-7 and the oncology part of the [[UsesDataset::PennBioIE]] corpus. The base learner used for the experiments is a linear-chain [[UsesMethod::Conditional Random Fields]] model. The features used are orthographic features (regexp patterns), lexical and morphological features (prefix, suffix, lemmatized tokens), and contextual features (features of neighboring tokens). In terms of the number of tokens that had to be labeled to reach the maximal F-score, SeSAL saved about 60% over FuSAL and 80% over random sampling. Requiring high confidence was also important, because it kept the model from making errors in the early stages.
+ | |||
== Related papers ==

* [[RelatedPaper::Muslea, Minton and Knoblock, ICML 2002]]
* [[RelatedPaper::McCallum and Ngiam, ICML 98]]

== Comment ==

If you're further interested in active learning for NLP, you might want to see Burr Settles' review of active learning: http://active-learning.net/ --[[User:Brendan|Brendan]] 22:51, 13 October 2011 (UTC)
Revision as of 17:57, 31 October 2011