Difference between revisions of "Paper:Collins, ACL 2002"
(4 intermediate revisions by the same user not shown) | |||
Line 4: | Line 4: | ||
[http://www.ai.mit.edu/people/mcollins/papers/finalNEacl2002.ps Here] is the online version of the paper. | [http://www.ai.mit.edu/people/mcollins/papers/finalNEacl2002.ps Here] is the online version of the paper. | ||
== Summary == | == Summary == | ||
− | In this [[Category::paper]] the author describes two algorithms, wich rerank the top N hypotheses from a maximum entropy tagger, and apply them to the task of recovering named-entity boundaries. Since the task at hand can be framed as a tagging task - to tag each word as being either the start of an entity (S), a continuation of an entity (C), or not to be part of ab entity at all (N), the author uses state of the art [[maximum entropy]] tagger similar to the one described in [[Ratnaparkhi EMNLP 1996]] as a baseline model. The same [[maximum entropy | Max-ent]] tagger is used to generate the N=20 most probable segmentations for each input sentence, along with their probabilities. Author's aim is to come up with reranking strategies for the test data candidates in order to improve the precision and recall. | + | In this [[Category::paper]] the author describes two algorithms, wich rerank the top N hypotheses from a maximum entropy tagger, and apply them to the task of [[AddressesProblem::Named Entity Recognition|recovering named-entity boundaries]]. Since the task at hand can be framed as a tagging task - to tag each word as being either the start of an entity (S), a continuation of an entity (C), or not to be part of ab entity at all (N), the author uses state of the art [[maximum entropy]] tagger similar to the one described in [[Ratnaparkhi EMNLP 1996]] as a baseline model. The same [[UsesMethod::maximum entropy | Max-ent]] tagger is used to generate the N=20 most probable segmentations for each input sentence, along with their probabilities. Author's aim is to come up with reranking strategies for the test data candidates in order to improve the precision and recall. |
The author considers various global features for each candidate tagged sequence. Most of these features are anchored on entity boundaries in the candidate segmentation. Each candidate tagged sequence <math>x</math>, proposed by the [[maximum entropy | Max-ent]] tagger, is represented by the log probability <math>L(x)</math> from the tagger, as well as the values of the global features <math>h_s(x)</math> for <math>s = 1</math> ... <math>m</math>, where <math>m</math> are the no. of global features. The two algorithms described in the next section blend these two sources of information (global features and log probability) to improve upon a strategy which just takes the candidate from the tagger with the highest score for <math>L(x)</math>. | The author considers various global features for each candidate tagged sequence. Most of these features are anchored on entity boundaries in the candidate segmentation. Each candidate tagged sequence <math>x</math>, proposed by the [[maximum entropy | Max-ent]] tagger, is represented by the log probability <math>L(x)</math> from the tagger, as well as the values of the global features <math>h_s(x)</math> for <math>s = 1</math> ... <math>m</math>, where <math>m</math> are the no. of global features. The two algorithms described in the next section blend these two sources of information (global features and log probability) to improve upon a strategy which just takes the candidate from the tagger with the highest score for <math>L(x)</math>. | ||
Line 17: | Line 17: | ||
The ranking function can also be written as <math>F(x,w)=w.h(x)</math> if <math>h(x)</math> is defined as <math>\{L(x),h_1(x) \ldots h_m(x)\}</math>. | The ranking function can also be written as <math>F(x,w)=w.h(x)</math> if <math>h(x)</math> is defined as <math>\{L(x),h_1(x) \ldots h_m(x)\}</math>. | ||
− | === [[Boosting]] for ranking === | + | === [[UsesMethod::Boosting]] for ranking === |
The [[boosting]] algorithm in the paper is the one described in [[Paper:Discriminative Reranking for Natural Language Parsing,Collins 2000 | Collins 2000]], where the method can be considered to be a greedy algorithm for finding the parameters <math>w</math> that minimize the loss function | The [[boosting]] algorithm in the paper is the one described in [[Paper:Discriminative Reranking for Natural Language Parsing,Collins 2000 | Collins 2000]], where the method can be considered to be a greedy algorithm for finding the parameters <math>w</math> that minimize the loss function | ||
Line 35: | Line 35: | ||
and then makes the update <math>w_{k^*} = w_{k^*} + \delta^*</math>. | and then makes the update <math>w_{k^*} = w_{k^*} + \delta^*</math>. | ||
− | === [[Voted Perceptron]] for ranking === | + | === [[UsesMethod::Voted Perceptron]] for ranking === |
+ | The following steps define the perceptron training algorithm for ranking problems. | ||
+ | |||
+ | '''Define:''' <math>F(x,w)=w.h(x)</math> | ||
+ | |||
+ | '''Input:''' Examples <math>x_{ij}</math> with feature vectors <math>h(x_{ij})</math> | ||
+ | |||
+ | '''Initialization:''' Set parameters <math>w^0=0</math> | ||
+ | |||
+ | '''For''' <math>i = 1 \ldots n</math> | ||
+ | |||
+ | <math>j = argmax_{j=1...n_{i}} F(x_{ij},w^{i-1})</math> | ||
+ | '''If''' <math>(j=1)</math> '''Then''' <math>w^i = w^{i-1}</math> | ||
+ | '''Else''' <math>w^i = w^{i-1} + h(x_{i1}) - h(x_{ij})</math> | ||
+ | |||
+ | '''Output:''' Parameter vectors <math>w^i</math> for <math>i = 1 \ldots n</math> | ||
+ | |||
+ | The algorithm maintains a parameter vector <math>w</math>, which is initially set to be all zeros. The algorithm then makes a pass over the training set, at each training example storing a parameter vector <math>w^i</math> for <math>i = 1 \ldots m</math>. The parameter vector is only modified when a mistake is made on an example. The update here involves adding the difference of the offending examples' representations. In the voted perceptron all parameter vectors <math>w^i</math> for <math>i = 1 \ldots n</math> are stored. Each of these <math>n</math> parameter settings "vote" for a candidate, which is the highest ranking candidate of each parameter setting given by <math>x_k</math>, where <math>k = argmax_{j} w^i.h(x_j)</math>. The candidate with the highest no. of votes is returned as the most likely candidate. | ||
== Experimental Result == | == Experimental Result == | ||
+ | Following table shows the Precision (P), Recall (R) and F-measure (F) for the three tagging methods. Figures in parantheses are the relative improvements in error rate over the [[maximum entropy | Max-ent]] model. All figures are percentages. | ||
+ | [[File:Collins_results_table.png]] | ||
== Related papers == | == Related papers == | ||
+ | [[Paper:Discriminative Reranking for Natural Language Parsing,Collins 2000 | Collins 2000]] used boosting for to rerank the output of an existing probabilistic parser and compared with Markov Random fields. |
Latest revision as of 18:28, 29 October 2011
Contents
Citation
Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron, Collins, ACL 2002
Online Version
Here is the online version of the paper.
Summary
In this paper the author describes two algorithms, wich rerank the top N hypotheses from a maximum entropy tagger, and apply them to the task of recovering named-entity boundaries. Since the task at hand can be framed as a tagging task - to tag each word as being either the start of an entity (S), a continuation of an entity (C), or not to be part of ab entity at all (N), the author uses state of the art maximum entropy tagger similar to the one described in Ratnaparkhi EMNLP 1996 as a baseline model. The same Max-ent tagger is used to generate the N=20 most probable segmentations for each input sentence, along with their probabilities. Author's aim is to come up with reranking strategies for the test data candidates in order to improve the precision and recall.
The author considers various global features for each candidate tagged sequence. Most of these features are anchored on entity boundaries in the candidate segmentation. Each candidate tagged sequence , proposed by the Max-ent tagger, is represented by the log probability from the tagger, as well as the values of the global features for ... , where are the no. of global features. The two algorithms described in the next section blend these two sources of information (global features and log probability) to improve upon a strategy which just takes the candidate from the tagger with the highest score for .
The author has shown that existing reranking methods - boosting and voted perceptron, are useful for a new domain, named-entity tagging. Another contribution is the suggestion of various global features of the candidate segmentations which give improvements on this task.
Brief description of the methods
Pairs of the form , where each is a sentence and each is the correct tag sequence for that sentence, serve as the training data. denotes the set of candidates (top N outputs from the Max-ent tagger) for each , where is the th candidate of th sentence. is the probability that the base model assigns to . Hence =log. The set of global features are represented as for ... . The parameters of the model are a vector of parameters, . The ranking function is given as
The ranking function can also be written as if is defined as .
Boosting for ranking
The boosting algorithm in the paper is the one described in Collins 2000, where the method can be considered to be a greedy algorithm for finding the parameters that minimize the loss function
As an initial step, is set to be
and all other parameters for ... are set to be zero. The algorithm then proceeds for N iterations. At each iteration, a single feature is chosen, and its weight is updated. Suppose the current parameter values are , and a single feature is chosen, its weight being updated through an increment , i.e., . Then the new loss, after this parameter update, will be
where . The boosting algorithm chooses the feature/update pair which is optimal in terms of minimizing the loss function, i.e.,
and then makes the update .
Voted Perceptron for ranking
The following steps define the perceptron training algorithm for ranking problems.
Define:
Input: Examples with feature vectors
Initialization: Set parameters
For
If Then Else
Output: Parameter vectors for
The algorithm maintains a parameter vector , which is initially set to be all zeros. The algorithm then makes a pass over the training set, at each training example storing a parameter vector for . The parameter vector is only modified when a mistake is made on an example. The update here involves adding the difference of the offending examples' representations. In the voted perceptron all parameter vectors for are stored. Each of these parameter settings "vote" for a candidate, which is the highest ranking candidate of each parameter setting given by , where . The candidate with the highest no. of votes is returned as the most likely candidate.
Experimental Result
Following table shows the Precision (P), Recall (R) and F-measure (F) for the three tagging methods. Figures in parantheses are the relative improvements in error rate over the Max-ent model. All figures are percentages.
Related papers
Collins 2000 used boosting for to rerank the output of an existing probabilistic parser and compared with Markov Random fields.