Mayfield et al, CoNLL 2003

== Citation ==

James Mayfield, Paul McNamee, and Christine Piatko. 2003. Named Entity Recognition using Hundreds of Thousands of Features. In Proceedings of CoNLL-2003.

== Online version ==

[1]

== Summary ==
 
In this [[Category::paper]] the authors address the problem of [[AddressesProblem::Named Entity Recognition]] using [[UsesMethod::Support Vector Machines]] to capture transition probabilities in a lattice, a method they call SVM-lattice. The main goal is to provide a language-independent [[AddressesProblem::Named Entity Recognition]] system that considers hundreds of thousands of language-independent features and lets the [[UsesMethod::Support Vector Machines|SVM]] decide which of them are relevant.
  
In most [[AddressesProblem::Named Entity Recognition|NER]] systems, handling large numbers of features is expensive and can lead to overtraining, which calls for careful, informed feature selection. The solution proposed by the authors is to build a lattice for each sentence (where the vertices are the name tags and the edges are the possible transitions) and to compute the transition probabilities of the edges. Once these transitions are computed, the authors apply the [[UsesMethod::Viterbi]] algorithm to find the best path and decide on the tag sequence for the sentence.
  
This strategy outperforms the [[UsesDataset::CoNLL-X|CoNLL-2003 Shared Task]] baseline as well as a classic [[HMM]] approach. The fact that it can include many features with little concern for their relevance or independence makes this solution particularly well suited for language-neutral entity recognition.
  
 
== Brief Description of the Method ==
 
A lattice is built for each sentence, where each column contains one vertex for each possible tag, and each vertex is connected by an edge to every vertex in the next column that represents a valid transition. To compute these transition probabilities, the authors exploit two important properties of [[UsesMethod::Support Vector Machines|SVM]]s: they can handle very high-dimensional feature spaces and they are resistant to overfitting. Once the transition probabilities are estimated and attached to the lattice, the authors run the [[UsesMethod::Viterbi]] algorithm to find the most likely path, which identifies the final tag for each word of the sentence.
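
A minimal Python sketch of how such a per-sentence lattice could be represented is shown below; the tag set and the transition rule are illustrative assumptions, not the paper's actual ones:

<pre>
# Illustrative sketch (not the paper's code): one lattice column per word,
# one vertex per tag, and edges only for transitions considered valid here.
TAGS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

def is_valid_transition(prev_tag, next_tag):
    # Assumption: an "I-X" tag may only follow "B-X" or "I-X" of the same type.
    if next_tag.startswith("I-"):
        entity = next_tag[2:]
        return prev_tag in ("B-" + entity, "I-" + entity)
    return True

def build_lattice(sentence):
    """Return, for each word position, the list of allowed (prev_tag, tag) edges."""
    lattice = []
    for position, _word in enumerate(sentence):
        if position == 0:
            edges = [(None, tag) for tag in TAGS if not tag.startswith("I-")]
        else:
            edges = [(prev, tag) for prev in TAGS for tag in TAGS
                     if is_valid_transition(prev, tag)]
        lattice.append(edges)
    return lattice
</pre>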
  
However, standard [[UsesMethod::Support Vector Machines|SVM]]s do not provide such probabilities. To obtain them, the authors use a parametric model to fit the posterior <math>P(y=1|f)</math> directly, where <math>f</math> is the margin produced by the classifier. Since the class-conditional densities between the margins of an [[UsesMethod::Support Vector Machines|SVM]] are exponential, applying Bayes' rule to two exponentials suggests a parametric sigmoid form, so the posterior can be written as:
  
 
<math>
P(y=1|f) = \frac{1}{1 + e^{Af + b}}
</math>
  
where the authors fix <math>A=-2</math> and <math>b=0</math>, while noting that estimating these parameters from data could give better results.
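
As a small illustration of this mapping, the Python snippet below simply evaluates the sigmoid above with the fixed parameters <math>A=-2</math> and <math>b=0</math> (the function name is ours, not the paper's):

<pre>
import math

def margin_to_probability(f, A=-2.0, b=0.0):
    """Map an SVM margin f to the probability estimate P(y=1|f)."""
    return 1.0 / (1.0 + math.exp(A * f + b))

# A margin of 0 maps to 0.5; large positive margins approach 1.
print(margin_to_probability(0.0))   # 0.5
print(margin_to_probability(1.5))   # ~0.95
</pre>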
  
A different SVM model is trained for each transition type. To classify a given sentence after training, each word of the input is represented by a vector of features (such as the word itself, character n-grams, word length and position in the sentence, capitalization pattern, etc.). Each classifier (one per transition type) is then applied to this vector to produce a margin, which is mapped to a probability estimate as described above. Once all the probabilities have been computed, the [[UsesMethod::Viterbi]] algorithm is run to find the best tag sequence.
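
The following Python sketch shows the kind of per-word feature vector described above; the exact feature inventory, names, and encoding are illustrative assumptions, and the paper's real feature set is far larger:

<pre>
def word_features(sentence, position, max_ngram=4):
    """Build a sparse feature set for one word: surface form, character n-grams,
    length, position, and a coarse capitalization pattern (illustrative only)."""
    word = sentence[position]
    features = {
        "word=" + word.lower(),
        "length=" + str(len(word)),
        "position=" + str(position),
        "cap=" + "".join("X" if c.isupper() else "x" if c.islower() else "0"
                         for c in word),
    }
    for n in range(2, max_ngram + 1):
        for i in range(len(word) - n + 1):
            features.add("ngram=" + word[i:i + n].lower())
    return features

# Example: features for "Pittsburgh" in a short sentence.
print(sorted(word_features(["John", "lives", "in", "Pittsburgh"], 3)))
</pre>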
 
  
 
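Once the per-edge probabilities are in place, the decoding step can be sketched as a standard Viterbi search over the lattice. The interface assumed below (a transition_prob function returning the classifier-derived probability for each edge) is ours, not the paper's, and for simplicity it scores all tag pairs rather than only the valid lattice edges:

<pre>
import math

def viterbi_decode(words, tags, transition_prob):
    """Return the most likely tag sequence under per-edge transition probabilities.

    transition_prob(position, prev_tag, tag) is assumed to return the probability
    estimate produced by the corresponding classifier (prev_tag is None at the
    start of the sentence); probabilities are assumed to be nonzero.
    """
    # best[t][tag] = (log-probability of the best path ending in tag, back-pointer)
    best = [{} for _ in words]
    for tag in tags:
        best[0][tag] = (math.log(transition_prob(0, None, tag)), None)
    for t in range(1, len(words)):
        for tag in tags:
            score, prev = max(
                (best[t - 1][p][0] + math.log(transition_prob(t, p, tag)), p)
                for p in tags
            )
            best[t][tag] = (score, prev)
    # Backtrack from the highest-scoring tag in the last column.
    tag = max(tags, key=lambda x: best[-1][x][0])
    path = [tag]
    for t in range(len(words) - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return list(reversed(path))
</pre>
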
== Experimental Results ==
 
The authors evaluate their approach using the [[UsesDataset::CoNLL-X|CoNLL-2003]] training and test sets. Four entity taggers are compared: the TnT tagger [http://arxiv.org/pdf/cs/0003055], TnT + subcat (which reduces the set of possible categories, and therefore transitions, by removing the less common ones), SVM-Lattice (which uses the same categories as TnT + subcat), and SVM-Lattice+ (which uses the output of the TnT + subcat version as a feature). The F-measure results for these tests are presented in the following table:
  
 
[[File:MayfieldResults.png]]
 
The improvement from TnT to TnT + subcat is due merely to the reduction in the number of classes considered. SVM-Lattice achieves a slightly larger improvement over the TnT tagger, and the final result is an improvement of about 3% over the TnT tagger.

The authors do not comment on the significance of these results, and do not present results for the SVM-Lattice method with the complete set of standard tags used in NER competitions.

== Related Papers ==

[[RelatedPaper::Thorsten Brants. 2000.]] TnT – a statistical part-of-speech tagger. In ANLP 6, pages 224–231. [http://arxiv.org/pdf/cs/0003055]
