Mayfield et al, CoNLL 2003

From Cohen Courses
== Citation ==
 
James Mayfield, Paul McNamee, and Christine Piatko. 2003. Named Entity Recognition using Hundreds of Thousands of Features. In Proceedings of CoNLL-2003.
  
 
== Online version ==
 
[http://www.cnts.ua.ac.be/conll2003/pdf/18487may.pdf]
  
 
== Summary ==
 
In this [[Category::paper]] the authors address the problem of [[AddressesProblem::Named Entity Recognition]] using [[UsesMethod::Support Vector Machines]] to capture transition probabilities in a lattice, a method they call SVM-lattice. The main goal is to provide a language-independent [[AddressesProblem::Named Entity Recognition|NER]] system that considers hundreds of thousands of language-independent features, leaving it to the [[UsesMethod::Support Vector Machines|SVM]]s to decide which of them are relevant.
 
 
In most [[AddressesProblem::Named Entity Recognition|NER]] systems, handling large numbers of features is expensive and can result in overtraining, so careful, informed feature selection is usually required. The solution proposed by the authors is to build a lattice for each sentence (where the vertices are the name tags and the edges are the possible transitions) and compute the transition probabilities of the edges. Once these transitions are computed, the authors apply the [[UsesMethod::Viterbi]] algorithm to find the best path and decide on the set of tags for the sentence.
  
This strategy outperforms the [[UsesDataset::CoNLL-X|CoNLL-2003 Shared Task]] baseline as well as a classic [[HMM]] approach. The fact that it can include many features with little concern for their relevance or independence makes this solution particularly well suited to language-neutral entity recognition.
  
 
== Brief Description of the Method ==
 
A lattice is built for each sentence, where each column contains one vertex for each possible tag, connected by an edge to every vertex in the next column that represents a valid transition. To compute these transitions, the authors exploit two important properties of [[UsesMethod::Support Vector Machines|SVM]]s: they can handle very high-dimensional spaces, and they are resistant to overfitting. Once these transitions are estimated and applied to the lattice, the authors run [[UsesMethod::Viterbi]] to find the most likely path, which determines the final tag for each word of the sentence.
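The decoding step described above can be sketched as follows. This is a minimal illustration, not the paper's actual model: the tag set is a toy subset and the transition probabilities are made up for the example.

```python
import math

# Toy tag set (a small, hypothetical subset of the CoNLL-2003 tags).
TAGS = ["O", "B-PER", "I-PER"]

def viterbi(n_words, trans_prob):
    """Find the most likely tag path through a sentence lattice.

    trans_prob(pos, prev_tag, tag) -> probability of the edge into `tag`
    at word `pos`, given `prev_tag` ("START" for the first word).
    """
    # best[t] = (log-probability, path) of the best path ending in tag t
    best = {t: (math.log(trans_prob(0, "START", t)), [t]) for t in TAGS}
    for pos in range(1, n_words):
        new_best = {}
        for t in TAGS:
            # Pick the predecessor tag that maximizes the path score.
            score, path = max(
                (lp + math.log(trans_prob(pos, prev, t)), p + [t])
                for prev, (lp, p) in best.items()
            )
            new_best[t] = (score, path)
        best = new_best
    return max(best.values())[1]

# Hypothetical transition table for the two-word sentence "John lives";
# unlisted edges get a small smoothing probability.
def trans_prob(pos, prev, tag):
    table = {(0, "START", "B-PER"): 0.8, (1, "B-PER", "O"): 0.9}
    return table.get((pos, prev, tag), 0.01)

print(viterbi(2, trans_prob))  # -> ['B-PER', 'O']
```

In the real system the edge probabilities come from the calibrated SVM classifiers rather than from a fixed table.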
  
However, standard [[UsesMethod::Support Vector Machines|SVM]]s do not provide such probabilities. To obtain them, the authors used a parametric model to fit the posterior <math>P(y=1|f)</math> directly, where <math>f</math> is the margin produced by the classifier. Since the class-conditional densities between the margins of an [[UsesMethod::Support Vector Machines|SVM]] are exponential, and Bayes' rule applied to two exponentials suggests a parametric sigmoid, the posterior can be written as:
 
 
<math>
P(y=1|f) = \frac{1}{1 + \exp(Af + b)}
</math>
  
where the authors fix <math>A=-2</math> and <math>b=0</math>, acknowledging that estimating these parameters could give better results.
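With the parameters fixed at these values, mapping an SVM margin to a probability estimate is a one-line computation; a small sketch (the function name is mine, not the paper's):

```python
import math

def margin_to_prob(f, A=-2.0, b=0.0):
    """Map an SVM margin f to P(y=1|f) via the parametric sigmoid,
    with the paper's fixed parameters A=-2 and b=0 as defaults."""
    return 1.0 / (1.0 + math.exp(A * f + b))

# A point on the decision boundary (f = 0) maps to 0.5; large positive
# margins map close to 1, large negative margins close to 0.
print(margin_to_prob(0.0))  # -> 0.5
print(margin_to_prob(3.0))  # close to 1
print(margin_to_prob(-3.0)) # close to 0
```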
  
In order to classify a given sentence after training, each word of the input is represented by a vector of features (such as the word itself, character n-grams, word length and position in the sentence, capitalization pattern, etc.). Each classifier (one per transition type) is applied to this vector to produce a margin, which is then mapped to a probability estimate. Once all the probabilities have been computed, the [[UsesMethod::Viterbi]] algorithm is run.
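The kind of per-word feature vector described above can be sketched as follows; the feature inventory here is illustrative only, not the paper's full list of hundreds of thousands of features.

```python
def word_features(words, i):
    """Extract an illustrative sparse feature set for word i of a sentence."""
    w = words[i]
    feats = {f"word={w.lower()}", f"len={len(w)}", f"pos={i}"}
    # Capitalization pattern, e.g. "Aaaa" for "John", "AAA" for "IBM".
    pattern = "".join("A" if c.isupper() else "a" if c.islower() else "0"
                      for c in w)
    feats.add(f"shape={pattern}")
    # Character n-grams (here: trigrams) over the padded word.
    padded = f"<{w.lower()}>"
    feats.update(f"tri={padded[j:j+3]}" for j in range(len(padded) - 2))
    return feats

print(sorted(word_features(["John", "lives"], 0)))
```

Each such sparse feature set would then be fed to the per-transition classifiers; the SVMs' resistance to high dimensionality is what makes this "throw everything in" approach workable.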
== Experimental Results ==
 
The authors evaluated their approach on the [[UsesDataset::CoNLL-X|CoNLL-2003]] training and test sets. Four entity taggers are compared: the TnT tagger [http://arxiv.org/pdf/cs/0003055], TnT+subcat (which reduces the set of possible categories, and therefore transitions, by removing the less common ones), SVM-Lattice (which uses the same categories as TnT+subcat), and SVM-Lattice+ (which uses the output of TnT+subcat as a feature). The F-measure results for these tests are presented in the following table:
 
 
[[File:MayfieldResults.png]]

The improvement from TnT to TnT+subcat is due merely to the reduction in the number of categories considered. SVM-Lattice, however, achieves a slightly larger improvement over the TnT tagger, and the final configuration yields an improvement of about 3% over the TnT tagger.
  
The authors do not comment on the statistical significance of these results, and do not present results for the SVM-Lattice method with the complete set of standard tags used in NER competitions.
  
== Related Papers ==
[[RelatedPaper::Thorsten Brants. 2000.]] TnT – a statistical part-of-speech tagger. In ANLP 6, pages 224–231.[http://arxiv.org/pdf/cs/0003055]

Latest revision as of 11:29, 7 October 2011
