== Citation ==

Use of Support Vector Machines in Extended Named Entity Recognition, Takeuchi and Collier, CoNLL 2002

== Online Version ==

[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20.4019&rep=rep1&type=pdf Here] is the online version of the paper.
== Summary ==
 
This [[category::paper]] explores the use of [[Support Vector Machines]] (SVMs) for an extended [[AddressesProblem::Named Entity Recognition|named entity]] (NE) task and compares its performance with a standard [[HMM]] bigram model. The authors distinguish between traditional NE and ''extended'' NE (referred to as ''NE+''), the latter also capturing types, i.e. instances of conceptual classes, as well as individuals. NE's main role, identifying expressions such as the names of people, places, and organizations, is hard to accomplish with traditional NLP because such expressions come in an infinite variety and new ones are constantly being invented. Expressions of the ''NE+'' kind require richer contextual evidence than regular NEs do, e.g. knowledge of the head noun or the predicate.
  
 
The authors implement and compare two learning methods ([[Support Vector Machines|SVM]] and [[HMM]]) and test them on two datasets.
=== [[UsesMethod::Support Vector Machines|SVM]] ===
[[Support Vector Machines|SVMs]] are known to handle large feature sets robustly and to produce models that generalize well, which makes them well suited to the ''NE+'' task. In the implementation, each training pattern is given as a vector representing certain lexical features and a context. The lexical features include surface word forms, part of speech, orthographic features, and previous word class tags; the orthographic features are the ones described in [[Collier et al., 2000]]. The full context window considered in the experiments is <math>\pm3</math> words about the focus word. For ''NE+'' chunk identification, each word is assigned a tag from <math>\{I\_C_{t}, B\_C_{t}, O\}</math>, where <math>C_{t}</math> is the class, <math>B</math> marks the beginning of a chunk, <math>I</math> marks a word inside a chunk, and <math>O</math> marks a word outside any chunk, i.e. not a member of the given class. Two versions of the [[Support Vector Machines|SVM]] were implemented: <math>\text{SVM}^{1}</math> uses the <math>\pm3</math> window about the focus word and a polynomial kernel function, while <math>\text{SVM}^{2}</math> uses only features of the focus word and the previous word.
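The following toy sketch (not the authors' code; it assumes scikit-learn, an illustrative sentence, and a hypothetical PER class) shows the general shape of the setup: BIO-style tags, simple surface and orthographic features collected over a <math>\pm3</math> window, and a polynomial-kernel SVM.

<pre>
# Minimal sketch of windowed features + BIO tags + a polynomial-kernel SVM.
# Assumptions: scikit-learn is available; the sentence, tag set ("PER") and
# features below are toy stand-ins, not the paper's actual data or feature set.
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import SVC

sentence = ["John", "Smith", "visited", "Pittsburgh", "."]
tags     = ["B_PER", "I_PER", "O", "O", "O"]   # {B_Ct, I_Ct, O} scheme

def word_features(tokens, i, window=3):
    """Surface form and a simple orthographic cue for each position in +/-window."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset}]"] = tokens[j].lower()
            feats[f"is_cap[{offset}]"] = tokens[j][0].isupper()
    return feats

X_dicts = [word_features(sentence, i) for i in range(len(sentence))]
X = DictVectorizer().fit_transform(X_dicts)

# SVM^1-style model: polynomial kernel over the windowed features.
clf = SVC(kernel="poly", degree=2)
clf.fit(X, tags)
print(clf.predict(X))
</pre>

The real system additionally uses part-of-speech tags and the class tags already assigned to previous words, and <math>\text{SVM}^{2}</math> restricts the window to the focus word and the previous word only.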
=== [[UsesMethod::HMM]] ===
The [[HMM]] considered here is the one fully described in [[Collier et al., 2000]]. It is a linear interpolating [[HMM]] trained using maximum likelihood estimates from bigrams of the surface word and an orthographic feature chosen deterministically.  
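As a small illustration of the linear-interpolation idea (a simplified sketch, not the exact back-off scheme of [[Collier et al., 2000]]; the corpus, weights, and back-off levels below are assumptions made for the example), bigram maximum likelihood estimates can be smoothed as follows:

<pre>
# Toy sketch of linearly interpolated bigram estimates.
# Assumptions: the corpus, the lambda weights and the two back-off levels
# (unigram, uniform) are illustrative, not the scheme actually used in the paper.
from collections import Counter

corpus = "the cell expresses the protein and the cell divides".split()

bigrams  = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
V = len(unigrams)

def p_interp(w_prev, w, lam=(0.7, 0.2, 0.1)):
    """P(w | w_prev) as a mixture of bigram MLE, unigram MLE and a uniform floor."""
    p_bigram  = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
    p_unigram = unigrams[w] / len(corpus)
    p_uniform = 1.0 / V
    return lam[0] * p_bigram + lam[1] * p_unigram + lam[2] * p_uniform

print(p_interp("the", "cell"))      # well-attested bigram: high probability
print(p_interp("cell", "protein"))  # unseen bigram still gets non-zero mass
</pre>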
  
The results show that the [[Support Vector Machines|SVM]] outperforms the [[HMM]] by a significant margin on both the [[UsesDataset::MUC-6]] and [[UsesDataset::Bio1]] [[Dataset|datasets]] when it is given a wide context window (<math>\pm3</math>) and a rich feature set. The authors also observe that the [[Support Vector Machines|SVM]] lacked sufficient knowledge of the complex structures in ''NE+'' expressions to achieve its best performance on [[UsesDataset::Bio1]].
 
== Experimental Results ==

Results are given as F-scores. The following table shows the overall F-scores for the three models and the two collections, calculated using 10-fold cross-validation on the total test collection. <math>^{\dagger}</math> marks results for models using surface word and orthographic features but not POS features; <math>^{\ddagger}</math> marks results for models using surface word, orthographic, and POS features.
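For reference (and assuming the standard balanced F-score, <math>F_{\beta=1}</math>, as is usual for this task), the F-score combines precision <math>P</math> and recall <math>R</math> as <math>F = \frac{2PR}{P + R}</math>.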
[[File:Table1 Collier.png]]
 
There is a clear and sustained advantage for <math>\text{SVM}^{1}</math> over the HMM on the NE task in MUC-6 and the NE+ task in Bio1. The only drawback observed with <math>\text{SVM}^{2}</math> was that it seemed quite weak on very low frequency classes. However, the results suggest that, by exploiting the SVM's capability to handle large feature sets easily, including a wide context window and POS tags, the SVM will perform at a significantly higher level than the HMM.
 
== Related papers ==
 
This paper compares and contrasts the SVM with the HMM implementation in [[Collier et al., 2000]].
