Sha 2003 Shallow Parsing with Conditional Random Fields

Citation

Sha, F. and Pereira, F. 2003. Shallow Parsing with Conditional Random Fields. In Proceedings of HLT-NAACL.

Online version

An online version of this paper is available [1].

Summary

This paper analyzes Conditional Random Fields on the task of shallow parsing. The authors present comparison results against previous methods on both accuracy and training-time efficiency.

Key Contributions

The paper presents the following key findings:

  • The authors show that CRFs outperform, or perform as well as, all reported results on shallow parsing, a representative sequential labeling task
  • They also report that the improved training methods yield substantial efficiency gains, as shown in their experimental results

Conditional Random Fields

The paper focuses on conditional random fields on sequences. Such CRFs define conditional probability distributions p(Y|X) of label sequences given input sequences. The label and input sequences are assumed to have the same length.

A CRF on (X, Y) is specified by a local feature vector f and a weight vector λ. The local features are defined as follows:

Each local feature f(y, y', x, i) is either a state feature s(y, x, i) or a transition feature t(y, y', x, i), where y and y' are labels, x is the input sequence, and i is an input position.

The global feature vector is then the sum of the local features over all positions:

F(y, x) = Σ_i f(y, x, i)

The conditional probability distribution defined by the CRF is then

p_λ(y|x) = exp(λ · F(y, x)) / Z_λ(x),   where Z_λ(x) = Σ_y' exp(λ · F(y', x))
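
To make the normalization term Z_λ(x) concrete, here is a toy brute-force computation that enumerates every label sequence. The score tables are made-up stand-ins for the weighted local features λ · f, and exhaustive enumeration is only feasible for tiny examples; in practice Z is computed efficiently with the forward algorithm.

import itertools, math

# hypothetical scores for 3 positions and 2 labels: state[i][y] and
# trans[y1][y2] stand in for the weighted local features of the formula
state = [[0.5, -0.2], [1.0, 0.3], [-0.4, 0.8]]
trans = [[0.2, -0.1], [0.0, 0.5]]

def score(y):
    # lambda . F(y, x): sum of state scores plus transition scores
    return sum(state[i][y[i]] for i in range(3)) + \
           sum(trans[y[i]][y[i + 1]] for i in range(2))

labelings = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(score(y)) for y in labelings)
probs = {y: math.exp(score(y)) / Z for y in labelings}
print(sum(probs.values()))  # -> 1.0 up to floating-point error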

In a sequential labeling task, we want the most probable label sequence for a given input sequence, computed with the following formula:

ŷ = argmax_y p_λ(y|x) = argmax_y λ · F(y, x)

The decoding process can be done with the Viterbi algorithm.
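
As a concrete illustration of the decoding step, below is a minimal Viterbi decoder for a linear-chain model of this form. The score arrays are hypothetical stand-ins for the weighted local features λ · f, and transitions are assumed position-independent for simplicity; this is a sketch, not the authors' implementation.

import numpy as np

def viterbi(state_scores, trans_scores):
    """Return the highest-scoring label sequence.

    state_scores: (n, k) array; state_scores[i, y] is the summed
        weight of state features for label y at position i.
    trans_scores: (k, k) array; trans_scores[y1, y2] is the summed
        weight of transition features from label y1 to y2.
    """
    n, k = state_scores.shape
    score = np.zeros((n, k))
    back = np.zeros((n, k), dtype=int)
    score[0] = state_scores[0]
    for i in range(1, n):
        # total[y1, y2]: best path ending in y1 at i-1, then stepping to y2
        total = score[i - 1][:, None] + trans_scores + state_scores[i][None, :]
        back[i] = total.argmax(axis=0)
        score[i] = total.max(axis=0)
    # follow the back-pointers from the best final label
    path = [int(score[-1].argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))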

The earliest proposed training method was the iterative scaling algorithm carried over from maximum entropy models. However, later work showed that other optimization methods converge faster, for example the conjugate gradient method. In this paper, the authors test preconditioned conjugate-gradient (CG) and limited-memory quasi-Newton (L-BFGS) methods and show their superior performance on large problems. They also provide a comparison of these algorithms against other methods.
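
For intuition about what the optimizer sees, here is a compact sketch of penalized maximum-likelihood training with an off-the-shelf L-BFGS routine. It uses scipy's general-purpose L-BFGS-B as a stand-in for the authors' preconditioned CG and L-BFGS trainers, and reduces the feature set to one weight per label/observation-predicate pair plus label-pair transition weights; the quadratic penalty is the Gaussian prior discussed below. All names and data here are hypothetical.

import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def neg_log_likelihood(theta, data, n_labels, n_feats, sigma2=10.0):
    """Penalized negative log-likelihood and gradient for a simple
    linear-chain CRF with binary observation features.

    theta packs W (n_labels x n_feats state weights) and
    T (n_labels x n_labels transition weights). data is a list of
    (Phi, y) pairs: Phi is an (n, n_feats) 0/1 matrix of observation
    predicates, y an array of gold labels.
    """
    W = theta[: n_labels * n_feats].reshape(n_labels, n_feats)
    T = theta[n_labels * n_feats :].reshape(n_labels, n_labels)
    gW, gT = np.zeros_like(W), np.zeros_like(T)
    nll = 0.0
    for Phi, y in data:
        n = len(y)
        S = Phi @ W.T                      # (n, n_labels) state scores
        # forward recursion in log space
        alpha = np.zeros((n, n_labels))
        alpha[0] = S[0]
        for i in range(1, n):
            alpha[i] = S[i] + logsumexp(alpha[i - 1][:, None] + T, axis=0)
        # backward recursion in log space
        beta = np.zeros((n, n_labels))
        for i in range(n - 2, -1, -1):
            beta[i] = logsumexp(T + S[i + 1] + beta[i + 1], axis=1)
        logZ = logsumexp(alpha[-1])
        gold = S[np.arange(n), y].sum() + T[y[:-1], y[1:]].sum()
        nll += logZ - gold
        # gradient: expected minus empirical feature counts
        p_state = np.exp(alpha + beta - logZ)          # label marginals
        gW += p_state.T @ Phi
        for i in range(n):
            gW[y[i]] -= Phi[i]
        for i in range(n - 1):
            p_pair = np.exp(alpha[i][:, None] + T + S[i + 1] + beta[i + 1] - logZ)
            gT += p_pair
            gT[y[i], y[i + 1]] -= 1.0
    # Gaussian prior (quadratic penalty) on all weights
    nll += (theta @ theta) / (2 * sigma2)
    grad = np.concatenate([gW.ravel(), gT.ravel()]) + theta / sigma2
    return nll, grad

# hypothetical tiny dataset: 2 sentences, 3 labels, 5 binary predicates
rng = np.random.default_rng(1)
data = [(rng.integers(0, 2, size=(6, 5)), rng.integers(0, 3, size=6)),
        (rng.integers(0, 2, size=(4, 5)), rng.integers(0, 3, size=4))]
theta0 = np.zeros(3 * 5 + 3 * 3)
res = minimize(neg_log_likelihood, theta0, args=(data, 3, 5),
               jac=True, method="L-BFGS-B")
print(res.fun)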

Shallow Parsing

Shallow parsing is also called NP chunking. The input consists of a sequence of words annotated with POS tags, and the task is to label each word with a tag indicating whether it is outside a chunk (O), starts a chunk (B), or continues a chunk (I). The authors use a modified CoNLL dataset, with a development test set from WSJ section 21.
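
For instance, the sentence fragment "He reckons the current account deficit will narrow" would be tagged

He/B reckons/O the/B current/I account/I deficit/I will/O narrow/O

marking the two noun-phrase chunks "He" and "the current account deficit".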

The authors build a chunking CRF for this purpose. The chunking CRF has a factored local feature set: each local feature pairs a predicate on the input, based on the surrounding words and/or their POS tags, with a label or a pair of labels. The feature set is illustrated in the following figure:

[Figure: feature templates pairing word and POS-tag predicates with labels and label pairs]
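
To make the factored form concrete, below is a small sketch of how such features could be generated; the exact templates are illustrative guesses in the spirit of the paper's word/POS window features, not its verified feature set.

def local_features(words, tags, i, y_prev, y):
    """Illustrative factored CRF features at position i: every
    observation predicate is paired with the current label y, and
    the label pair (y_prev, y) is itself a feature."""
    preds = [
        "w0=" + words[i],                                  # current word
        "w-1=" + (words[i - 1] if i > 0 else "<S>"),       # previous word
        "w+1=" + (words[i + 1] if i + 1 < len(words) else "</S>"),
        "t0=" + tags[i],                                   # current POS tag
        "t-1,t0=" + (tags[i - 1] if i > 0 else "<S>") + "_" + tags[i],
    ]
    feats = [p + "|y=" + y for p in preds]                 # state features
    feats.append("y-1,y=" + y_prev + "_" + y)              # transition feature
    return feats

words = ["the", "current", "account", "deficit"]
tags = ["DT", "JJ", "NN", "NN"]
print(local_features(words, tags, 2, "I", "I"))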

The model also includes a Gaussian prior on the weights as a penalty term to reduce overfitting. Evaluation is based on the F-measure between the output label sequences and the gold-standard label sequences. The authors also present significance tests comparing the different chunking systems.
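
Chunk-level F-measure counts a predicted chunk as correct only if both of its boundaries match a gold chunk exactly. A minimal sketch for the single-type NP case (a hypothetical helper, not the official evaluation script):

def chunks(bio):
    """Extract (start, end) spans from a B/I/O tag sequence."""
    spans, start = set(), None
    for i, t in enumerate(bio):
        if t == "B" or (t == "I" and start is None):  # "I" after "O" opens a chunk
            if start is not None:
                spans.add((start, i))                 # close the previous chunk
            start = i
        elif t == "O" and start is not None:
            spans.add((start, i))
            start = None
    if start is not None:
        spans.add((start, len(bio)))
    return spans

def f1(gold, pred):
    g, p = chunks(gold), chunks(pred)
    correct = len(g & p)
    prec = correct / len(p) if p else 0.0
    rec = correct / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(f1(["B", "I", "O", "B"], ["B", "I", "O", "O"]))  # -> 0.666...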

Experiments

The authors present two sets of results. In the first, they compare F-scores across several different chunking systems, together with significance tests on the comparisons. The results clearly show that the proposed method is as good as any reported result, as shown below.

[Table: F-scores of the compared chunking systems, with significance test results]

In the second, they compare training-time efficiency across optimization methods, as shown below: the L-BFGS method used in this paper significantly outperforms all the other methods in time efficiency, which makes training on large corpora practical. They also provide several figures with detailed comparisons between the optimization methods.

[Table: training times for the compared optimization methods]