Sha 2003 Shallow Parsing with Conditional Random Fields
Fei, S. and Pereira, F. 2003. Shallow Parsing with Conditional Random Fields. In Proceedings of NAACL.
An online version of this paper is available .
The paper presents the following key findings
- The authors claim CRF outperform other models or as good as any reported results in shallow parsing, one of the applications of sequential labeling task
- The authors also report that the improved training methods proves great efficiency as shown in their experimental results
Conditional Random Fields
The paper focuses on conditional random fields on sequences. Such CRFs define conditional probability distributions p(Y|X) of label sequences given input sequences. The label and input sequences are assumed to have the same length.
A CRF on (X, Y) is specified by a local feature vector and a weight vector, the local features are defined as follows:
And the global feature vector is thus an aggregate of the local features:
The conditional probability distribution defined by CRF is then defined as
In sequential labeling task, we would like to find the most probable label sequence for input sequence, thus we use the following formula
The decoding process can be done with the Viterbi algorithm.
The early training methods proposed was iterative scaling algorithm from maximum entropy model. However, later work shows that other optimization method has better convergence speed, for example conjugate gradient method. In this paper, authors test the preconditioned conjugate-gradient (CG) and limited-memory quasi newton (L-BFGS) method and show their superior performance on large problems. They also provide a comparison of these algorithms over other methods.
Shallow parsing is also called NP chunking, the input consists of the words in a sequence annotated with POS tags. And the task is to label each words with a tag indicating whether it is inside a chunk (O), starts a chunk (B), or inside a chunk (I). The authors use modified CoNLL dataset with a development test set from WSJ section 21.
The authors model a chunking CRF for this purpose. The chunking CRF has a factored local feature set with predicates from pairs of labels and features based on surrounding words and/or their POS tags. The feature set is illustrated in the following figure:
Also the model includes a Gaussian weight prior as penalty term to reduce overfitting. The evaluation is based on F-measures on the output label sequences and the gold standard label sequences. The authors also present a significance tests on comparison of different chunking systems.
The authors present two sets of results, in the first one, they present the F-score comparison between several different chunking systems, and the significance test results on the comparisons. It clearly shows that the proposed method is as good as any reported result, which is as shown below.
In the second one, they present the training time efficiency on different optimization methods as shown below, which tells that the L-BFGS used in this paper significantly outperforms all other methods in time efficiency thus allow the training on large corpus. They also provide a few figures on detailed comparison between optimization methods.