# Roth and Yih, ICML 2005. Integer Linear Programming Inference for Conditional Random Fields

## Citation

Dan Roth and Wen-tau Yih. 2005. Integer Linear Programming Inference for Conditional Random Fields. In Proceedings of the ${\displaystyle 22^{nd}}$ International Conference on Machine Learning, ICML '05, New York, NY, USA.

## Summary

This paper presents an alternative approach to inference in conditional random fields (CRFs) based on integer linear programming (ILP). The standard Viterbi algorithm, based on dynamic programming, in general cannot efficiently incorporate non-local and non-sequential constraints over the output sequence. The authors propose an ILP-based inference procedure and extend CRF models to naturally and efficiently support general constraint structures. When only sequential constraints are present, the procedure reduces to solving a linear program.

## Method

The efficiency of the CRF approach depends heavily on its Markov property: given the observation, the label of a token is assumed to depend only on the labels of its adjacent tokens. Under this assumption it is difficult to explicitly model and exploit more general constraints, such as long-distance dependencies. This is a problem not only for training CRFs but also for incorporating additional constraints during inference. The Viterbi algorithm can handle some of these constraints, but not more general ones; the proposed ILP-based method can take such constraints into account efficiently.

### Conditional Random Field

A standard CRF model is represented by:

${\displaystyle P_{\lambda }(y|x)={\frac {\exp(\lambda \cdot F(y,x))}{Z_{\lambda }(x)}}}$

where ${\displaystyle F(y,x)}$ is the global feature vector, ${\displaystyle \lambda }$ is the weight vector, and ${\displaystyle Z_{\lambda }(x)=\sum _{y}\exp(\lambda \cdot F(y,x))}$ is a normalization factor.

A CRF is trained by maximizing the conditional log-likelihood of a given training set ${\displaystyle T=\{(x_{k},y_{k})\}_{k=1}^{N}}$ :

${\displaystyle L_{\lambda }=\sum _{k}\log(p_{\lambda }(y_{k}|x_{k}))=\sum _{k}[\lambda \cdot F(y_{k},x_{k})-\log Z_{\lambda }(x_{k})]}$
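To make the definitions concrete, here is a minimal sketch that evaluates ${\displaystyle P_{\lambda }(y|x)}$ for a hypothetical two-label toy model by brute-force enumeration; the feature set and weights are made up for illustration:

```python
import itertools
import math

LABELS = [0, 1]

def features(y, x):
    """Global feature vector F(y, x): (label, observation) co-occurrence
    counts plus label-transition counts (a toy feature set)."""
    f = [0.0] * 6
    for yi, xi in zip(y, x):
        f[yi * 2 + xi] += 1.0               # emission-style features
    for a, b in zip(y, y[1:]):
        f[4 + (1 if a == b else 0)] += 1.0  # transition features
    return f

def score(lam, y, x):
    return sum(l * fi for l, fi in zip(lam, features(y, x)))

def log_Z(lam, x):
    """Z_lambda(x) = sum_y exp(lambda . F(y, x)) over all label sequences."""
    return math.log(sum(math.exp(score(lam, y, x))
                        for y in itertools.product(LABELS, repeat=len(x))))

def log_prob(lam, y, x):
    """log P_lambda(y | x) = lambda . F(y, x) - log Z_lambda(x)."""
    return score(lam, y, x) - log_Z(lam, x)

x = (0, 1, 1)
lam = [0.5, -0.2, 0.1, 0.8, 0.3, -0.1]
total = sum(math.exp(log_prob(lam, y, x))
            for y in itertools.product(LABELS, repeat=len(x)))
print(round(total, 6))  # probabilities over all label sequences sum to 1
```

Enumerating all ${\displaystyle m^{n}}$ sequences is only feasible for toy inputs; the Markov structure of ${\displaystyle F}$ is what lets real CRFs compute ${\displaystyle Z_{\lambda }(x)}$ by dynamic programming instead.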

### Inference Using ILP

Inference in CRFs is usually performed by the Viterbi algorithm, which finds the best path from the start to the end of the label lattice. The weights on the edges determine the cost of the global path; each edge weight (log score) is obtained from a linear combination of the feature functions with the weight vector.
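As a sketch of this lattice view, the toy Viterbi decoder below finds the highest-scoring label path by dynamic programming. The transition log-scores are made up; a real CRF would derive them from ${\displaystyle \lambda \cdot F}$ at each position.

```python
# Toy transition log-scores (hypothetical numbers).
start = [1.0, 0.0]                 # log-score of starting in each label
trans = [[0.5, 2.0], [2.0, 0.5]]   # trans[y_prev][y]
n = 3                              # sequence length

def viterbi():
    """Dynamic programming over the label lattice: delta[y] is the best
    log-score of any partial path ending in label y."""
    delta = list(start)
    back = []
    for _ in range(1, n):
        ptr = [max(range(2), key=lambda yp: delta[yp] + trans[yp][y])
               for y in range(2)]
        delta = [delta[ptr[y]] + trans[ptr[y]][y] for y in range(2)]
        back.append(ptr)
    # follow back-pointers from the best final label
    y = max(range(2), key=lambda yy: delta[yy])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]

print(viterbi())  # -> [0, 1, 0]: alternating labels score highest here
```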

This approach is represented in ILP as follows:

Given a directed graph ${\displaystyle G=(V,E)}$, two distinct nodes ${\displaystyle s,t\in V}$, and a non-negative cost ${\displaystyle c_{uv}}$ of each edge ${\displaystyle (u,v)\in E}$, a minimum cost path from ${\displaystyle s}$ to ${\displaystyle t}$ is desired. For each edge ${\displaystyle (u,v)\in E}$, an indicator variable ${\displaystyle x_{uv}}$ is introduced. If ${\displaystyle (u,v)}$ is in the minimum cost (shortest) path, then ${\displaystyle x_{uv}}$ is set to 1; otherwise, it is 0. The cost function is, ${\displaystyle \sum _{(u,v)\in E}c_{uv}\cdot x_{uv}}$ .
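The formulation above can be sketched directly: the toy solver below enumerates all 0/1 assignments to the indicator variables ${\displaystyle x_{uv}}$, keeps those satisfying the flow-conservation constraints, and minimizes the cost. The graph and costs are hypothetical; a real system would hand the same program to an ILP solver instead of enumerating.

```python
import itertools

edges = {('s', 'a'): 1.0, ('s', 'b'): 4.0, ('a', 'b'): 1.0,
         ('a', 't'): 5.0, ('b', 't'): 1.0}
nodes = ['s', 'a', 'b', 't']

def flow_ok(x):
    """Flow conservation: net outflow is +1 at s, -1 at t, 0 elsewhere,
    so the selected edges form a single s-to-t path."""
    for v in nodes:
        out_v = sum(x[e] for e in edges if e[0] == v)
        in_v = sum(x[e] for e in edges if e[1] == v)
        want = 1 if v == 's' else (-1 if v == 't' else 0)
        if out_v - in_v != want:
            return False
    return True

best_cost, best_x = None, None
for bits in itertools.product([0, 1], repeat=len(edges)):
    x = dict(zip(edges, bits))
    if not flow_ok(x):
        continue
    cost = sum(edges[e] * x[e] for e in edges)  # objective: sum of c_uv * x_uv
    if best_cost is None or cost < best_cost:
        best_cost, best_x = cost, x

path = [e for e, v in best_x.items() if v == 1]
print(best_cost, path)  # 3.0 via s -> a -> b -> t
```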

The following figure depicts the ILP-based representation of CRFs subject to a set of constraints:

where ${\displaystyle -\log M_{i}(y,y^{\prime })}$ is the edge cost, so minimizing the total cost is equivalent to maximizing the log-probability of the label sequence. The start and end nodes are denoted by ${\displaystyle -1}$ and ${\displaystyle n}$ respectively, and ${\displaystyle n}$ and ${\displaystyle m}$ represent the sequence length and the number of possible labels. Variable ${\displaystyle x_{i,yy^{\prime }}}$ indicates whether the edge from label ${\displaystyle y}$ to label ${\displaystyle y^{\prime }}$ at position ${\displaystyle i}$ is on the path.

### Constraint Representation Using ILP

The set of constraints in the output space can be expressed using a set of Boolean functions. The authors lay down some example constraints in their model:

#### Example 1

To force the label of token ${\displaystyle i}$ to be ${\displaystyle y_{0}}$ (i.e., exactly one edge entering label ${\displaystyle y_{0}}$ at position ${\displaystyle i}$ is selected):

${\displaystyle \sum _{0\leq y\leq m-1}x_{i,yy_{0}}=1}$

#### Example 2

There is no "duplication" of segments in a sequence, i.e., at most one segment carries label ${\displaystyle a}$. A segment labeled ${\displaystyle a}$ begins whenever an edge enters ${\displaystyle a}$ from a different label, so:

${\displaystyle \sum _{0\leq i\leq n-1}\sum _{b\neq a}x_{i,ba}\leq 1}$

where the indicator variable ${\displaystyle x_{i,ba}}$ represents the edge from label ${\displaystyle b}$ to label ${\displaystyle a}$ at position ${\displaystyle i}$.

#### Example 3

If label ${\displaystyle a}$ appears, then label ${\displaystyle b}$ must also appear somewhere in the sequence:

${\displaystyle \sum _{0\leq y\leq m-1}x_{i,ya}\leq \sum _{0\leq j\leq n-1}\sum _{0\leq y\leq m-1}x_{j,yb}}$

for all ${\displaystyle i}$ such that ${\displaystyle 0\leq i\leq n-1}$.

#### Example 4

A segment ${\displaystyle {\mathcal {A}}}$ of tokens must share the same label. For every such segment spanning positions ${\displaystyle p}$ to ${\displaystyle q}$, define:

${\displaystyle v_{i,y}=\sum _{0\leq y^{\prime }\leq m-1}x_{i,y^{\prime }y}}$ ,

${\displaystyle (q-p)v_{p,l}\leq \sum _{p+1\leq i\leq q}v_{i,l}}$

for all ${\displaystyle l}$ such that ${\displaystyle 0\leq l\leq m-1}$, where ${\displaystyle v_{i,y}}$ is a binary variable indicating whether token ${\displaystyle i}$ is assigned label ${\displaystyle y}$.
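A quick sanity check of this constraint, as a sketch with made-up labels:

```python
m = 3        # number of labels
p, q = 1, 3  # the segment spans tokens p..q

def segment_ok(labels):
    """(q - p) * v[p][l] <= sum_{p+1 <= i <= q} v[i][l] for every label l,
    where v[i][l] = 1 iff token i is assigned label l."""
    v = [[1 if lab == l else 0 for l in range(m)] for lab in labels]
    return all((q - p) * v[p][l] <= sum(v[i][l] for i in range(p + 1, q + 1))
               for l in range(m))

print(segment_ok([0, 2, 2, 2, 1]))  # True: tokens 1..3 all share label 2
print(segment_ok([0, 2, 1, 2, 1]))  # False: token 2 breaks the segment
```

If token ${\displaystyle p}$ takes label ${\displaystyle l}$, the left side is ${\displaystyle q-p}$, which forces every one of the ${\displaystyle q-p}$ binary variables on the right to be 1.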

#### Example 5

Each sequence must have at least one segment of interest, i.e., some label other than "O" in the BIO representation must be present:

${\displaystyle \sum _{0\leq i\leq n-1}v_{i,\mathrm {O} }\leq n-1}$

All constraints except Example 1 involve long-distance dependencies and cannot be captured by the Viterbi algorithm at inference time, but ILP techniques can represent them efficiently.
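The following sketch illustrates why such constraints matter at inference time. On a toy scoring problem (scores are made up), the unconstrained argmax violates a no-duplicate-segment constraint, while constrained decoding, done here by brute force as a stand-in for the ILP, returns the best sequence that satisfies it:

```python
import itertools

LABELS = [0, 1]
n = 4
# hypothetical per-position, per-label log-scores (made up)
score = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

def total(y):
    return sum(score[i][yi] for i, yi in enumerate(y))

def at_most_one_segment(y, label=1):
    """Non-local constraint: at most one maximal run of `label`."""
    starts = sum(1 for i, yi in enumerate(y)
                 if yi == label and (i == 0 or y[i - 1] != label))
    return starts <= 1

seqs = list(itertools.product(LABELS, repeat=n))
best = max(seqs, key=total)  # unconstrained argmax
best_ok = max((y for y in seqs if at_most_one_segment(y)), key=total)
print(best)     # (1, 0, 1, 0): two separate runs of label 1
print(best_ok)  # the best sequence with at most one run of label 1
```

The constraint ties together label decisions at arbitrary distances, which is exactly what a first-order Viterbi pass cannot see.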

## Experiments and Results

This approach was tested on the important NLP task of semantic role labeling (SRL). The authors used the following constraints in their model to perform the task:

• No duplicate argument labels in a sentence.
• Argument-level i.e., chunk-level candidate information.
• At least one argument for a verb in a sentence.
• Known verb position.
• Disallowing certain arguments in the structure based on verb frames.

In this framework, only the last two constraints can be incorporated into the Viterbi algorithm used for decoding.

### Dataset

• The method was trained on sections 15–18 of PropBank and tested on section 20.
• The authors included Arg0,...,Arg5 in their evaluation.

### Results

Results were reported for two SRL systems based on CRF models. The first was trained by maximizing the conditional log-likelihood (using the quasi-Newton optimization algorithm L-BFGS, with a Gaussian prior, ${\displaystyle \sigma ^{2}}$ = 0.01). The second was trained using the discriminative voted-perceptron method of Sha and Pereira (2003).

Table 1 shows the performance in recall, precision, and F-measure. The table shows that these general constraints improve the results, with statistically significant gains for both CRF training algorithms.

## Related Papers

1. Kristjannson, T., Culotta, A., Viola, P., & McCallum, A. 2004. Interactive information extraction with constrained conditional random fields. In Proceedings of AAAI-2004

2. Punyakanok, V., Roth, D., Yih, W., & Zimak, D. 2004. Semantic role labeling via integer linear programming inference. In Proceedings of COLING-2004