# SEARN

## Citation

A detailed description and introduction to SEARN is available here.

## Motivation

SEARN is a meta-algorithm for doing structured prediction. The basic premise is to combine learning and searching to transform a complex structured prediction problem into a simple classification problem. The learning algorithm is designed to begin with a classifier that uses the training data directly, and use that classifier to produce a fully-learned classifier. In this sense, it moves away from the classifier created by the training data.

## Input

Running the SEARN meta-algorithm requires a few different inputs.

• ${\displaystyle L(y,f({\hat {y}}))}$ - A loss function, which must be computable for any sequence of predictions
• ${\displaystyle A}$ - A cost-sensitive learning algorithm. This algorithm will produce learned classifiers, which SEARN refers to as policies.
• ${\displaystyle \pi }$ - The optimal policy. This should produce low loss when applied to the training data

## Algorithm

The algorithm is defined in Daume et al, ML 2009

First, begin with the current policy being the optimal policy. The goal of the algorithm is to actually move away from the optimal policy, in an effort to produce a more generalized classifier. This is because the optimal policy is generally produced by searching the training data, or sometimes is provided by an expert system.

The outermost loop requires an unclear amount of iterations. The design is that once the model has run through enough iterations, the interpolation weight on the initial policy should be degraded far enough that the model is learned.

The inner loop involves using the current policy to classify examples, and then use those examples with the learning algorithm to produce a new policy. We then find an interpolation constant and interpolate the newly learned classifier with the current policy and update the current policy.

The interpolation constant ${\displaystyle \beta }$ is generally different throughout each iteration of the algorithm. A practical value is to use ${\displaystyle 1/T^{3}}$ on the initial iteration. A general note is that ${\displaystyle \beta }$ is often much higher after the first iteration, as the first iteration moves the classifier from completely fitting the training data to a statistical method.

Approximation can also be used to avoid keeping the algorithm from being prohibitively expensive.

## Output

The policy ${\displaystyle h_{last}}$ given by the algorithm can be used to classify sequences.

## Theoretical Analysis

It is provable that there is a bound on the amount of degradation that occurs when updating the new policy in the algorithm, and it is bounded by the following equation:

${\displaystyle L(D,h_{new})\leq L(D,h)+T\beta \ell _{h}^{CS}(h')+{1/2}\beta ^{2}T^{2}c_{max}}$

In addition, we can also say that after ${\displaystyle C/\beta }$ iterations, the loss is bounded by the following equation:

${\displaystyle L(D,h_{last})\leq L(D,\pi )+CT\ell _{avg}+c_{max}({1/2}CT^{2}\beta +T\operatorname {exp} [-C])}$

The proofs for the above theorems can be found here.

## Comparison To Other Methods

One of the major benefits of SEARN is the flexibility it allows. It allows for arbitrary classifiers, an arbitrary algorithm for learning, and an arbitrary loss function. Perceptron models are generally more limited, as they cannot use an arbitrary loss function. CRFs are generally required to be applied to linear chain models, which SEARN is not limited to.