Gimpel and Smith, NAACL 2010

Softmax-Margin CRFs: Training Log-Linear Models with Cost Functions

This paper can be found at: [1]

Citation

Kevin Gimpel and Noah A. Smith. Softmax-margin CRFs: Training log-linear models with cost functions. In Proceedings of the Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, pages 733-736, Los Angeles, California, USA, June 2010.

Summary

The authors want to incorporate a cost function (as used in structured SVMs) into standard conditional log-likelihood models. They introduce the softmax-margin objective function, which combines the two approaches. On an NER task, it performs significantly better than a standard conditional log-likelihood model, a max-margin model, and the perceptron, but is statistically indistinguishable from MIRA, risk, and JRB (Jensen risk bound; defined in the paper).

Brief Description of the Softmax-Margin objective function

Consider the objective functions of the four methods listed at the end of this section: conditional log-likelihood, max-margin, risk, and softmax-margin. The authors' goal is to combine elements of conditional log-likelihood and max-margin, and the softmax-margin objective has terms from each of these two methods. The paper lays out three rationales:

Bigger mistakes should be penalized more, as in max-margin methods.
Take the conditional log-likelihood objective and simply add a cost term.
"replace the 'hard' maximum of max-margin with the 'softmax' ( $\log \sum \exp$ ) from [conditional log likelihood]; hence we use the name 'softmax-margin'".

One attraction of softmax-margin is that it is convex, so it can be optimized easily. In the paper, the authors prove that the softmax-margin objective is greater than or equal to (an upper bound on) both the conditional log-likelihood objective and the max-margin objective.

Conditional log likelihood: $\min _{\theta }\sum _{i=1}^{n}-{\boldsymbol {\theta }}^{T}{\boldsymbol {f}}(x^{(i)},y^{(i)})+\log \sum _{y\in {\mathcal {Y}}(x^{(i)})}\exp\{{\boldsymbol {\theta }}^{T}{\boldsymbol {f}}(x^{(i)},y)\}$

Max-margin: $\min _{\theta }\sum _{i=1}^{n}-{\boldsymbol {\theta }}^{T}{\boldsymbol {f}}(x^{(i)},y^{(i)})+\max _{y\in {\mathcal {Y}}(x^{(i)})}({\boldsymbol {\theta }}^{T}{\boldsymbol {f}}(x^{(i)},y)+cost(y^{(i)},y))$

Risk: $\min _{\theta }\sum _{i=1}^{n}\sum _{y\in {\mathcal {Y}}(x^{(i)})}cost(y^{(i)},y){\dfrac {\exp\{{\boldsymbol {\theta }}^{T}{\boldsymbol {f}}(x^{(i)},y)\}}{\sum _{y'\in {\mathcal {Y}}(x^{(i)})}\exp\{{\boldsymbol {\theta }}^{T}{\boldsymbol {f}}(x^{(i)},y')\}}}$

Softmax-margin: $\min _{\theta }\sum _{i=1}^{n}-{\boldsymbol {\theta }}^{T}{\boldsymbol {f}}(x^{(i)},y^{(i)})+\log \sum _{y\in {\mathcal {Y}}(x^{(i)})}\exp\{{\boldsymbol {\theta }}^{T}{\boldsymbol {f}}(x^{(i)},y)+cost(y^{(i)},y)\}$
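The four objectives above can be compared numerically on a tiny example. The following is a minimal sketch, not the authors' implementation: it evaluates each objective for a single training example by brute-force enumeration of all label sequences. The tag set, the input words, the feature templates, the weights in `theta`, and the use of Hamming distance as the cost are all assumptions made for illustration.

```python
import itertools
import math

LABELS = ("O", "PER")       # hypothetical tag set
WORDS = ("john", "runs")    # hypothetical input sentence x


def features(x, y):
    """Sparse feature counts f(x, y): emission and transition indicators (illustrative)."""
    f = {}
    for i, (word, tag) in enumerate(zip(x, y)):
        key = ("emit", word, tag)
        f[key] = f.get(key, 0) + 1
        if i > 0:
            key = ("trans", y[i - 1], tag)
            f[key] = f.get(key, 0) + 1
    return f


def score(theta, x, y):
    """Linear score theta^T f(x, y)."""
    return sum(theta.get(k, 0.0) * v for k, v in features(x, y).items())


def hamming_cost(y_true, y):
    """Assumed cost function: number of positions where y differs from y_true."""
    return sum(a != b for a, b in zip(y_true, y))


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def objectives(theta, x, y_true):
    """Per-example value of each objective, enumerating every y in Y(x)."""
    ys = list(itertools.product(LABELS, repeat=len(x)))
    s = {y: score(theta, x, y) for y in ys}
    c = {y: hamming_cost(y_true, y) for y in ys}
    gold = s[tuple(y_true)]
    log_z = logsumexp(list(s.values()))
    return {
        "cll": -gold + log_z,
        "max_margin": -gold + max(s[y] + c[y] for y in ys),
        "risk": sum(c[y] * math.exp(s[y] - log_z) for y in ys),
        "softmax_margin": -gold + logsumexp([s[y] + c[y] for y in ys]),
    }


# Hypothetical weights; any values would do for checking the bounds.
theta = {
    ("emit", "john", "PER"): 1.5,
    ("emit", "runs", "O"): 1.0,
    ("trans", "PER", "O"): 0.5,
}
vals = objectives(theta, WORDS, ("PER", "O"))
for name, value in vals.items():
    print(f"{name}: {value:.4f}")
```

With a non-negative cost, the printed softmax-margin value should be at least as large as both the conditional log-likelihood and max-margin values, which is the bound described earlier. Enumeration is only feasible for toy inputs; a real sequence model would presumably use dynamic programming instead.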

Experimental Results

The authors perform a small experiment on the CoNLL 2003 shared task (named-entity recognition), taking care to give each model the same features. Models are compared by F1 score.

JRB (Jensen risk bound) is defined as the function $\sum _{i=1}^{n}\log \mathbb {E} _{(i)}[\exp cost(y^{(i)},\cdot )]$. It is an upper bound on risk but is much easier to compute; risk, moreover, is not necessarily convex.
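A short derivation may help show how JRB relates to the other objectives. It assumes that $\mathbb {E} _{(i)}$ denotes expectation under the model distribution $p_{\theta }(y\mid x^{(i)})\propto \exp\{{\boldsymbol {\theta }}^{T}{\boldsymbol {f}}(x^{(i)},y)\}$; this reading of the notation is an assumption, since it is not spelled out on this page.

Jensen's inequality (the logarithm is concave): $\log \mathbb {E} _{(i)}[\exp cost(y^{(i)},\cdot )]\geq \mathbb {E} _{(i)}[cost(y^{(i)},\cdot )]$

The right-hand side is the risk term for example $i$, so summing over $i$ gives JRB $\geq$ risk.

Expanding the expectation: $\log \mathbb {E} _{(i)}[\exp cost(y^{(i)},\cdot )]=\log \sum _{y\in {\mathcal {Y}}(x^{(i)})}\exp\{{\boldsymbol {\theta }}^{T}{\boldsymbol {f}}(x^{(i)},y)+cost(y^{(i)},y)\}-\log \sum _{y\in {\mathcal {Y}}(x^{(i)})}\exp\{{\boldsymbol {\theta }}^{T}{\boldsymbol {f}}(x^{(i)},y)\}$

Adding the conditional log-likelihood term for example $i$ cancels the second logarithm and leaves the softmax-margin term. Under this assumption, the softmax-margin objective equals the conditional log-likelihood objective plus JRB.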

Results

Softmax-margin is statistically significantly better than the conditional log-likelihood, max-margin, and perceptron models, but statistically indistinguishable from MIRA, risk, and JRB.

Related Work

The longer version of this work: K. Gimpel and N. A. Smith. 2010. Softmax-margin training for structured log-linear models. Technical report, Carnegie Mellon University.
Other people have tried to incorporate costs into a model. Previous literature:
