Structured Prediction Cascades
 

This method was proposed by Weiss et al., AISTATS 2010.

This page is a write-up by Dan Howarth.


== Citation ==

Structured Prediction Cascades. David Weiss and Ben Taskar. International Conference on Artificial Intelligence and Statistics (AISTATS), May 2010.

== Online version ==

[1]

== Summary ==

In many structured prediction models an increase in model complexity comes at a high computational cost. For example, the inference cost of an HMM grows exponentially with the order of the model. This work introduces a method for learning increasingly complex models while continually pruning the possible output space, "weeding out" incorrect output states early on.

Previous methods for managing model complexity were commonly approximate search methods or heuristic pruning techniques. Structured prediction cascades differ in that they explicitly learn the error/computation trade-off at each increase in model complexity.

In this work structured prediction cascades are applied to handwriting recognition and POS tagging.

== Brief description of the method ==

The linear hypothesis class considered is of the form <math>h_{\theta}(x) = \arg\max_{y \in \mathcal{Y}} \sum_{c \in \mathcal{C}} \theta^\mathrm{T} f_c(x,y_c)</math>, where <math>\mathcal{C}</math> is the set of cliques.

At each level of the cascade the method is given the set of possible clique assignments as input, and each level then filters this set and passes the filtered set as input to the next level.
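The level-by-level filtering can be sketched in Python. This skeleton is illustrative only: each entry of `score_fns` stands in for a real model's max-marginal scoring of an assignment, which the actual method computes by inference at that level, and the names are hypothetical.

```python
def cascade_filter(score_fns, assignments, alpha):
    """Run the cascade levels as filters; return the surviving set.

    score_fns:   one scoring function per level, mapping an assignment to
                 its max-marginal score under that level's model
    assignments: initial set of candidate clique assignments
    alpha:       interpolation parameter of the pruning threshold
    """
    valid = set(assignments)
    for score in score_fns:
        mm = {a: score(a) for a in valid}
        # threshold: alpha * best score + (1 - alpha) * mean max-marginal
        t = alpha * max(mm.values()) + (1 - alpha) * sum(mm.values()) / len(mm)
        valid = {a for a, s in mm.items() if s >= t}  # prune below threshold
    return valid
```

Each level receives only the assignments that survived the previous level, so later (more complex) models do inference over a much smaller space.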

<math>\mathcal{V}_c</math> is the set of possible output assignments to clique <math>c</math>.

Let the following be defined for the <math>i</math>'th model:

<math>\mathcal{C}^i</math>, the set of maximal cliques

<math>\mathcal{V}^i_c</math>, the set of valid output assignments for clique <math>c</math>

Before pruning, <math>\mathcal{V}^i_c</math> is the set of assignments to the cliques in level <math>i</math> that each contain as a subset a valid assignment from the <math>(i-1)</math>'th level.

Any <math>y_c \in \mathcal{V}^i_c</math> is then pruned if its max-marginal score is less than a threshold.

The threshold is the function <math>t(x,\alpha) = \alpha \theta^*(x) + (1-\alpha)\frac{1}{|\mathcal{V}|}\sum_{c \in \mathcal{C}, y_c \in \mathcal{V}_c} \theta^*(x,y_c)</math>

where

<math>\alpha</math> determines the number of max-marginals eliminated,

<math>\theta(x,y) = \theta^\mathrm{T}f(x,y)</math>, the score of output <math>y</math>,

<math>\theta^*(x,y_c) = \max_{y'\in \mathcal{Y}} \{ \theta(x,y') : y'_c = y_c\}</math>, the score of the best possible output that contains the assignment <math>y_c</math>,

<math>\theta^*(x) = \max_y \theta(x,y)</math>, the score of the best overall output.
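The max-marginals <math>\theta^*(x,y_c)</math> and the threshold rule can be illustrated for a chain-structured model. The following is a minimal NumPy sketch, not the authors' implementation: for brevity it prunes per-position state assignments rather than general clique assignments, and the `node` and `trans` score arrays are assumed to come from some feature model.

```python
import numpy as np

def max_marginals(node, trans):
    """Max-marginals of state assignments in a chain model.

    node:  (T, S) array of per-position state scores
    trans: (S, S) array of transition scores (shared across positions)
    Returns an (T, S) array m with m[t, s] = max over all sequences y
    with y_t = s of the total chain score.
    """
    T, S = node.shape
    fwd = np.zeros((T, S))  # best score of a prefix ending in state s at t
    bwd = np.zeros((T, S))  # best score of a suffix after t, given state s at t
    fwd[0] = node[0]
    for t in range(1, T):
        fwd[t] = node[t] + np.max(fwd[t - 1][:, None] + trans, axis=0)
    for t in range(T - 2, -1, -1):
        bwd[t] = np.max(trans + (node[t + 1] + bwd[t + 1])[None, :], axis=1)
    return fwd + bwd

def prune(node, trans, alpha):
    """One filtering step: keep state s at position t iff its max-marginal
    clears t(x, alpha) = alpha * theta*(x) + (1 - alpha) * mean max-marginal."""
    m = max_marginals(node, trans)
    best = m.max()  # theta*(x): score of the best overall output
    threshold = alpha * best + (1 - alpha) * m.mean()
    return m >= threshold  # boolean mask of surviving assignments
```

For <math>\alpha = 1</math> the threshold equals the best score, so only assignments on a best-scoring output survive; as <math>\alpha</math> decreases toward 0 the threshold falls toward the mean max-marginal and pruning becomes less aggressive.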

=== Learning ===

At each level of the cascade the best <math>\theta</math> and <math>\alpha</math> must be inferred.

At each level two competing objectives exist: accuracy and efficiency. This trade-off is quantified by the ''filtering loss'' <math>\mathcal{L}_f</math>, which indicates that the true output was pruned, and the ''efficiency loss'' <math>\mathcal{L}_e</math>, the proportion of clique assignments that survive filtering:

<math>\mathcal{L}_f = 1[\theta(x,y) \le t(x,\alpha)]</math>

<math>\mathcal{L}_e = \frac{1}{|\mathcal{V}|}\sum_{c \in \mathcal{C}, y_c \in \mathcal{V}_c} 1[\theta^*(x,y_c) > t(x,\alpha)]</math>

The cascade learning objective is then expressed by:

<math>\min_{\theta,\alpha} \mathbb{E}[\mathcal{L}_e(Y, \theta(X))] \text{ s.t. } \mathbb{E}[\mathcal{L}_f(Y,\theta(X))] \le \epsilon</math>

where <math>\epsilon</math> is a parameter that bounds the expected error rate.

Given <math>\alpha</math>, <math>\theta</math> is chosen by optimizing the convex problem:

<math> \inf_{\theta} \frac{\lambda}{2}||\theta||^2 + \frac{1}{n} \sum_i H(\theta; (x^i, y^i))</math>

where <math>H</math> is a convex upper bound on <math>\mathcal{L}_f</math>:

<math>H(\theta;(x^i, y^i)) = \max\{0, l + t(x^i,\alpha) - \theta^\mathrm{T}f(x^i, y^i)\}</math>

<math>H</math> is a hinge loss that is positive unless the true score exceeds the threshold by a margin of at least <math>l</math>. <math>\theta</math> is solved for using stochastic sub-gradient descent.
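The sub-gradient step can be sketched as follows. This is a simplified illustration under assumptions not in the write-up: the arg-max "witness" outputs behind <math>\theta^*(x)</math> and each <math>\theta^*(x,y_c)</math> are taken as precomputed and held fixed, which makes the threshold linear in <math>\theta</math>; all names (`feats_true`, `feats_best`, `feats_witness`) are hypothetical.

```python
import numpy as np

def subgradient_step(theta, feats_true, feats_best, feats_witness,
                     alpha, l, lam, lr):
    """One stochastic sub-gradient step on lam/2 ||theta||^2 + H(theta; (x, y)).

    feats_true:    f(x, y), features of the true output
    feats_best:    f(x, y*), features of the arg-max output (witness of theta*(x))
    feats_witness: (K, d) features of the outputs witnessing each
                   max-marginal theta*(x, y_c)
    """
    # With witnesses fixed, t(x, alpha) is linear in theta with gradient:
    grad_t = alpha * feats_best + (1 - alpha) * feats_witness.mean(axis=0)
    t = theta @ grad_t
    # hinge H = max(0, l + t - score of the true output)
    H = max(0.0, l + t - theta @ feats_true)
    grad = lam * theta  # gradient of the L2 regularizer
    if H > 0:
        grad += grad_t - feats_true  # sub-gradient of the active hinge
    return theta - lr * grad
```

When the hinge is active, the update pushes the true output's score up and the threshold down, so the true output clears the filter by the margin <math>l</math>.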

== Experimental Result ==

In this work structured prediction cascades were applied to handwriting recognition and POS tagging.

The handwriting recognition results were not compared against other systems, but rather demonstrate the dramatic increase in accuracy as the order of the Markov model increases. The paper reports that the search space of the 4-gram model was reduced by roughly five orders of magnitude while incurring only a 3.41% filtering error.


{| border=1 cellpadding=3 cellspacing=0
! Model Order:
! 1
! 2
! 3
! 4
|-
! Accuracy, Char (%)
| align=center | 77.44
| align=center | 85.69
| align=center | 87.95
| align=center | 92.25
|-
! Accuracy, Word (%)
| align=center | 26.65
| align=center | 49.44
| align=center | 73.83
| align=center | 84.46
|-
! Filter Loss (%)
| align=center | 0.56
| align=center | 0.99
| align=center | 3.41
| align=center | -
|-
! Avg. Num n-grams
| align=center | 26.0
| align=center | 123.8
| align=center | 88.6
| align=center | 5.4
|}


POS tagging was evaluated on the WSJ portion of the Penn Treebank (Marcus et al., 1993). The following table compares the standard structured perceptron (SP), structured cascades using a Markov model (SC), a conditional random field (CRF), and a heuristic baseline in which only the POS tags observed with a given word in the training set were searched during inference (Tags). Accuracy was essentially identical across the methods, but structured cascades were substantially more efficient at test time.

{| border=1 cellpadding=3 cellspacing=0
! Model:
! SP
! SC
! CRF
! Tags
|-
! Accuracy (%)
| align=center | 96.83
| align=center | 96.82
| align=center | 96.84
| align=center | -
|-
! Filter Loss (%)
| align=center | 0
| align=center | 0.121
| align=center | '''0.024'''
| align=center | 0.118
|-
! Test Time (ms)
| align=center | 173.28
| align=center | '''1.56'''
| align=center | 4.16
| align=center | 10.6
|-
! Avg. Num States
| align=center | 1935.7
| align=center | '''3.93'''
| align=center | 11.845
| align=center | 95.39
|}

== Related papers ==