Structured Ensemble Cascades
This method was proposed by Weiss et al., NIPS 2010.
Citation
Sidestepping Intractable Inference with Structured Ensemble Cascades. David Weiss, Benjamin Sapp, and Ben Taskar. Neural Information Processing Systems (NIPS), December 2010.
Online version
Summary
This work introduces a method for handling intractable inference by "sidestepping" inference altogether: instead of one intractable model, a group of tractable sub-models is learned in a structured prediction cascade. For instance, inference in loopy graphical models is intractable; this method overcomes the intractability by splitting such a model into loop-free sub-models. It builds on the authors' previous work on structured prediction cascades, in which increasingly complex models are learned while the set of possible outputs is progressively pruned. See structured prediction cascades for more information about that method.
Brief description of the method
See the description of structured prediction cascades before continuing; the notation used there is also used here.
The method here is essentially the same, except that instead of a single model at each level there are <math>P</math> sub-models that need to be taken into account.
At each level the score of the overall model <math>\theta(x,y)</math> is defined as the sum of the sub-model scores: <math>\theta(x,y) = \sum_p \theta_p(x,y)</math>. The max marginals are defined similarly: <math>\theta^*(x,y_j) = \sum_p \theta^*_p(x, y_j)</math>.
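To make the additive structure concrete, here is a minimal sketch (hypothetical Python, not code from the paper) that treats each sub-model's max marginals as precomputed arrays and forms the ensemble max marginals by summation:

<pre>
import numpy as np

# Hypothetical setup: P sub-models, each scoring K candidate states for one
# variable y_j. In the actual method each theta_p*(x, y_j) would come from
# exact inference inside the (tractable) sub-model p.
P, K = 3, 5
rng = np.random.default_rng(0)

# theta_p*(x, y_j): max marginal of state y_j under sub-model p (assumed given).
sub_max_marginals = rng.normal(size=(P, K))

# Ensemble max marginal: theta*(x, y_j) = sum_p theta_p*(x, y_j)
ensemble_max_marginals = sub_max_marginals.sum(axis=0)
print(ensemble_max_marginals)
</pre>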
As in SPC, the <math>y_j</math> that are not pruned are those whose max marginals are above a threshold function. The threshold function is the sum of the threshold functions of the individual sub-models (as defined in SPC): <math>t(x,\alpha) = \sum_p t_p(x, \alpha)</math>.
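Continuing the same hypothetical sketch, pruning keeps the states whose summed max marginal exceeds the summed per-model threshold; the per-model thresholds <math>t_p(x, \alpha)</math> are assumed to be computed elsewhere, as in standard SPC:

<pre>
import numpy as np

def prune_states(sub_max_marginals, sub_thresholds):
    """Keep the states whose ensemble max marginal clears the ensemble threshold.

    sub_max_marginals: (P, K) array, theta_p*(x, y_j) for each sub-model p.
    sub_thresholds:    (P,)   array, t_p(x, alpha) for each sub-model p
                       (computed by the usual SPC threshold function).
    Returns the indices of the states y_j that survive this level.
    """
    ensemble_scores = sub_max_marginals.sum(axis=0)   # theta*(x, y_j)
    ensemble_threshold = sub_thresholds.sum()         # t(x, alpha)
    return np.flatnonzero(ensemble_scores > ensemble_threshold)

# Toy usage with random stand-ins for real inference output.
rng = np.random.default_rng(1)
kept = prune_states(rng.normal(size=(3, 5)), rng.normal(size=3))
print(kept)
</pre>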
Note that, unlike previous methods such as dual decomposition, it is not necessary that all sub-models agree on the argmax solution. This allows structured ensemble cascades to enjoy only a linear (factor of <math>P</math>) increase in inference time.
The objective that is learned is then the same as in SPC, with all models taken into account for smoothing:

<math> \inf_{\theta_1,\ldots,\theta_P} \; \frac{\lambda}{2}\left\|\sum_p \theta_p\right\|^2 + \frac{1}{n} \sum_i H(\theta; (x^i, y^i))</math>
Sub-gradient descent is used to find an optimal <math>\theta</math>, and each sub-model is updated only when a mistake has been made jointly.
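Below is a minimal sketch of one such update, under the simplifying assumptions that every sub-model is linear in a shared feature space, that each sub-model's threshold is represented by a "witness" feature vector, and that a joint mistake means the ensemble margin is violated; none of these specifics are taken from the paper.

<pre>
import numpy as np

def subgradient_step(W, feats_true, feats_witness, lam, lr):
    """One hedged subgradient step on a hinge-style cascade loss.

    W:             (P, D) array; row p holds sub-model p's weights theta_p.
    feats_true:    (P, D) features f_p(x, y_true) of the true output.
    feats_witness: (P, D) features of an assignment attaining each sub-model's
                   threshold (a stand-in for t_p(x, alpha)).
    All sub-models are updated together, but only when a mistake has been
    made jointly, i.e. the ensemble margin is violated.
    """
    score_true = np.sum(W * feats_true)     # theta(x, y_true) = sum_p w_p . f_p
    threshold = np.sum(W * feats_witness)   # stand-in for t(x, alpha) = sum_p t_p
    if score_true <= threshold + 1.0:       # joint mistake: hinge is active
        # The regularizer (lam/2) ||sum_p theta_p||^2 contributes
        # lam * sum_p theta_p to each sub-model's gradient in a shared space.
        reg_grad = lam * W.sum(axis=0)
        W = W - lr * (feats_witness - feats_true + reg_grad)
    return W

# Toy usage with random stand-ins for the real feature maps.
rng = np.random.default_rng(2)
P, D = 3, 8
W = np.zeros((P, D))
W = subgradient_step(W, rng.normal(size=(P, D)), rng.normal(size=(P, D)), lam=0.1, lr=0.5)
print(W.shape)
</pre>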
Experimental Results
The authors applied this method to a synthetic image segmentation task and to a human pose identification task.
For the synthetic segmentation task the method was compared to Loopy Belief Propagation. The chart below shows the results. Error rates were determined by counting the number of times the "ground truth" was pruned when the top K states were kept for each variable, where 1000 "ground truth" examples were drawn by Gibbs sampling from the joint distribution.

[[File:Weiss_et_al_NIPS_2010_A.png|250px]]
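As a rough illustration of how such a filtering error rate could be computed (hypothetical code; the exact evaluation protocol is described in the paper), one can check, for each sampled ground-truth assignment, whether its state at every variable survives when only the top K states per variable are kept:

<pre>
import numpy as np

def filtering_error_rate(max_marginals, ground_truths, k):
    """Fraction of ground-truth samples pruned when keeping the top-k states
    per variable, ranked by their max-marginal scores.

    max_marginals: (num_vars, num_states) scores used to rank the states.
    ground_truths: (num_samples, num_vars) integer assignments, e.g. drawn by
                   Gibbs sampling from the joint distribution.
    """
    kept = np.argsort(-max_marginals, axis=1)[:, :k]   # top-k states per variable
    errors = 0
    for truth in ground_truths:
        survives = all(truth[v] in kept[v] for v in range(len(truth)))
        errors += 0 if survives else 1
    return errors / len(ground_truths)

# Toy usage with random scores and random "ground truth" samples.
rng = np.random.default_rng(3)
scores = rng.normal(size=(10, 6))
samples = rng.integers(0, 6, size=(1000, 10))
print(filtering_error_rate(scores, samples, k=3))
</pre>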
The charts below show the