Stoyanov et al 2011: Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure

Citation

Veselin Stoyanov and Alexander Ropson and Jason Eisner, "Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure", in Proceedings of AISTATS, 2011.

Online version

Stoyanov et al 2011

Summary

This is an interesting paper that presents a loopy Belief Propagation and Back Propagation method for Empirical Risk Minimization (ERM), which is an alternative training method for general problems in Probabilistic Graphical Models (e.g. possible applications include Named Entity Recognition, Word Alignment, Shallow Parsing, and Constituent Parsing). The paper formulates the approximate learning problem as an ERM problem, rather than MAP estimation. The authors show that by replacing MAP estimation, the ERM based estimation parameters significantly reduce loss on the test set, even by an order of magnitude.

Brief Description of the method

This paper first formulates the parameter estimation problem as training and decoding on Markov random fields (MRFs), then discusses the use of Belief Propagation to do inference on MRFs and the use of Back Propagation to calculate the gradient of the empirical risk. In this section, we will first summarize the Back Propagation method they use to compute the gradient of the empirical risk, then briefly describe the numerical optimization method for this task. Regarding the detailed Belief Propagation and Empirical Risk Minimization methods for general probabilistic graphical models, please refer to their corresponding method page.

Back-Propagation

Assume the task is to do ERM estimation to obtain the model parameter $\theta$ . The standard maximum log-likelihood is

\theta ^{*}={\underset {\theta }{\operatorname {argmax} }}LogL(\theta )={\underset {\theta }{\operatorname {argmax} }}\sum _{i}logp_{\theta }(x_{i},y_{i})

Instead of doing MLE training, the authors estimate the parameter $\theta$ from an empirical risk function

\!ER(\theta )={\frac {1}{n}}\sum _{i=1}^{n}L(f_{\theta }(x_{i}),y_{i}).

In order to derive the $\theta$ from above ER function, the authors propose to use a gradient-based optimizer. The basic idea of Back-Propagation here is to do Belief Propagation as the forward pass using a decoder function $d$ . The loss relative to the truth $y^{*}$ can be obtained as $l(y^{*},y^{i})$ . Then, the backward pass will calculate the partials of loss with respect to the hypothesis and next with respect to the marginal beliefs (with respect to the input parameter $\theta$ ). Regarding the backward pass, the ultimate goal is to compute the adjoint $\partial \theta _{j}$ for each parameter $\theta _{j}$ , which is the gradient. Assume V is the loss function ,and after we run the first forward pass, we need to differentiate the loss function to obtain the adjoints of the decoded output

\partial d(x_{i})=\partial V*{\frac {\partial V}{\partial d(x_{i})}}

The authors then propose to use the chain rule again to propagate backward through the decoder to obtain the adjoints of the beliefs

\partial b(x_{j})=\sum _{i}\partial d(x_{i})*{\frac {\partial d(x_{i})}{\partial b(x_{j})}}

We omit the detail steps of running backward pass here (can be found in the appendix of the original paper), but if each real-valued potential $\phi _{\alpha }(\chi _{\alpha })$ ( $\chi _{\alpha }$ means a specific assignment to variable $(X_{\alpha })$ ) is derived from $\theta$ by some function $\phi _{\alpha }(\chi _{\alpha })=f(\theta )$ , then adjoints of $\theta$ from the final $\phi$ adjoints can be computed as

\partial \theta _{i}=\sum _{\alpha ,\chi _{\alpha }}\partial \phi _{\alpha }(\chi _{\alpha })*{\frac {\partial f(\phi _{\alpha }(\chi _{\alpha }))}{\partial \theta _{i}}}

Numerical Optimization

The authors point out that even the objective function is locally bumpy, it is possible to use stochastic gradient descent to escape some small local optimums. However, the authors have also pointed out that when collecting 2nd-order information, they obtain better $\theta$ . They performed a stochastic meta-descent (SMD) method (Schraudolph, 1999) which maintains a separate positive gain adaptation $\eta _{i}$ for each optimization dimension $\theta _{i}$ . Thus, $\theta$ updates are scaled by $\eta _{i}$ .

Dataset and Experiment Settings

The authors generated synthetic MRFs using various parameters in the above table. In terms of the output, there are three types: (1)integer (2)fractional(soft) (3)distributional. In the following experiments, four loss functions are evaluated: (1) L1 loss (2)Mean squared error (3)F-measure (4)Conditional log-likelihood.

Experimental Results

The authors performed three major experiments. The first experiment is the overall experiment that compares the BP based ERM to the standard MAP estimation learning. The second experiment explores the model structure mismatch problem, and the last experiment shows the different BP approximation quality.

Overall Result

The overall experiment shows that when the exact MRF model structure is known, there is enough training data and BP is run to convergence. When comparing with standard learning, the MSE loss function was the best one, while most of other settings also significantly outperformed the standard learning setting.

Model Structure Mismatch

This experiment simulates the real situation where we need to address the issue of model structure mismatch. As shown in the above table, almost all ERM methods are significantly different from standard learning schemes, except for L1 loss when the perturbation is 30% and 40%.

Approximation Quality

Finally, to test the approximation quality of loopy BP, we can force the approximate algorithms to terminate early. As shown in the above table, when the quality of approximation decreases, the back-propagation training is still very robust.