Stoyanov et al 2011: Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure


Citation

Veselin Stoyanov and Alexander Ropson and Jason Eisner, "Empirical Risk Minimization of Graphical Model Parameters Given Approximate Inference, Decoding, and Model Structure", in Proceedings of AISTATS, 2011.

Online version

Stoyanov et al 2011

Summary

This paper presents a method that combines loopy Belief Propagation with Back Propagation to perform Empirical Risk Minimization (ERM), an alternative training approach for general problems in Probabilistic Graphical Models (possible applications include Named Entity Recognition, Word Alignment, Shallow Parsing, and Constituent Parsing). The paper formulates the approximate learning problem as an ERM problem rather than MAP estimation, and the authors show that the ERM-estimated parameters significantly reduce loss on the test set compared to MAP estimation, even by an order of magnitude.

Brief Description of the method

This paper first formulates the parameter estimation problem as training and decoding on Markov random fields (MRFs), then discusses the use of Belief Propagation to do inference on MRFs and the use of Back Propagation to calculate the gradient of the empirical risk. In this section, we first summarize the Back Propagation method used to compute that gradient, then briefly describe the numerical optimization method for this task. For the details of the Belief Propagation and Empirical Risk Minimization methods for general probabilistic graphical models, please refer to their corresponding method pages.

Back-Propagation

Assume the task is to do ERM estimation to obtain the model parameter $\theta$. The standard maximum log-likelihood estimate is

$\theta^{*} = \arg\max_{\theta} \sum_{i} \log p_{\theta}\big(y^{(i)} \mid x^{(i)}\big)$

Instead of doing MLE training, the authors estimate the parameter by minimizing an empirical risk function

$\theta^{*} = \arg\min_{\theta} \sum_{i} L\big(f_{\theta}(x^{(i)}),\, y^{(i)}\big)$

where $f_{\theta}$ denotes the full prediction pipeline (approximate inference followed by decoding) and $L$ is a task-specific loss.

In order to derive $\theta^{*}$ from the above empirical risk function, the authors propose to use a gradient-based optimizer. The basic idea of Back-Propagation here is to run Belief Propagation as the forward pass of the decoder function $f_{\theta}$: message passing produces the marginal beliefs $b$, which are then decoded into a hypothesis $\hat{y} = f_{\theta}(x)$. The loss relative to the truth $y^{*}$ can then be obtained as $L(\hat{y}, y^{*})$. The backward pass calculates the partials of the loss first with respect to the hypothesis, next with respect to the marginal beliefs, and finally with respect to the input parameters $\theta$. The ultimate goal of the backward pass is to compute the adjoint

$\bar{\theta} \;=\; \frac{\partial L}{\partial \theta} \;=\; \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial b} \cdot \frac{\partial b}{\partial \theta},$

which is exactly the gradient needed by the optimizer.
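
To make the forward/backward structure concrete, here is a minimal sketch (not the authors' implementation) of the same recipe on a toy problem: a few sweeps of sum-product message passing on a small chain MRF produce approximate marginal beliefs, a differentiable loss compares them to the gold labels, and reverse-mode automatic differentiation (JAX's jax.grad) supplies the adjoint $\bar{\theta} = \partial L / \partial \theta$ for a gradient step. The chain structure, shared pairwise potentials, squared-error loss, and step size are all illustrative assumptions; on a chain, BP is actually exact, whereas the paper's point is that the same gradient computation goes through when inference is loopy and approximate.

```python
# Minimal sketch (not the authors' code) of ERM training by backpropagating
# through a few iterations of sum-product message passing on a tiny chain MRF.
# Chain length, loss, and learning rate are illustrative assumptions.
import jax
import jax.numpy as jnp

N, K, T = 4, 2, 5          # 4 variables, 2 states each, 5 message-passing sweeps

def beliefs(theta, unary):
    """Forward pass: message passing -> approximate marginal beliefs.

    theta : (K, K) log pairwise potentials shared by all edges of the chain
    unary : (N, K) log unary potentials (the observation-dependent scores)
    """
    psi = jnp.exp(theta)                     # pairwise potentials
    phi = jnp.exp(unary)                     # unary potentials
    fwd = jnp.ones((N, K)) / K               # messages passed rightward into each node
    bwd = jnp.ones((N, K)) / K               # messages passed leftward into each node
    for _ in range(T):
        for i in range(1, N):                # left-to-right sweep
            m = (phi[i - 1] * fwd[i - 1]) @ psi
            fwd = fwd.at[i].set(m / m.sum())
        for i in range(N - 2, -1, -1):       # right-to-left sweep
            m = psi @ (phi[i + 1] * bwd[i + 1])
            bwd = bwd.at[i].set(m / m.sum())
    b = phi * fwd * bwd                      # unnormalized beliefs
    return b / b.sum(axis=1, keepdims=True)  # approximate marginals

def empirical_risk(theta, unary, gold):
    """Decode to soft marginals and measure a differentiable loss against the
    gold labels (mean squared error here, an illustrative choice)."""
    b = beliefs(theta, unary)
    target = jax.nn.one_hot(gold, K)
    return jnp.mean((b - target) ** 2)

# Backward pass: autodiff plays the role of the paper's back-propagation,
# producing the adjoint dL/dtheta through decoding and inference.
grad_fn = jax.grad(empirical_risk)

theta = jnp.zeros((K, K))
unary = jax.random.normal(jax.random.PRNGKey(0), (N, K))
gold = jnp.array([0, 1, 1, 0])
for _ in range(50):
    theta = theta - 0.5 * grad_fn(theta, unary, gold)   # plain gradient descent step
```

The paper derives the backward pass through the message-passing and decoding steps analytically; automatic differentiation is used above only as a convenient stand-in for the same chain-rule bookkeeping.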

Numerical Optimization

Dataset

Experimental Results

Related Papers

This paper is related to many other papers along three dimensions.