Sutton McCallum ICML 2007: Piecewise pseudolikelihood for efficient CRF training

Citation

Piecewise Pseudolikelihood for Efficient Training of Conditional Random Fields. By Charles Sutton and Andrew McCallum. In ICML 2007.

Online version

This paper is available here.

Summary

Discriminative training of graphical models is expensive if the cardinality of the variables is large. Pseudolikelihood generally reduces the cost of inference, but compromises on accuracy. Piecewise training, although accurate, is expensive in a similar way. The authors therefore maximize the pseudolikelihood of the piecewise model, combining the efficiency of the former with the accuracy of the latter.

Definition of Piecewise Pseudolikelihood

For a single instance <math>(\mathbf{x}, \mathbf{y})</math>,

<math>\ell_{\mathrm{PWPL}}(\theta; \mathbf{x}, \mathbf{y}) = \sum_{a} \sum_{s \in a} \log p_{\mathrm{LCL}}(y_s \mid \mathbf{y}_{a \setminus s}, \mathbf{x})</math>

Thus, piecewise pseudolikelihood is a sum of local conditional log-probabilities: one term for each variable <math>s</math> and each factor <math>a</math> in which it participates. The local probabilities are defined as

<math>p_{\mathrm{LCL}}(y_s \mid \mathbf{y}_{a \setminus s}, \mathbf{x}) = \frac{\Psi_a(y_s, \mathbf{y}_{a \setminus s}, \mathbf{x})}{\sum_{y'_s} \Psi_a(y'_s, \mathbf{y}_{a \setminus s}, \mathbf{x})}</math>
Therefore, the optimization function is

<math>\mathcal{O}(\theta) = \sum_{i} \ell_{\mathrm{PWPL}}(\theta; \mathbf{x}^{(i)}, \mathbf{y}^{(i)}) - \sum_{k} \frac{\theta_k^2}{2\sigma^2}</math>

where the second term is the standard Gaussian prior used to prevent overfitting.
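To make the objective concrete, the following is a minimal sketch of the PWPL objective for a linear-chain CRF with pairwise transition factors only; the parameterization (a single matrix of log-potentials, no observation features) and all names such as pwpl_objective are illustrative, not the authors' implementation.

<pre>
import numpy as np

def pwpl_objective(theta, sequences, sigma2=10.0):
    """PWPL for a linear-chain CRF with pairwise transition factors.
    theta is an (m x m) matrix of log-potentials, so that
    Psi_a(y_{t-1}, y_t) = exp(theta[y_{t-1}, y_t]); observation
    features are omitted to keep the sketch short."""
    ll = 0.0
    for y in sequences:                 # y: list of label indices
        for t in range(1, len(y)):      # one pairwise factor per position
            # log p_LCL(y_t | y_{t-1}): normalize over the right variable.
            row = theta[y[t - 1], :]
            ll += row[y[t]] - np.logaddexp.reduce(row)
            # log p_LCL(y_{t-1} | y_t): normalize over the left variable.
            col = theta[:, y[t]]
            ll += col[y[t - 1]] - np.logaddexp.reduce(col)
    # Gaussian prior: - sum_k theta_k^2 / (2 sigma^2)
    return ll - np.sum(theta ** 2) / (2.0 * sigma2)

# Toy usage: 3 labels, two short training sequences.
theta = np.random.default_rng(0).normal(size=(3, 3))
print(pwpl_objective(theta, [[0, 1, 2, 1], [2, 2, 0]]))
</pre>

Note that each pairwise factor contributes two local conditional terms, one per variable it touches, exactly as in the double sum above.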

Compared to standard piecewise training, which requires <math>O(m^K)</math> time per factor (where <math>m</math> is the maximum cardinality of a label variable and <math>K</math> is the maximum size of a factor), PWPL requires only <math>O(Km)</math>, since each local conditional normalizes over a single variable. Compared to pseudolikelihood, where each local term conditions on a variable's entire Markov blanket, each term in PWPL conditions only on the variable's neighbors within a single factor.
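The complexity gap is easy to see in code. Below is a hypothetical sketch contrasting the two normalizers for a single factor over <math>K</math> variables; log_psi is assumed to be a K-dimensional array of log-potentials.

<pre>
import itertools
import numpy as np

def piecewise_log_z(log_psi):
    """Standard piecewise normalizer: sum over ALL joint assignments
    to the factor's K variables -- O(m^K) terms."""
    total = -np.inf
    for assignment in itertools.product(*(range(m) for m in log_psi.shape)):
        total = np.logaddexp(total, log_psi[assignment])
    return total

def pwpl_log_z(log_psi, assignment, axis):
    """PWPL normalizer for one local conditional: fix every variable
    except one at its observed value and sum over that single
    axis -- O(m) terms, and at most K such sums per factor."""
    index = list(assignment)
    index[axis] = slice(None)   # free only the conditioned variable
    return np.logaddexp.reduce(log_psi[tuple(index)])
</pre>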

Experiments

The authors propose that PWPL performs better on small datasets, whereas pseudolikelihood performs better on large datasets. They verify this by generating data from a second-order HMM whose transition and emission probabilities are a linear combination of first-order and second-order models, <math>p_\lambda = \lambda p_2 + (1 - \lambda) p_1</math>; higher <math>\lambda</math> represents more complexity and greater deviation from the first-order assumption. For different values of <math>\lambda</math>, they generate 1000 sequences of length 25 for training and 1000 for testing, over 150 synthetic generating models. A first-order CRF is trained over these sets.
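A rough sketch of this kind of generator is below; the exact interpolation and sampling details in the paper may differ, only the transitions are interpolated here for brevity (the emissions would be mixed the same way), and sample_sequence and the distribution tables are illustrative.

<pre>
import numpy as np

def sample_sequence(lam, p1, p2, emit, length=25, rng=None):
    """Sample a label sequence whose transitions interpolate a
    first-order table p1[y_prev, y] and a second-order table
    p2[y_prev2, y_prev, y]:  p = lam * p2 + (1 - lam) * p1.
    Higher lam means a stronger deviation from first-order."""
    rng = rng or np.random.default_rng()
    n_labels = p1.shape[0]
    y = [rng.integers(n_labels), rng.integers(n_labels)]  # arbitrary start
    for _ in range(length - 2):
        probs = lam * p2[y[-2], y[-1]] + (1 - lam) * p1[y[-1]]
        y.append(rng.choice(n_labels, p=probs))
    x = [rng.choice(emit.shape[1], p=emit[t]) for t in y]  # observations
    return x, y
</pre>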

On average, PWPL performs identically to standard piecewise training. Although pseudolikelihood performs better than PWPL, the result is not statistically significant. However, when accuracy is compared as a function of training-set size, pseudolikelihood converges to a higher limit than PWPL.

Other experiments included POS tagging on the Penn Treebank, noun-phrase chunking, and named-entity recognition on CoNLL 2003. On all three tasks, PWPL requires significantly less training time.