Alternating Minimization

== Citation ==

{{MyCiteconference | booktitle = Statistics and Decisions | coauthors = G. Tusnady | date = 1984 | first = I. | last = Csiszar | title = Information Geometry and Alternating Minimization Procedures}}

http://www.mit.edu/~6.454/www_fall_2002/shaas/Csiszar.pdf

== Summary ==

Alternating minimization (AM) is used when we are jointly optimizing over two variables; call them <math>p</math> and <math>q</math>.  The basic premise behind AM is to hold one variable fixed, say <math>q</math>, and optimize over the other, <math>p</math>; we then hold <math>p</math> fixed and optimize over <math>q</math>.  This constitutes one cycle of the procedure.  AM is very similar to the EM algorithm, and in fact the original authors showed that EM is a special case of AM.
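
As a quick toy illustration (not from the paper), take <math>D(p, q) = (p-1)^2 + (q+2)^2 + \tfrac{1}{2}pq</math>, for which each coordinate-wise minimization has a closed form.  A minimal Python sketch of a few AM cycles, with an arbitrary starting point:

<pre>
# Toy alternating minimization on D(p, q) = (p - 1)^2 + (q + 2)^2 + 0.5 * p * q.
# Setting each partial derivative to zero gives closed-form coordinate updates.
def am_toy(n_cycles=20):
    p, q = 0.0, 0.0          # arbitrary initialization
    for _ in range(n_cycles):
        p = 1.0 - 0.25 * q   # argmin over p:  2 * (p - 1) + 0.5 * q = 0
        q = -2.0 - 0.25 * p  # argmin over q:  2 * (q + 2) + 0.5 * p = 0
    return p, q

print(am_toy())  # approaches the joint minimizer (1.6, -2.4)
</pre>

Because every step is an exact minimization over one variable, the objective value can never increase from one cycle to the next.
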
In more recent work, this [[Category::Method]] was used in [[SoftSupervisedTextClassification | Soft-Supervised Text Classification]] as the primary method for optimizing the objective function.

== Details ==

The alternating minimization algorithm attempts to solve a minimization problem of the following form:

Given sets <math>\mathcal{P}, \mathcal{Q}</math> and a function <math>D: \mathcal{P} \times \mathcal{Q} \to \mathbb{R}</math>, we would like to find

<math>\min_{(p, q) \in \mathcal{P} \times \mathcal{Q}} D(p, q)</math>.

AM does this by generating a sequence of iterates <math>(p^{(n)}, q^{(n)})</math> as follows:

<math>p^{(n)} = \arg\min_{p \in \mathcal{P}} D(p, q^{(n-1)})</math>, and <math>q^{(n)} = \arg\min_{q \in \mathcal{Q}} D(p^{(n)}, q)</math>.
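
A minimal sketch of this scheme in Python, assuming the two partial minimizations are supplied as functions; the names argmin_p, argmin_q, and the stopping rule are placeholders for illustration, not from the paper:

<pre>
# Generic alternating minimization skeleton.
# argmin_p(q) should return the minimizer over p of D(p, q) for fixed q;
# argmin_q(p) should return the minimizer over q of D(p, q) for fixed p.
def alternating_minimization(D, argmin_p, argmin_q, q0, n_iters=100, tol=1e-8):
    q, prev = q0, float("inf")
    for _ in range(n_iters):
        p = argmin_p(q)            # p^(n) = argmin_p D(p, q^(n-1))
        q = argmin_q(p)            # q^(n) = argmin_q D(p^(n), q)
        value = D(p, q)
        if prev - value < tol:     # exact partial minimization => D never increases
            break
        prev = value
    return p, q
</pre>

Monitoring <math>D(p^{(n)}, q^{(n)})</math> is a natural stopping criterion, since exact partial minimization makes this sequence non-increasing.
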
In order for AM to converge to the correct solution, certain conditions must be satisfied.  These conditions can be encapsulated in a property known as the five-point property.  A theorem in the original paper states the conditions for convergence:

'''Theorem (Convergence of AM)'''

If <math>p^{(n)} = \arg\min_{p \in \Delta^m} D(p, q^{(n-1)})</math>, <math>q^{(n)} = \arg\min_{q \in \Delta^m} D(p^{(n)}, q)</math>, and <math>q_i^{(0)} (y) > 0 \quad \forall y \in Y, \forall i</math>, then
* <math> D(p,q) + D(p,p^{(0)}) \geq D(p, q^{(1)}) + D(p^{(1)}, q^{(1)}) \quad \forall p, q \in \Delta^m </math>, and
* <math> \lim_{n \to \infty} D(p^{(n)}, q^{(n)}) = \inf_{p, q \in \Delta^m} D(p,q) </math>,
where <math>\Delta^m</math> denotes the set of ''m''-tuples of distributions, each distribution lying in the |Y|-dimensional probability simplex <math>\Delta</math>.  The first point is the five-point property, and the second is the convergence statement.

== Example ==

As an example, let us look at what Subramanya and Bilmes did in their [[SoftSupervisedTextClassification | Soft-Supervised Text Classification]] paper.  For them, <math>(p^{(n)}, q^{(n)})</math> is a sequence of sets of distributions, i.e., every <math>p^{(n)}</math> is a set of distributions, one for each vertex in the graph.  Their original formulation of the objective was:

<math> \min_{p} C_1(p) </math>, where <math>C_1(p) = \left[ \sum_{i=1}^l D_{KL} (r_i \parallel p_i) + \mu \sum_{i=1}^n \sum_j w_{ij}D_{KL}(p_i \parallel p_j) - \nu \sum_{i=1}^n H(p_i) \right]</math>.

In order to use alternating minimization, we introduce a second set of distributions <math>q</math> (in addition to <math>p</math>).  The new objective reads:

<math> \min_{p, q} C_2(p, q) </math>, where <math>C_2(p, q) = \left[ \sum_{i=1}^l D_{KL} (r_i \parallel q_i) + \mu \sum_{i=1}^n \sum_{j\in \mathcal{N}(i)} w'_{ij}D_{KL}(p_i \parallel q_j) - \nu \sum_{i=1}^n H(p_i) \right]</math>,

where we define a new similarity matrix <math>W' = W + \alpha \mathbf{I}_n</math>, with <math> \alpha \geq 0 </math>.  We thus have two distributions defined for every training point, and the optimization procedure encourages the two to approach each other.  Note that the smoothness regularizer (the third term, consisting of entropy) is defined on a different set of distributions than the first term, which aims to respect the labeled training data.  It is this choice that makes the update equations for <math>p^{(n)}</math> and <math>q^{(n)}</math> relatively simple.
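
To make the objective concrete, here is a direct (unoptimized) reading of <math>C_2</math> in Python.  The variable names and array shapes are assumptions for illustration, not code from the paper: <math>p</math>, <math>q</math>, and <math>r</math> are <math>n \times |Y|</math> arrays of row distributions, and <math>W'</math> is the <math>n \times n</math> modified similarity matrix.

<pre>
import numpy as np

def kl(a, b, eps=1e-12):
    """D_KL(a || b) for two discrete distributions."""
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))

def entropy(a, eps=1e-12):
    return float(-np.sum(a * np.log(a + eps)))

def C2(p, q, r, W_prime, l, mu, nu):
    """Relaxed objective C_2(p, q); p, q, r are n x |Y| arrays, W_prime is n x n."""
    n = p.shape[0]
    labeled = sum(kl(r[i], q[i]) for i in range(l))                # fit to seed label distributions
    smooth = sum(W_prime[i, j] * kl(p[i], q[j])
                 for i in range(n) for j in range(n)
                 if W_prime[i, j] > 0)                             # graph smoothness term
    ent = sum(entropy(p[i]) for i in range(n))                     # entropy regularizer
    return labeled + mu * smooth - nu * ent
</pre>
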
Assuming the distributions are initialized properly (in particular, <math>q_i^{(0)}(y) > 0</math> for all <math>i</math> and <math>y</math>), we can express the update equations as:

<math> p_i^{(n)} (y) = \frac{1}{\mathcal{Z}_i} \exp\left\{ \frac{\beta_i^{(n-1)}(y)}{\gamma_i}\right\}</math>, and
<math> q_i^{(n)} (y) = \frac{r_i(y) \delta(i \leq l) + \mu \sum_j w'_{ji} p_j^{(n)} (y)}{\delta(i \leq l) + \mu \sum_j w'_{ji}} </math>,

where <math> \gamma_i = \nu + \mu \sum_j w'_{ij} </math> and <math> \beta_i^{(n-1)} (y) = -\nu + \mu \sum_j w'_{ij} (\log q_j^{(n-1)} (y) -1) </math>.

The derivations for these updates can be obtained by using the definitions of <math>p^{(n)}</math> and <math>q^{(n)}</math> provided above.
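
A sketch of one AM cycle under these updates, using the same assumed array layout as in the sketch above; this is a reading of the formulas here, not code released with the paper.

<pre>
import numpy as np

def am_cycle(q_prev, r, W_prime, l, mu, nu, eps=1e-12):
    """One AM cycle: returns (p^(n), q^(n)) given q^(n-1).

    q_prev and r are n x |Y| arrays; rows of r beyond the first l labeled
    points are ignored because delta(i <= l) is zero there.
    """
    n, _ = q_prev.shape
    # p-update: p_i(y) proportional to exp(beta_i(y) / gamma_i)
    gamma = nu + mu * W_prime.sum(axis=1)                      # gamma_i = nu + mu * sum_j w'_ij
    beta = -nu + mu * W_prime.dot(np.log(q_prev + eps) - 1.0)  # beta_i(y)
    p = np.exp(beta / gamma[:, None])
    p /= p.sum(axis=1, keepdims=True)                          # normalization constant Z_i
    # q-update: weighted average of the seed label distribution and the neighbors' p
    delta = (np.arange(n) < l).astype(float)                   # delta(i <= l), 0-based indexing
    numer = r * delta[:, None] + mu * W_prime.T.dot(p)         # r_i(y) delta + mu sum_j w'_ji p_j(y)
    denom = delta + mu * W_prime.sum(axis=0)                   # delta + mu sum_j w'_ji
    q = numer / denom[:, None]
    return p, q
</pre>
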

== Related Papers ==

* [[RelatedPaper::Soft Supervised Text Classification | Subramanya & Bilmes, ''EMNLP 2008'', Soft Supervised Text Classification]]
* [[RelatedPaper::Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification | Subramanya & Bilmes, ''NIPS 2009'', Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification]]
* [[RelatedPaper::Information Geometry and Alternating Minimization Procedures | Csiszar & Tusnady, ''Statistics and Decisions 1984'', Information Geometry and Alternating Minimization Procedures]]
