Alternating Minimization
Citation
Information Geometry and Alternating Minimization Procedures, by I. Csiszar, G.Tusnady. In Statistics and Decisions, 1984.
I unfortunately cannot find this paper online.
Summary
Alternating minimization (AM) is used when we are optimizing two variables jointly. Let us call these two variables and . The basic premise behind AM is that we keep one variable constant, say and optimize the other variable, i.e., , and then we keep constant and optimize over . This constitutes one cycle of the optimization procedure. One may find that AM is very similar to the EM algorithm, and in fact it has been shown by the original authors that EM is a special case of AM.
In recent work, this Method was presented in Soft-Supervised Text Classification as the primary method of solving or optimizing the objective function.
Details
The alternating minimization algorithm attempts to solve a minimization problem of the following form:
given and a function , we would like to find: .
The way AM does this is by generating a sequence of variables as follows: .
The original paper has a prove that shows the sequence generated using the above single-variable minimizations converges to the minimum of , i.e., Failed to parse (unknown function "\math"): {\displaystyle \lim_{n /to \infty} D(p^{(n)}, q^{(n)}) = \inf_{p,q} D(p,q) <\math>. == Example == As an example, let us look at what Subramanya and Bilmes did in their [[SoftSupervisedTextClassification | Soft-Supervised Text Classification]] paper. For them, <math>(p^n, q^n)} is a sequence of sets of distributions, i.e., every is a set of distributions over all vertices in the graph. We know that their original formulation for their algorithm was:
, where .
In order to use alternating minimization, we introduce a second distribution over unlabeled data (apart from ), . The new objective reads:
, where , where we define a new similarity matrix , with . We thus have two distributions defined for every training point, and the optimization procedure encourages the two to approach each other. Note that the smoothness regularizer (the third term, consisting of entropy) is defined on a different set of distributions as the first term, which aims to respect the labeled training data. It is this choice which makes the update equations for and relatively simple.
Assuming we start with a distribution that is initialized properly, we can express the update equations as:
Failed to parse (syntax error): {\displaystyle p_i^{(n)} (y) = \frac{1}{\mathcal{Z}_i} exp\left\{ \frac{\beta_i^{(n-1)}(y)}{\gamma_i}} , and , where and .
The derivations for these updates can be obtained by using the definitions of and provided above.