GeneralizedIterativeScaling
This is one of the earliest methods used for inference in log-linear models. Though more sophisticated and faster methods have since been developed, it still provides useful insight into log-linear models.
What problem does it address?
The objective of this method is to find a probability function of the form

<math> (1) \quad \quad p_i = \mu \, \pi_i \prod_{s=1}^{d} \mu_s^{b_{si}}, \quad \quad i \in I </math>

satisfying the constraints

<math> (2) \quad \quad \sum_{i\in I} b_{si} \, p_i = k_s, \quad \quad s = 1, 2, \dots, d </math>

where <math>I</math> is the index set over which the probability distribution has to be determined, <math>p</math> is a probability distribution on <math>I</math>, <math>\pi</math> is a given subprobability function (it sums to at most 1, and <math>\pi_i \neq 0</math> for any <math>i</math>), the <math>b_{si}</math> are known constants, and the <math>k_s</math> are the prescribed constraint values.
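As a small illustrative instance (not from the original article, only meant to make the symbols concrete), take <math>I = \{1, \dots, 6\}</math> to be the faces of a die, <math>\pi_i = 1/6</math>, and a single feature <math>b_{1i} = i</math> whose expectation is required to be <math>4.5</math>. Constraint (2) then reads

<math> \sum_{i=1}^{6} i \, p_i = 4.5, </math>

and a solution of form (1) is <math>p_i = \mu \, \tfrac{1}{6} \, \mu_1^{i}</math> for suitable <math>\mu, \mu_1 > 0</math>.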
Since <math>\log \left( \frac{p_i}{\pi_i} \right) = \log \mu + \sum_{s=1}^{d} b_{si} \log \mu_s</math> is linear in <math>\log \mu</math> and the <math>\log \mu_s</math>, <math>p</math> belongs to the log-linear family.
Existence of a solution
If <math>p</math> of form (1) exists satisfying (2), then it minimizes

<math>KL[p, \pi] = \sum_{i\in I} p_i \log \left( \frac{p_i}{\pi_i} \right)</math>

and is unique. When the <math>\pi_i</math> are all equal (i.e., <math>\pi</math> is uniform), minimizing this divergence is the same as maximizing the entropy of <math>p</math>, so the result reduces to the following statement.
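For a uniform <math>\pi</math> on a finite index set the reduction is immediate:

<math> KL[p, \pi] = \sum_{i\in I} p_i \log p_i - \sum_{i\in I} p_i \log \frac{1}{|I|} = -H(p) + \log |I|, </math>

so minimizing <math>KL[p, \pi]</math> over <math>p</math> is exactly maximizing <math>H(p)</math>.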
Maximum entropy
If there exists a positive probability function of the form

<math> p_i = \mu \prod_{s=1}^{d} \mu_s^{b_{si}} </math>

satisfying (2), then it maximizes the entropy

<math> H(p) = - \sum_{i\in I} p_i \log(p_i). </math>
This statement is equivalent to saying that if the expected values of a set of features are known, then the probability distribution (if one exists) that maximizes the entropy, i.e., makes the fewest additional assumptions, is of form (1).
Algorithm
Given that the constraints (2) are satisfied by at least one subprobability function (this condition is known as consistency of the constraints), (1) and (2) can be expressed in the equivalent form

<math> (1') \quad \quad p_i = \pi_i \prod_{r=1}^{c} \lambda_r^{a_{ri}}</math>

<math> (2') \quad \quad \sum_{i\in I} a_{ri} p_i = h_r, \quad \quad r = 1, 2, \dots, c </math>

where <math>a_{ri} \geq 0, \quad \sum_{r=1}^c a_{ri} = 1, \quad h_r > 0, \quad \sum_{r=1}^c h_r = 1. </math>
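One standard way to carry out this reduction (the article does not spell it out, so the following is a reconstruction rather than part of the original page) is to rescale the features and add a slack feature: assuming <math>b_{si} \geq 0</math> and <math>k_s > 0</math>, choose <math>B \geq \max_i \sum_{s=1}^{d} b_{si}</math> large enough and set

<math> a_{si} = \frac{b_{si}}{B} \;(s \le d), \quad a_{(d+1)i} = 1 - \sum_{s=1}^{d} \frac{b_{si}}{B}, \quad h_s = \frac{k_s}{B} \;(s \le d), \quad h_{d+1} = 1 - \sum_{s=1}^{d} \frac{k_s}{B}, </math>

with <math>c = d+1</math>. The first <math>d</math> constraints of (2') are then just (2) divided by <math>B</math>, and together with them the last constraint forces <math>\sum_{i\in I} p_i = 1</math>.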
Define

<math>p_i^{(0)} = \pi_i</math>

Iterate

<math>p_i^{(n+1)} = p_i^{(n)} \prod_{r=1}^c \left( \frac{h_r}{h_r^{(n)}} \right)^{a_{ri}},</math>

where

<math>h_r^{(n)} = \sum_{i\in I} a_{ri} \, p_i^{(n)}. </math>
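A minimal NumPy sketch of this iteration (written for this page; the function name and the toy numbers are illustrative assumptions, not code from the reference):

<pre>
import numpy as np

def generalized_iterative_scaling(a, h, pi, n_iter=1000, tol=1e-10):
    """GIS in the reduced form (1')-(2').

    a  : (c, |I|) array, a_ri >= 0, every column sums to 1
    h  : (c,) array, h_r > 0, summing to 1
    pi : (|I|,) starting subprobability function, pi_i > 0
    Consistency of the constraints (2') is assumed.
    """
    p = np.asarray(pi, dtype=float).copy()         # p^(0) = pi
    for _ in range(n_iter):
        h_n = a @ p                                # h_r^(n) = sum_i a_ri p_i^(n)
        # p_i^(n+1) = p_i^(n) * prod_r (h_r / h_r^(n)) ** a_ri
        p = p * np.exp(a.T @ np.log(h / h_n))
        if np.max(np.abs(a @ p - h)) < tol:        # constraints (2') satisfied
            break
    return p

# Toy instance: 3 outcomes, 2 constraints (columns of a sum to 1).
a = np.array([[0.2, 0.5, 0.9],
              [0.8, 0.5, 0.1]])
h = np.array([0.6, 0.4])
pi = np.ones(3) / 3
p = generalized_iterative_scaling(a, h, pi)
print(p, a @ p)   # a @ p is close to h at convergence
</pre>

Note that each iterate stays of form (1'), because the update only multiplies <math>p^{(n)}</math> by factors of exactly that shape; a fixed point therefore satisfies both (1') and (2').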
Used in
Frietag 2000 Maximum Entropy Markov Models for Information Extraction and Segmentation
References
J. N. Darroch and D. Ratcliff (1972). Generalized Iterative Scaling for Log-Linear Models. The Annals of Mathematical Statistics, 43(5):1470–1480.