Difference between revisions of "Generalized Iterative Scaling"

From Cohen Courses
Jump to navigationJump to search
 
(12 intermediate revisions by the same user not shown)
Line 1: Line 1:
Being edited by Rui Correia
+
== The Method and When to Use it ==
 +
The Generalized Iterative Scaling (GIS) is a [[Category::method]] that searches the exponential family of a Maximum Entropy solution of the form:
  
== What is the method ==
+
<math>
The Generalized Iterative scaling algorithm is a [Category:Method]
+
P(x) = \prod_i \mu_i ^{f_i(x)}
the circumstances under which it is meant to be used
+
</math>
  
 +
where the <math>\mu_i</math>'s are some unknown constants to be found, that guarantees to converge to a solution. The <math>\mu_i</math>'s of the solution would be such that will make <math>P(x)</math> satisfy all the constraints <math>K_i</math>, of the equation:
  
you are expected to explain clearly what the method is
+
<math>
 +
E_Pf_i \overset{\underset{\mathrm{def}}{}}{=}  \sum_x P(x)f_i(x) = K_i
 +
</math>
  
and list papers that use it
+
GIS was proposed by Darroch Ratclif and is similar in form and computational cost to the expectation-maximization algorithm. Previous methods to estimate the maximum likelihood were limited to the restriction <math>f_i(x) \in {0,1}</math> so, GIS, as the name states, generalized so that these functions could be any real number.
  
inputs and resources,  
+
== The Algorithm ==
 +
GIS starts with arbitrary <math>\mu_i ^{(0)}</math> values, wich define the initial probability estimate:
  
outputs
+
<math>
 +
P^{(0)}(x) \overset{\underset{\mathrm{def}}{}}{=} \prod_i \mu_i ^{(0) f_i (x)}
 +
</math>
  
things the method is comparable to.  
+
Each next iteration is intended to create an estimate, that will match the constraints better than the last one. Each <math>j</math> iteration follows the steps:
  
Explain what motivations or assumptions underlie the method
+
*(1) Compute the expectations of all the <math>f_i</math>'s under the current estimate function, i.e., <math>\sum_x P^{(j)}(x)f_i (x)</math>
  
  
what makes it different from earlier methods.
 
  
 +
*(2) Compare the present values with the desired ones, updating the <math>\mu_i</math>'s in the following manner:
  
 +
<math>\mu_i ^{(j+1)} = \mu_i ^{(j)} . \frac {K_i}{E_{P^{(j)}}f_i}</math>
  
Advantages:
+
 
� incorporates feature selection,
+
 
scales up well in numbers of features w.r.t. other
+
*(3) Set the new estimate function:
methods,
+
 
resilient to feature dependence.
+
 
Disadvantages:
+
<math>P^{(j+1)} \overset{\underset{\mathrm{def}}{}}{=} \prod_i \mu_i ^{(j+1)f_i(x)}</math>
slow in training relative to other classi�cation
+
 
methods,
+
 
� problems with smoothing,
+
 
� binary and summation assumptions (not a big
+
*(4) If convergence or near-convergence is reached stop; otherwise go back to step (1)
deal)
+
 
 +
== Intrinsic characteristics ==
 +
 
 +
GIS has three advantages when compared to other methods: it is able to incorporate feature selection, scales up well in number of features and is resilient to feature dependence.
 +
 
 +
On the other hand GIS has problems with smoothing and is relatively slow in training when compared to other classification methods.
  
 
== Related Papers ==
 
== Related Papers ==
 +
 +
John N. Darroch, and Douglas Ratcliff. (1972). "Generalized Iterative Scaling for Log-Linear Models." In: The Annals of Mathematical Statistics, 43(5). [http://www.cs.nyu.edu/~roweis/csc412-2006/extras/gis.pdf]
 +
 +
Adwait Ratnaparkhi. (1996). "[[Ratnaparkhi EMNLP 1996|A Maximum Entropy Model for Part-of-Speech Tagging]]." In: Proceedings of EMNLP Conference (EMNLP 1996). [http://acl.ldc.upenn.edu/W/W96/W96-0213.pdf]
 +
 +
Andrew McCallum, Dayne Freitag, and Fernando Pereira. (2000). "[[Frietag 2000 Maximum Entropy Markov Models for Information Extraction and Segmentation|Maximum Entropy Markov Models for Information Extraction and Segmentation]]." In: Proceedings of ICML-2000. [http://www.cs.umass.edu/~mccallum/papers/memm-icml2000.ps]
 +
 +
Raymond Lau, Ronald Rosenfeld and Salim Roukos. "[[Lau et al HLT 1993|Adaptive Language Modeling Using the Maximum Entropy Principle]]". In Proceedings of the ARPA Human Language Technology Workshop, published as Human Language Technology, pages 108–113. Morgan Kaufmann, March 1993.

Latest revision as of 20:58, 29 September 2011

The Method and When to Use it

The Generalized Iterative Scaling (GIS) is a method that searches the exponential family of a Maximum Entropy solution of the form:

where the 's are some unknown constants to be found, that guarantees to converge to a solution. The 's of the solution would be such that will make satisfy all the constraints , of the equation:

GIS was proposed by Darroch Ratclif and is similar in form and computational cost to the expectation-maximization algorithm. Previous methods to estimate the maximum likelihood were limited to the restriction so, GIS, as the name states, generalized so that these functions could be any real number.

The Algorithm

GIS starts with arbitrary values, wich define the initial probability estimate:

Each next iteration is intended to create an estimate, that will match the constraints better than the last one. Each iteration follows the steps:

  • (1) Compute the expectations of all the 's under the current estimate function, i.e.,


  • (2) Compare the present values with the desired ones, updating the 's in the following manner:


  • (3) Set the new estimate function:



  • (4) If convergence or near-convergence is reached stop; otherwise go back to step (1)

Intrinsic characteristics

GIS has three advantages when compared to other methods: it is able to incorporate feature selection, scales up well in number of features and is resilient to feature dependence.

On the other hand GIS has problems with smoothing and is relatively slow in training when compared to other classification methods.

Related Papers

John N. Darroch, and Douglas Ratcliff. (1972). "Generalized Iterative Scaling for Log-Linear Models." In: The Annals of Mathematical Statistics, 43(5). [1]

Adwait Ratnaparkhi. (1996). "A Maximum Entropy Model for Part-of-Speech Tagging." In: Proceedings of EMNLP Conference (EMNLP 1996). [2]

Andrew McCallum, Dayne Freitag, and Fernando Pereira. (2000). "Maximum Entropy Markov Models for Information Extraction and Segmentation." In: Proceedings of ICML-2000. [3]

Raymond Lau, Ronald Rosenfeld and Salim Roukos. "Adaptive Language Modeling Using the Maximum Entropy Principle". In Proceedings of the ARPA Human Language Technology Workshop, published as Human Language Technology, pages 108–113. Morgan Kaufmann, March 1993.