Difference between revisions of "Maximum Entropy model"

From Cohen Courses
==copy and paste from wikipedia==
This is a [[category::method]].  
==head==
 
 
 
 
 
 
 
{{About|the probability theoretic principle|the classifier in [[machine learning]]|maximum entropy classifier|other uses|maximum entropy (disambiguation)}}
 
{{more footnotes|date=September 2008}}
 
 
 
{{Bayesian statistics}}
 
 
 
In [[Bayesian probability|Bayesian probability theory]], the '''principle of maximum entropy''' is an [[axiom]]. It states that, subject to precisely stated prior data, which must be a [[proposition]] that expresses ''[[#Testable information|testable information]]'', the [[probability distribution]] which best represents the current state of knowledge is the one with largest [[Entropy (information theory)|information-theoretical entropy]].
 
 
 
Let some precisely stated prior data or testable information about a probability distribution function be given. Consider the set of all trial probability distributions that encode the prior data.  Of those, the one that maximizes the [[information entropy]] is the proper probability distribution under the given prior data.
 
  
 
==History==
 
The principle was first expounded by [[E.T. Jaynes]] in two papers in 1957<ref>{{cite journal
|last=Jaynes |first=E. T. |authorlink = Edwin Thompson Jaynes
 
|year=1957
 
|title=Information Theory and Statistical Mechanics
 
|url=http://bayes.wustl.edu/etj/articles/theory.1.pdf
 
|journal=[[Physical Review]] Series II
 
|volume=106 |issue=4 |pages=620–630
 
|doi=10.1103/PhysRev.106.620 |mr=87305
 
|bibcode = 1957PhRv..106..620J }}</ref><ref>{{cite journal
 
|last=Jaynes |first=E. T. |authorlink = Edwin Thompson Jaynes
 
|year=1957
 
|title=Information Theory and Statistical Mechanics II
 
|url=http://bayes.wustl.edu/etj/articles/theory.2.pdf
 
|journal=[[Physical Review]] Series II
 
|volume=108 |issue=2 |pages=171–190
 
|doi=10.1103/PhysRev.108.171  |mr=96414
 
|bibcode = 1957PhRv..108..171J }}</ref> where he emphasized a natural correspondence between [[statistical mechanics]] and [[information theory]]. In particular, Jaynes offered a new and very general rationale for why the Gibbsian method of statistical mechanics works. He argued that the [[entropy]] of statistical mechanics and the [[information entropy]] of [[information theory]] are essentially the same thing. Consequently, [[statistical mechanics]] should be seen as just one particular application of a general tool of logical [[inference]] and information theory.
 
  
 
==Overview==
 
In most practical cases, the stated prior data or testable information is given by a set of [[conserved quantities]] (average values of some moment functions), associated with the [[probability distribution]] in question. This is the way the maximum entropy principle is most often used in [[statistical thermodynamics]]. Another possibility is to prescribe some [[symmetries]] of the probability distribution. An equivalence between the [[conserved quantities]] and corresponding [[symmetry groups]] implies the same level of equivalence for both these two ways of specifying the testable information in the maximum entropy method.
  
The maximum entropy principle is also needed to guarantee the uniqueness and consistency of probability assignments obtained by different methods, [[statistical mechanics]] and [[logical inference]] in particular. Strictly speaking, the trial distributions, which do not maximize the entropy, are actually not ''probability'' distributions.
  
The maximum entropy principle makes explicit our freedom in using different forms of [[prior information|prior data]]. As a special case, a uniform [[prior probability]] density (Laplace's [[principle of indifference]]) may be adopted. Thus, the maximum entropy principle is not just an ''alternative'' to the methods of inference of classical statistics, but it is an important conceptual generalization of those methods.
  
 
In ordinary language, the principle of maximum entropy can be said to express a claim of epistemic modesty, or of maximum ignorance. The selected distribution is the one that makes the least claim to being informed beyond the stated prior data, that is to say the one that admits the most ignorance beyond the stated prior data.
 
== Relevant Papers ==

{{#ask: [[UsesMethod::Expectation-maximization algorithm]]
| ?AddressesProblem
| ?UsesDataset
}}

==Testable information==

The principle of maximum entropy is useful explicitly only when applied to ''testable information''. A piece of information is testable if it can be determined whether a given distribution is consistent with it. For example, the statements

:The [[Expected value|expectation]] of the variable ''x'' is 2.87
 
and
 
:''p''<sub>2</sub> + ''p''<sub>3</sub> > 0.6
 
 
 
are statements of testable information.
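A minimal sketch of what "testable" means in practice: each statement becomes a predicate that a candidate distribution either passes or fails. The outcome values 1 to 4, the candidate probabilities, and the function names are assumptions made for this illustration only.

```python
import math

def satisfies_mean(values, probs, target=2.87, tol=1e-9):
    """Test the statement 'the expectation of x is 2.87'."""
    mean = sum(v * p for v, p in zip(values, probs))
    return math.isclose(mean, target, abs_tol=tol)

def satisfies_mass(probs, bound=0.6):
    """Test the statement 'p_2 + p_3 > 0.6' (1-based indices as in the text)."""
    return probs[1] + probs[2] > bound

values = [1, 2, 3, 4]                    # hypothetical outcomes of x
candidate = [0.05, 0.23, 0.52, 0.20]     # hypothetical distribution to test
uniform = [0.25, 0.25, 0.25, 0.25]

print(satisfies_mean(values, candidate), satisfies_mass(candidate))
print(satisfies_mean(values, uniform), satisfies_mass(uniform))
```

Both predicates return a definite answer for any distribution, which is exactly the property that makes the statements usable as constraints.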
 
 
 
Given testable information, the maximum entropy procedure consists of seeking the [[probability distribution]] which maximizes [[information entropy]], subject to the constraints of the information. This constrained optimization problem is typically solved using the method of [[Lagrange multiplier]]s.
 
 
 
Entropy maximization with no testable information takes place under a single constraint: the sum of the probabilities must be one. Under this constraint, the maximum entropy discrete probability distribution is the [[uniform distribution (discrete)|uniform distribution]],
 
 
 
:<math>p_i=\frac{1}{n}\ {\rm for\ all}\ i\in\{\,1,\dots,n\,\}.</math>
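As a brief sketch of why the uniform distribution results, one can apply the Lagrange multiplier method to the normalization constraint alone. The multiplier symbol <math>\mu</math> is introduced here only for this illustration.

:<math>L = -\sum_{i=1}^n p_i \log p_i + \mu\left(\sum_{i=1}^n p_i - 1\right), \qquad \frac{\partial L}{\partial p_i} = -\log p_i - 1 + \mu = 0 \quad\Rightarrow\quad p_i = e^{\mu - 1}.</math>

Every <math>p_i</math> therefore takes the same value, and the normalization constraint fixes that value at 1/''n''.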
 
 
 
The principle of maximum entropy can thus be seen as a generalization of the classical [[principle of indifference]], also known as the principle of insufficient reason.
 
 
 
==Applications==
 
The principle of maximum entropy is commonly applied in two ways to inferential problems:
 
 
 
===Prior probabilities===
 
The principle of maximum entropy is often used to obtain [[prior probability|prior probability distributions]] for [[Bayesian inference]]. Jaynes was a strong advocate of this approach, claiming the maximum entropy distribution represented the least informative distribution.<ref>{{cite journal
 
|last=Jaynes |first=E. T. |authorlink = Edwin Thompson Jaynes
 
|year=1968
 
|url=http://bayes.wustl.edu/etj/articles/brandeis.pdf
 
|format=PDF or [http://bayes.wustl.edu/etj/articles/brandeis.ps.gz PostScript]
 
|title=Prior Probabilities
 
|journal=IEEE Transactions on Systems Science and Cybernetics
 
|volume=4 |issue=3 |pages=227–241
 
|doi=10.1109/TSSC.1968.300117
 
}}</ref>
 
A large amount of literature is now dedicated to the elicitation of maximum entropy priors and links with channel coding.<ref>{{cite journal
 
|last=Clarke |first=B.
 
|year=2006
 
|title=Information optimality and Bayesian modelling
 
|journal=[[Journal of Econometrics]]
 
|volume=138 |issue=2 |pages=405–429
 
|doi=10.1016/j.jeconom.2006.05.003
 
}}</ref><ref>{{cite journal
 
|doi=10.2307/2669786
 
|last=Soofi |first=E.S.
 
|year=2000
 
|title=Principal Information Theoretic Approaches
 
|journal=[[Journal of the American Statistical Association]]
 
|volume=95 |issue=452 |pages=1349–1353
 
|mr=1825292 |jstor=2669786
 
}}</ref><ref>{{cite journal
 
|last=Bousquet |first=N.
 
|year=2008
 
|title=Eliciting vague but proper maximal entropy priors in Bayesian experiments
 
|journal=Statistical Papers
 
|volume=51
 
|issue=3
 
|doi=10.1007/s00362-008-0149-9
 
|pages=613–628
 
}}</ref>
 
 
 
===Maximum entropy models===
 
Alternatively, the principle is often invoked for model specification: in this case the observed data itself is assumed to be the testable information. Such models are widely used in [[natural language processing]]. An example of such a model is [[logistic regression]], which corresponds to the maximum entropy classifier for independent observations.
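The link to logistic regression can be sketched for the two-class case. The following illustration assumes a classifier of the exponential form ''p''(''y'') proportional to exp(score<sub>''y''</sub>), where each score stands for a weighted sum of feature values; the score values and function names are hypothetical.

```python
import math

def maxent_classifier(scores):
    """Normalize exp(score_y) over the labels: p(y) = exp(score_y) / Z."""
    shift = max(scores)                        # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scores]
    z = sum(weights)                           # normalization constant
    return [w / z for w in weights]

def sigmoid(t):
    """Logistic function used by binary logistic regression."""
    return 1.0 / (1.0 + math.exp(-t))

scores = [0.4, 1.7]                            # hypothetical scores for labels 0 and 1
probs = maxent_classifier(scores)

# With two labels, the exponential form depends only on the score difference.
print(probs[1], sigmoid(scores[1] - scores[0]))
```

With two labels the normalized exponential reduces to the logistic function of the score difference, which is the form used by binary logistic regression.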
 
 
 
==General solution for the maximum entropy distribution with linear constraints==
 
{{main|maximum entropy probability distribution}}
 
 
 
===Discrete case===
 
We have some testable information ''I'' about a quantity ''x'' taking values in {''x<sub>1</sub>'', ''x<sub>2</sub>'',..., ''x<sub>n</sub>''}. We express this information as ''m'' constraints on the expectations of the functions ''f<sub>k</sub>''; that is, we require our probability distribution to satisfy
 
 
 
:<math>\sum_{i=1}^n \Pr(x_i|I)f_k(x_i) = F_k \qquad k = 1, \ldots,m.</math>
 
 
 
Furthermore, the probabilities must sum to one, giving the constraint
 
 
 
:<math>\sum_{i=1}^n \Pr(x_i|I) = 1.</math>
 
 
 
The probability distribution with maximum information entropy subject to these constraints is
 
 
 
:<math>\Pr(x_i|I) = \frac{1}{Z(\lambda_1,\ldots, \lambda_m)} \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right]</math>
 
 
 
It is sometimes called the [[Gibbs distribution]]. The normalization constant is determined by
 
 
 
:<math> Z(\lambda_1,\ldots, \lambda_m) = \sum_{i=1}^n \exp\left[\lambda_1 f_1(x_i) + \cdots + \lambda_m f_m(x_i)\right],</math>
 
 
 
and is conventionally called the [[partition function (mathematics)|partition function]].  (Interestingly, the [[Pitman&ndash;Koopman theorem]] states that the necessary and sufficient condition for a sampling distribution to admit [[sufficiency (statistics)|sufficient statistics]] of bounded dimension is that it have the general form of a maximum entropy distribution.)
 
 
 
The λ<sub>k</sub> parameters are Lagrange multipliers whose particular values are determined by the constraints according to
 
 
 
:<math>F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1,\ldots, \lambda_m).</math>
 
 
 
These ''m'' simultaneous equations do not generally possess a [[closed form solution]], and are usually solved by [[Numerical analysis|numerical methods]].
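A minimal numerical sketch, assuming a single constraint with ''f''(''x'') = ''x'' on the faces of a die and a prescribed mean of 4.5. The example values, the function names, and the use of bisection are choices made for this illustration; bisection works here because the mean under the exponential form increases with the multiplier.

```python
import math

def maxent_distribution(xs, lam):
    """Exponential form p_i = exp(lam * x_i) / Z for the single feature f(x) = x."""
    weights = [math.exp(lam * x) for x in xs]
    z = sum(weights)                           # partition function Z(lam)
    return [w / z for w in weights]

def mean(xs, probs):
    """Expectation of x under the given probabilities."""
    return sum(x * p for x, p in zip(xs, probs))

def solve_multiplier(xs, target, lo=-50.0, hi=50.0, iters=200):
    """Find lam such that the constrained mean equals target, by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(xs, maxent_distribution(xs, mid)) < target:
            lo = mid                           # mean too small: increase lam
        else:
            hi = mid                           # mean too large: decrease lam
    return 0.5 * (lo + hi)

faces = [1, 2, 3, 4, 5, 6]
lam = solve_multiplier(faces, 4.5)
probs = maxent_distribution(faces, lam)
print(lam, probs, mean(faces, probs))
```

With several constraints the same idea applies, but a multidimensional root finder or optimizer is needed in place of bisection.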
 
 
 
===Continuous case===
 
For [[continuous distribution]]s, the simple definition of Shannon entropy ceases to be so useful (see ''[[differential entropy]]'').  Instead [[E.T. Jaynes|Edwin Jaynes]] (1963, 1968, 2003) gave the following formula, which is closely related to the [[relative entropy]].
 
 
 
:<math>H_c=-\int p(x)\log\frac{p(x)}{m(x)}\,dx</math>
 
 
 
where ''m''(''x''), which Jaynes called the "invariant measure", is proportional to the [[limiting density of discrete points]]. For now, we shall assume that it is known; we will discuss it further after the solution equations are given. 
 
 
 
A closely related quantity, the relative entropy, is usually defined as the [[Kullback-Leibler divergence]] of ''m'' from ''p'' (although it is sometimes, confusingly, defined as the negative of this).  The inference principle of minimizing this, due to Kullback, is known as the [[Kullback-Leibler divergence#Principle of minimum discrimination information|Principle of Minimum Discrimination Information]].
 
 
 
We have some testable information ''I'' about a quantity ''x'' which takes values in some [[interval (mathematics)|interval]] of the [[real numbers]] (all integrals below are over this interval). We express this information as ''m'' constraints on the expectations of the functions ''f<sub>k</sub>'', i.e. we require our probability density function to satisfy
 
 
 
:<math>\int p(x|I)f_k(x)dx = F_k \qquad k = 1, \dotsc,m</math>
 
 
 
And of course, the probability density must integrate to one, giving the constraint
 
 
 
:<math>\int p(x|I)dx = 1</math>
 
 
 
The probability density function with maximum ''H<sub>c</sub>'' subject to these constraints is
 
 
 
:<math>p(x|I) = \frac{1}{Z(\lambda_1,\dotsc, \lambda_m)} m(x)\exp\left[\lambda_1 f_1(x) + \dotsb + \lambda_m f_m(x)\right]</math>
 
 
 
with the [[partition function (mathematics)|partition function]] determined by
 
 
 
:<math> Z(\lambda_1,\dotsc, \lambda_m) = \int m(x)\exp\left[\lambda_1 f_1(x) + \dotsb + \lambda_m f_m(x)\right]dx</math>
 
 
 
As in the discrete case, the values of the <math>\lambda_k</math> parameters are determined by the constraints according to
 
 
 
:<math>F_k = \frac{\partial}{\partial \lambda_k} \log Z(\lambda_1,\dotsc, \lambda_m)</math>
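As a worked illustration of these equations, consider a special case chosen only for this example: ''x'' takes values in (0, ∞), the invariant measure is ''m''(''x'') = 1, and there is a single constraint ''f''<sub>1</sub>(''x'') = ''x'' with ''F''<sub>1</sub> = ''μ'' > 0. Then

:<math>Z(\lambda_1) = \int_0^\infty e^{\lambda_1 x}\,dx = -\frac{1}{\lambda_1} \quad (\lambda_1 < 0), \qquad F_1 = \frac{\partial}{\partial \lambda_1}\log Z(\lambda_1) = -\frac{1}{\lambda_1} = \mu,</math>

so that <math>\lambda_1 = -1/\mu</math> and

:<math>p(x|I) = \frac{1}{\mu}\exp\left(-\frac{x}{\mu}\right), \qquad x > 0,</math>

which is the exponential distribution with mean ''μ''.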
 
 
 
The invariant measure function ''m''(''x'') can be best understood by supposing that ''x'' is known to take values only in the [[bounded interval]] (''a'', ''b''), and that no other information is given. Then the maximum entropy probability density function is
 
 
 
:<math> p(x|I) = A \cdot m(x), \qquad a < x < b</math>
 
 
 
where ''A'' is a normalization constant. The invariant measure function is actually the prior density function encoding 'lack of relevant information'.  It cannot be determined by the principle of maximum entropy, and must be determined by some other logical method, such as the [[principle of transformation groups]] or [[Marginalization (probability)|marginalization theory]].
 
 
 
===Examples===
 
For several examples of maximum entropy distributions, see the article on [[maximum entropy probability distribution]]s.
 
 
 
==Justifications for the principle of maximum entropy==
 
Proponents of the principle of maximum entropy justify its use in assigning probabilities in several ways, including the following two arguments. These arguments take the use of [[Bayesian probability]] as given, and are thus subject to the same postulates.
 
 
 
===Information entropy as a measure of 'uninformativeness'===
 
Consider a '''discrete probability distribution''' among ''m'' mutually exclusive [[proposition]]s. The most informative distribution would occur when one of the propositions was known to be true. In that case, the information entropy would be equal to zero. The least informative distribution would occur when there is no reason to favor any one of the propositions over the others. In that case, the only reasonable probability distribution would be uniform, and then the information entropy would be equal to its maximum possible value, log ''m''. The information entropy can therefore be seen as a numerical measure which describes how uninformative a particular probability distribution is, ranging from zero (completely informative) to log ''m'' (completely uninformative).
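The two extremes can be checked directly. This is a small sketch using natural logarithms; the function name and the example distributions are assumptions of the illustration.

```python
import math

def entropy(probs):
    """Information entropy H(p) = -sum p_i log p_i, taking 0 log 0 as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

m = 4
certain = [1.0, 0.0, 0.0, 0.0]       # one proposition known to be true
uniform = [1.0 / m] * m              # no reason to favor any proposition
skewed = [0.7, 0.1, 0.1, 0.1]        # somewhere in between

print(entropy(certain), entropy(skewed), entropy(uniform), math.log(m))
```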
 
 
 
By choosing to use the distribution with the maximum entropy allowed by our information, the argument goes, we are choosing the most uninformative distribution possible. To choose a distribution with lower entropy would be to assume information we do not possess; to choose one with a higher entropy would violate the constraints of the information we ''do'' possess. Thus the maximum entropy distribution is the only reasonable distribution.
 
 
 
===The Wallis derivation===
 
The following argument is the result of a suggestion made by [[Graham Wallis]] to E. T. Jaynes in 1962.<ref name=Jaynes2003/> It is essentially the same mathematical argument used for the [[Maxwell-Boltzmann statistics]] in [[statistical mechanics]], although the conceptual emphasis is quite different. It has the advantage of being strictly combinatorial in nature, making no reference to information entropy as a measure of 'uncertainty', 'uninformativeness', or any other imprecisely defined concept. The information entropy function is not assumed ''a priori'', but rather is found in the course of the argument; and the argument leads naturally to the procedure of maximizing the information entropy, rather than treating it in some other way.
 
 
 
Suppose an individual wishes to make a probability assignment among ''m''  [[mutually exclusive]] propositions. She has some testable information, but is not sure how to go about including this information in her probability assessment. She therefore conceives of the following random experiment. She will distribute ''N'' quanta of probability (each worth 1/''N'') at random among the ''m'' possibilities. (One might imagine that she will throw ''N'' balls into ''m'' buckets while blindfolded. In order to be as fair as possible, each throw is to be independent of any other, and every bucket is to be the same size.) Once the experiment is done, she will check if the probability assignment thus obtained is consistent with her information. If not, she will reject it and try again. Otherwise, her assessment will be
 
 
 
:<math>p_i = \frac{n_i}{N}</math>
 
 
 
where ''p<sub>i</sub>'' is the probability of the ''i''<sup>th</sup> proposition, while ''n<sub>i</sub>'' is the number of quanta that were assigned to the ''i''<sup>th</sup> proposition (if the individual in our experiment carries out the ball throwing experiment, then ''n<sub>i</sub>'' is the number of balls that ended up in bucket ''i'').
 
 
 
Now, in order to reduce the 'graininess' of the probability assignment, it will be necessary to use quite a large number of quanta of probability. Rather than actually carry out, and possibly have to repeat, the rather long random experiment, the protagonist decides to simply calculate and use the most probable result. The probability of any particular result is the [[multinomial distribution]],
 
 
 
:<math>\Pr(\mathbf{p}) = W \cdot m^{-N}</math>
 
 
 
where
 
 
 
:<math>W = \frac{N!}{n_1 !n_2 !\dotso n_m!}</math>
 
 
 
is sometimes known as the multiplicity of the outcome.
 
 
 
The most probable result is the one which maximizes the multiplicity ''W''. Rather than maximizing ''W'' directly, the protagonist could equivalently maximize any monotonic increasing function of ''W''. She decides to maximize
 
 
 
:<math>\begin{align}
\frac{1}{N}\log W &= \frac{1}{N}\log \frac{N!}{n_1!\, n_2! \dotsm n_m!} \\
&= \frac{1}{N}\log \frac{N!}{(Np_1)!\, (Np_2)! \dotsm (Np_m)!} \\
&= \frac{1}{N}\left( \log N! - \sum_{i=1}^m \log\left((Np_i)!\right) \right)
\end{align}</math>
 
 
 
At this point, in order to simplify the expression, the protagonist takes the limit as <math>N\to\infty</math>, i.e. as the probability levels go from grainy  discrete values to smooth continuous values. Using [[Stirling's approximation]], she finds
 
 
 
:<math>\begin{align}
\lim_{N \to \infty}\left(\frac{1}{N}\log W\right)
&= \frac{1}{N}\left( N\log N - \sum_{i=1}^m Np_i\log (Np_i) \right) \\
&= \log N - \sum_{i=1}^m p_i\log (Np_i) \\
&= \log N - \log N \sum_{i=1}^m p_i - \sum_{i=1}^m p_i\log p_i \\
&= \left(1 - \sum_{i=1}^m p_i \right)\log N - \sum_{i=1}^m p_i\log p_i \\
&= - \sum_{i=1}^m p_i\log p_i \\
&= H(\mathbf{p})
\end{align}</math>
 
 
 
All that remains for the protagonist to do is to maximize entropy under the constraints of her testable information. She has found that the maximum entropy distribution is the most probable of all "fair" random distributions, in the limit as the probability levels go from discrete to continuous.
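The limit used in this derivation can be checked numerically. The sketch below compares (1/''N'') log ''W'' with ''H''('''p''') for increasing ''N''; the example distribution, the values of ''N'', and the function names are assumptions of the illustration, and the log-gamma function is used only to evaluate the factorials without overflow.

```python
import math

def log_multiplicity_per_quantum(counts):
    """(1/N) * log W, where W = N! / (n_1! ... n_m!) and N = sum of counts."""
    n_total = sum(counts)
    log_w = math.lgamma(n_total + 1) - sum(math.lgamma(n + 1) for n in counts)
    return log_w / n_total

def entropy(probs):
    """H(p) = -sum p_i log p_i (natural log), taking 0 log 0 as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

probs = [0.5, 0.3, 0.2]
gaps = []
for n_total in (10, 1000, 100000):
    counts = [round(p * n_total) for p in probs]    # n_i = N * p_i
    gap = entropy(probs) - log_multiplicity_per_quantum(counts)
    gaps.append(gap)
    print(n_total, gap)
```

The gap between the two quantities shrinks as ''N'' grows, in line with the use of Stirling's approximation above.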
 
 
 
===Compatibility with Bayes' theorem===
 
Giffin and Caticha (2007) state that [[Bayes' theorem]] and the principle of maximum entropy (MaxEnt) are completely compatible and can be seen as special cases of the method of maximum (relative) entropy. They state that this method reproduces every aspect of orthodox Bayesian inference methods, and that it opens the door to tackling problems that could not be addressed by either MaxEnt or orthodox Bayesian methods individually. Moreover, contributions by Lazar (2003) and Schennach (2005) show that frequentist relative-entropy-based inference approaches (such as [[empirical likelihood]] and [[exponentially tilted empirical likelihood]]; see e.g. Owen 2001 and Kitamura 2006) can be combined with prior information to perform Bayesian posterior analysis.
 
 
 
Jaynes stated [[Bayes' theorem]] was a way to calculate a probability, while maximum entropy was a way to assign a prior probability distribution.<ref name=Jaynes1988/>
 
 
 
==See also==
 
*[[Entropy maximization]]
 
*[[Maximum entropy classifier]]
 
*[[Maximum entropy probability distribution]]
 
*[[Maximum entropy spectral estimation]]
 
*[[Maximum entropy thermodynamics]]
 
 
 
==Notes==
 
{{reflist|refs=
 
 
 
<ref name=Jaynes1988>Jaynes, E. T. (1988) [http://bayes.wustl.edu/etj/articles/relationship.pdf "The Relation of Bayesian and Maximum Entropy Methods"], in ''Maximum-Entropy and Bayesian Methods in Science and Engineering (Vol. 1)'', Kluwer Academic Publishers, p. 25-29.</ref>
 
 
 
<ref name=Jaynes2003>Jaynes, E. T. (2003) ''Probability Theory: The Logic of Science'', Cambridge University Press. ISBN 978-0521592710 {{page needed|date=July 2012}}</ref>
 
 
 
 
}}
 
 
==References==
 
* {{cite book
 
|last=Jaynes |first=E. T. |authorlink = Edwin Thompson Jaynes
 
|year=1963
 
|url=http://bayes.wustl.edu/etj/node1.html
 
|chapter=Information Theory and Statistical Mechanics
 
|title=Statistical Physics
 
|editor=Ford, K. (ed.)
 
|publisher=Benjamin |location=New York |page=181
 
}}
 
* Jaynes, E. T., 1986 (new version online 1996), [http://bayes.wustl.edu/etj/articles/cmonkeys.pdf 'Monkeys, kangaroos and <math>N</math>'], in ''Maximum-Entropy and Bayesian Methods in Applied Statistics'', J. H. Justice (ed.), Cambridge University Press, Cambridge, p.&nbsp;26.
 
* Bajkova, A. T., 1992, ''The generalization of maximum entropy method for reconstruction of complex functions''. Astronomical and Astrophysical Transactions, V.1, issue 4, p.&nbsp;313-320.
 
* Giffin, A. and Caticha, A., 2007, [http://arxiv.org/abs/0708.1593 ''Updating Probabilities with Data and Moments'']
 
* Guiasu, S. and Shenitzer, A., 1985,  'The principle of maximum entropy',  The Mathematical Intelligencer, '''7'''(1), 42-48.
 
* Harremoës P. and Topsøe F., 2001, ''Maximum Entropy Fundamentals'', Entropy, 3(3), 191-226.
 
* Kapur, J. N.; and Kesavan, H. K., 1992, ''Entropy optimization principles with applications'', Boston: Academic Press. ISBN 0-12-397670-7
 
* Kitamura, Y., 2006, [http://cowles.econ.yale.edu/P/cd/d15b/d1569.pdf ''Empirical Likelihood Methods in Econometrics: Theory and Practice''], Cowles Foundation Discussion Papers 1569, Cowles Foundation, Yale University.
 
* Lazar, N., 2003, "Bayesian Empirical Likelihood", Biometrika, 90, 319-326.
 
* Owen, A. B., 2001, ''Empirical Likelihood'', Chapman and Hall.
 
* Schennach, S. M., 2005, "Bayesian Exponentially Tilted Empirical Likelihood", Biometrika, 92(1), 31-46.
 
* Uffink, Jos, 1995, [http://www.phys.uu.nl/~wwwgrnsl/jos/mepabst/mep.pdf 'Can the Maximum Entropy Principle be explained as a consistency requirement?'], Studies in History and Philosophy of Modern Physics '''26B''', 223-261.
 
 
==Further reading==
 
* Ratnaparkhi A. (1997) [http://repository.upenn.edu/cgi/viewcontent.cgi?article=1083&context=ircs_reports "A simple introduction to maximum entropy models for natural language processing"] Technical Report 97-08, Institute for Research in Cognitive Science, University of Pennsylvania. An easy-to-read introduction to maximum entropy methods in the context of natural language processing.
 
 
* {{cite PMID|18184793}} Open access article containing pointers to various papers and software implementations of Maximum Entropy Model on the net.
 
 
==External links==
 
* [http://homepages.inf.ed.ac.uk/s0450736/maxent.html Maximum Entropy Modeling]  Links to publications, software and resources
 
 
[[Category:Entropy and information]]
 
[[Category:Statistical theory]]
 
[[Category:Bayesian statistics]]
 
[[Category:Statistical principles]]
 
[[Category:Probability assessment]]
 
[[Category:Mathematical principles]]
 
 
[[de:Maximum-Entropie-Methode]]
 
[[es:Principio de máximo de entropía]]
 
[[fr:Principe d'entropie maximale]]
 
[[ja:最大エントロピー原理]]
 
[[pt:Máxima Entropia]]
 

Latest revision as of 09:16, 27 September 2012
