Difference between revisions of "Pair HMM"

From Cohen Courses
Jump to navigationJump to search
 
(8 intermediate revisions by the same user not shown)
Line 1: Line 1:
Stub [[Category::method]] page for pair HMMs, to be filled by [[User:Dmovshov|Dana Movshovitz-Attias]]
+
Pair HMMs are generative models that extend traditional [[HMM | HMMs]] by having two observation sequences instead of one. The pair HMM emits edit operations between pairs of observations, and by doing so, generates an alignment of the two sequences through the hidden states.
 +
 
 +
Pair HMMs models have been used for estimating string similarity, based on the [[String Edit Distance | string edit distance]] between the two observation sequences. They are also widely used for biological sequence analysis.
 +
 
 +
== Single-State Pair HMM ==
 +
A simple pair HMM can be defined over a single state which emits edit operations between two observation sequences, <math>x^T, y^V</math>, as well as an end symbol.
 +
* Let <math> E</math> be a set of edit operations, including deletions <math>(x, \epsilon)</math>, insertions <math>(\epsilon, y)</math> and substitutions <math>(x, y)</math>, and an end symbol <math>(\#)</math>.
 +
* Let <math>\theta_e</math> be the emission probability over the edit operation <math>e \in E</math>
 +
* Let <math>z^n</math> be a sequence of edit operations over <math>x</math> and <math>y</math>, and <math>r_k</math> is the hidden property of the edit operation <math>z_k</math>, representing the amount of symbols that have been consumed of each of the observation strings so far. <math>r_k \in \{1 .. T\}\times\{1 .. V\}</math>
 +
 
 +
== String Edit Distance based on a Pair HMM ==
 +
 
 +
Given a pair of input strings and a pair HMM, we can define two types of edit distances:
 +
* The Viterbi distance is defined as the most likely transduction between the two strings and it is given by
 +
<math>
 +
d(x^T, y^V) = - \log \arg\max_{Edit(z^n, x^T, y^V)} p(z^n | \Theta)
 +
</math>
 +
* The Stochastic distance is defined as the aggregation of all possible transductions between the two strings and is given by
 +
<math>
 +
d(x^T, y^V) = - \log \ p(x^T, y^V| \Theta)
 +
</math>
 +
 
 +
== Inference ==
 +
Inference on pair HMMs is done with versions of the [http://en.wikipedia.org/wiki/Forward_algorithm Forward] and Backward algorithms which track the possible edit operations over positions on both observable strings.
 +
 
 +
<math>
 +
\alpha(t,v) = p(x^t,y^v)
 +
= p(x^{t-1}, y^{v-1}) \cdot \theta_{(x_t,y_v)} + p(x^{t-1}, y^v) \cdot \theta_{(x_t ,\epsilon)} + p(x^t , y^{v-1}) \cdot \theta_{(\epsilon, y_v)}
 +
</math>
 +
 
 +
<math>
 +
= \alpha(t-1, v-1)\cdot\theta_{(x_t,y_v)} + \alpha( t-1, v ) \cdot \theta_{(x_t ,\epsilon)} + \alpha(t,v-1) \cdot \theta_{(\epsilon, y_v)}
 +
</math>
 +
 
 +
 
 +
And similarly:
 +
 
 +
<math>
 +
\beta(t,v) = p(x_{t+1}^T,y_{v+1}^V)
 +
</math>
 +
 
 +
== Multi-State Pair HMM ==
 +
The single-state definition can be extended to multiple states (for example, distincs states may be defined for Deletions, Insertions and Substitutions). The dynamic programming algorithms for inference and decoding used in the single-state case, are easily extended to the multi-state case by iterating over a third dimension (representing the states).
 +
 
 +
== Relevant Papers ==
 +
 
 +
{{#ask: [[UsesMethod::Pair HMM]]
 +
| ?AddressesProblem
 +
| ?UsesDataset
 +
}}
 +
 
 +
[[Category::method]]

Latest revision as of 08:44, 2 November 2011

Pair HMMs are generative models that extend traditional HMMs by having two observation sequences instead of one. The pair HMM emits edit operations between pairs of observations, and by doing so, generates an alignment of the two sequences through the hidden states.

Pair HMMs models have been used for estimating string similarity, based on the string edit distance between the two observation sequences. They are also widely used for biological sequence analysis.

Single-State Pair HMM

A simple pair HMM can be defined over a single state which emits edit operations between two observation sequences, , as well as an end symbol.

  • Let be a set of edit operations, including deletions , insertions and substitutions , and an end symbol .
  • Let be the emission probability over the edit operation
  • Let be a sequence of edit operations over and , and is the hidden property of the edit operation , representing the amount of symbols that have been consumed of each of the observation strings so far.

String Edit Distance based on a Pair HMM

Given a pair of input strings and a pair HMM, we can define two types of edit distances:

  • The Viterbi distance is defined as the most likely transduction between the two strings and it is given by

  • The Stochastic distance is defined as the aggregation of all possible transductions between the two strings and is given by

Inference

Inference on pair HMMs is done with versions of the Forward and Backward algorithms which track the possible edit operations over positions on both observable strings.


And similarly:

Multi-State Pair HMM

The single-state definition can be extended to multiple states (for example, distincs states may be defined for Deletions, Insertions and Substitutions). The dynamic programming algorithms for inference and decoding used in the single-state case, are easily extended to the multi-state case by iterating over a third dimension (representing the states).

Relevant Papers

method