Difference between revisions of "McCallum Bellare and Pereira 2005 A Conditional Random Field for Discriminatively-Trained Finite-State String Edit Distance"

Revision as of 18:41, 29 November 2011

Citation

Andrew McCallum, Kedar Bellare and Fernando Pereira. A Conditional Random Field for Discriminatively-Trained Finite-State String Edit Distance. Conference on Uncertainty in AI (UAI'05), 2005.

Online

PDF version.

Summary

This paper presents a discriminative string edit CRF. Conditional random fields outperform generative models for string edit distance prediction (such as those used in the work of Ristad and Yianilos and Bilenko and Mooney) since they are able to use complex, arbitrary features over the input strings. As in the case of the generative models, this method does not require to specify the edit sequence (alignment) of the training strings, however, it does require negative input instances.

The results show that the CRF approach outperforms the generative models on all but one of the estimated classes.

Method

The CRF presented in this paper is based on a finite state machine (FST) model with a single start state and two sets of disjoint states, $S_{1}$ and $S_{0}$ , representing 'matching' and 'non-matching' string pairs, respectively.

An alignment under this model is a four-tuple

$a=\langle e,ix,jy,q\rangle$

where,

$e=e_{1},\dots ,e_{k}$ . These indicate the sequence of edit-operations for this alignment.

Each edit operation $e_{p}$ consumes a portion of either one or both of the input strings, up to position $i_{p}$ or $j_{p}$ .

$ix=i_{1},\dots ,i_{k}$ . This is the non-decreasing positions in the input string x, associated with the edit operations in $e$ .
$jy=j_{1},\dots ,j_{k}$ . This is the non-decreasing positions in the input string y, associated with the edit operations in $e$ .
$q=q_{1},\dots ,q_{k}$ . This is the set of states in the FST models of this alignment. By construction, either $q\subseteq S_{1}$ (indicating a sequence of match states), or $q\subseteq S_{0}$ (indicating a sequence of non-match states).

Let $a_{p}=\langle e_{p},i_{p},j_{p},q_{p}\rangle$ , be the $p$ -th position in the alignment.

Given the strings x, and y, the probability of an alignment a is given by

$p(\mathbf {a} ||\mathbf {x} ,\mathbf {y} )={\frac {1}{Z_{x,y}}}\prod _{i=1}^{|a|}\Phi (a_{i-1},a_{i},\mathbf {x} ,\mathbf {y} )$

@@ Line 10: / Line 10: @@
 This [[Category::paper]] presents a discriminative string edit CRF. Conditional random fields outperform generative models for [[AddressesProblem::String Edit Distance | string edit distance]] prediction (such as those used in the work of [[RelatedPaper::Ristad_and_Yianilos_1997_Learning_String_Edit_Distance | Ristad and Yianilos]] and [[RelatedPaper::Bilenko_and_Mooney_2003_Adaptive_duplicate_detection_using_learnable_string_similarity_measures | Bilenko and Mooney]]) since they are able to use complex, arbitrary features over the input strings. As in the case of the generative models, this method does not require to specify the edit sequence (alignment) of the training strings, however, it does require negative input instances.
-The CRF presented in this paper is based on a finite state machine model with a single start state and two sets of disjoint states representing 'matching' and 'non-matching' string pairs.
+The results show that the CRF approach outperforms the generative models on all but one of the estimated classes.
-== Methods ==
+== Method ==
+The CRF presented in this paper is based on a finite state machine (FST) model with a single start state and two sets of disjoint states, <math>S_1</math> and <math>S_0</math>, representing 'matching' and 'non-matching' string pairs, respectively.
+An alignment under this model is a four-tuple
+<math>
+a = \langle e,ix,jy,q  \rangle
+</math>
+where,
+* <math>e=e_1, \dots, e_k</math>. These indicate the sequence of edit-operations for this alignment.
+Each edit operation <math>e_p</math> consumes a portion of either one or both of the input strings, up to position <math>i_p</math> or <math>j_p</math>.
+* <math>ix=i_1, \dots, i_k</math>. This is the non-decreasing positions in the input string '''x''', associated with the edit operations in <math>e</math>.
+* <math>jy=j_1, \dots, j_k</math>. This is the non-decreasing positions in the input string '''y''', associated with the edit operations in <math>e</math>.
+* <math>q=q_1, \dots, q_k</math>. This is the set of states in the FST models of this alignment. By construction, either <math> q \subseteq S_1 </math> (indicating a sequence of match states), or <math> q \subseteq S_0 </math> (indicating a sequence of non-match states).
+Let <math> a_p = \langle e_p,i_p,j_p,q_p  \rangle </math>, be the <math>p</math>-th position in the alignment.
+Given the strings '''x''', and '''y''', the probability of an alignment '''a''' is given by
+<math>
+p(\mathbf{a}||\mathbf{x},\mathbf{y}) = \frac{1}{Z_{x,y}}\prod^{|a|}_{i=1} \Phi(a_{i-1},a_i,\mathbf{x},\mathbf{y})
+</math>

Difference between revisions of "McCallum Bellare and Pereira 2005 A Conditional Random Field for Discriminatively-Trained Finite-State String Edit Distance"

Revision as of 18:41, 29 November 2011

Contents

Citation

Online

Summary

Method

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools