McCallum Bellare and Pereira 2005 A Conditional Random Field for Discriminatively-Trained Finite-State String Edit Distance
Citation
Andrew McCallum, Kedar Bellare and Fernando Pereira. A Conditional Random Field for Discriminatively-Trained Finite-State String Edit Distance. Conference on Uncertainty in AI (UAI'05), 2005.
Online
Summary
This paper presents a discriminative string edit CRF. The authors show that conditional random fields outperform generative models for string edit distance prediction (such as the HMM and Pair HMM based methods described in Ristad and Yianilos and Bilenko and Mooney), since they are able to use complex, arbitrary features over the input strings. As with the generative models, this method does not require the edit sequence (alignment) of the training strings to be specified; however, it does require negative training instances.
The results show that the CRF approach outperforms the generative models on all but one of the evaluated datasets.
Method
The model
The CRF presented in this paper is based on a finite state machine (FSM) model with a single start state and two disjoint sets of states, <math>S_1</math> and <math>S_0</math>, representing 'matching' and 'non-matching' string pairs, respectively.
An alignment under this model is a four-tuple <math>a = \langle e, ix, jy, q \rangle</math>, where:
- <math>e=e_1, \dots, e_k</math>. These indicate the sequence of edit operations for this alignment.
Each edit operation <math>e_p</math> consumes a portion of either one or both of the input strings, up to position <math>i_p</math> or <math>j_p</math>.
- <math>ix=i_1, \dots, i_k</math>. These are non-decreasing positions in the input string x, associated with the edit operations in <math>e</math>.
- <math>jy=j_1, \dots, j_k</math>. These are non-decreasing positions in the input string y, associated with the edit operations in <math>e</math>.
- <math>q=q_1, \dots, q_k</math>. This is the set of states in the FSM model of this alignment. By construction, either <math>q \subseteq S_1</math> (indicating a sequence of match states) or <math>q \subseteq S_0</math> (indicating a sequence of non-match states).
Let <math>a_p = \langle e_p, i_p, j_p, q_p \rangle</math> be the <math>p</math>-th position in the alignment.
Given the strings x and y, the probability of an alignment a is given by

<math>p(a \mid x, y) = \frac{1}{Z_{x,y}} \prod_{p=1}^{k} \Phi(a_{p-1}, a_p, x, y)</math>
where <math>Z_{x,y}</math> is a normalizer and <math>\Phi</math> is the following potential function, parametrized as an exponential of a linear scoring function

<math>\Phi(a_{p-1}, a_p, x, y) = \exp\left( \Lambda \cdot f(a_{p-1}, a_p, x, y) \right)</math>

where <math>f</math> is a vector of feature functions over its arguments and <math>\Lambda</math> is the corresponding vector of model parameters.
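To make the parametrization concrete, the following is a minimal Python sketch of such a step potential. The feature functions, their names, and the operation set ('subst', 'insert', 'delete') are invented for illustration; the paper's actual feature set is far richer and ranges over the entire input strings.

<pre>
import math

# Illustrative feature functions for one alignment step; stand-ins for the
# paper's arbitrary features over the input strings x and y.
def step_features(op, i, j, x, y):
    """f(a_{p-1}, a_p, x, y) for one step; i and j count the characters of
    x and y consumed so far (so indices i-1, j-1 are the last consumed)."""
    subst_match    = 1.0 if op == "subst" and x[i - 1] == y[j - 1] else 0.0
    subst_mismatch = 1.0 if op == "subst" and x[i - 1] != y[j - 1] else 0.0
    insert         = 1.0 if op == "insert" else 0.0
    delete         = 1.0 if op == "delete" else 0.0
    return [subst_match, subst_mismatch, insert, delete]

def potential(lambdas, op, i, j, x, y):
    """Phi = exp(Lambda . f): the exponentiated linear score of one step."""
    f = step_features(op, i, j, x, y)
    return math.exp(sum(l_k * f_k for l_k, f_k in zip(lambdas, f)))
</pre>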
Inference
The probability of a match (or mismatch) is given by marginalizing over the alignments whose states lie in the corresponding state set (<math>S_1</math> or <math>S_0</math>):

<math>p(z=1 \mid x, y) = \sum_{a :\, q \subseteq S_1} p(a \mid x, y)</math>
This is calculated with dynamic programming, giving a Forward-Backward-like match estimate. Alternatively, a Viterbi estimate can be obtained by computing the max-product instead of the sum-product.
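As a rough illustration of this dynamic program, the sketch below collapses each state set to a single state with its own parameter vector (a drastic simplification of the general FSM) and reuses the potential function from the previous sketch; setting use_max=True yields the max-product (Viterbi) variant.

<pre>
def alignment_mass(lambdas, x, y, use_max=False):
    """Sum-product DP over the edit lattice: alpha[i][j] is the total
    potential of all edit sequences consuming x[:i] and y[:j]. With
    use_max=True this becomes the max-product (Viterbi) recursion."""
    n, m = len(x), len(y)
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    combine = max if use_max else sum
    for i in range(n + 1):
        for j in range(m + 1):
            terms = []
            if i > 0 and j > 0:  # substitution consumes a character of each string
                terms.append(alpha[i - 1][j - 1] * potential(lambdas, "subst", i, j, x, y))
            if j > 0:            # insertion consumes a character of y only
                terms.append(alpha[i][j - 1] * potential(lambdas, "insert", i, j, x, y))
            if i > 0:            # deletion consumes a character of x only
                terms.append(alpha[i - 1][j] * potential(lambdas, "delete", i, j, x, y))
            if terms:
                alpha[i][j] = combine(terms)
    return alpha[n][m]

def match_probability(lam_match, lam_nonmatch, x, y):
    """p(z=1 | x, y): mass of alignments through the match states, divided
    by the total mass Z_{x,y} over both state sets."""
    z1 = alignment_mass(lam_match, x, y)
    z0 = alignment_mass(lam_nonmatch, x, y)
    return z1 / (z1 + z0)
</pre>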
Learning
The training data consists of string pairs <math>\langle x^{(j)}, y^{(j)} \rangle</math> with corresponding labels <math>z^{(j)} \in \{1,0\}</math>, indicating 'match' or 'mismatch'.
The model parameters are estimated using maximum likelihood with penalization in the form of a zero-mean Gaussian prior: <math>\sum_k \lambda_k^2/\sigma^2</math>.
The penalized log likelihood is given by

<math>\mathcal{L} = \sum_j \log p(z^{(j)} \mid x^{(j)}, y^{(j)}) - \sum_k \frac{\lambda_k^2}{\sigma^2}</math>
This is maximized using EM, with the M-step using BFGS for optimization.
The initial model parameters are manually tuned in order to avoid local maxima.
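Below is a toy end-to-end training sketch building on the match_probability function above. For brevity it maximizes the penalized conditional likelihood directly with BFGS (via scipy, using finite-difference gradients) rather than the paper's EM procedure; the data, feature count, and variance are invented placeholders.

<pre>
import math
import numpy as np
from scipy.optimize import minimize

def penalized_nll(theta, pairs, labels, sigma2=10.0, n_feats=4):
    """Negative penalized log-likelihood:
    -sum_j log p(z^(j) | x^(j), y^(j)) + sum_k lambda_k^2 / sigma^2."""
    lam1, lam0 = theta[:n_feats], theta[n_feats:]
    ll = 0.0
    for (x, y), z in zip(pairs, labels):
        p1 = match_probability(lam1, lam0, x, y)
        ll += math.log(p1 if z == 1 else 1.0 - p1)
    return -ll + float(np.sum(theta ** 2)) / sigma2

# Invented toy data: one matching and one non-matching pair.
pairs  = [("street", "strret"), ("street", "avenue")]
labels = [1, 0]

# The paper notes that starting parameters are hand-tuned to avoid poor
# local maxima; zeros are used here purely as a placeholder.
theta0 = np.zeros(8)
result = minimize(penalized_nll, theta0, args=(pairs, labels), method="BFGS")
lam_match, lam_nonmatch = result.x[:4], result.x[4:]
</pre>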
Results
The results of this work are compared with those of Bilenko and Mooney over six datasets. The CRF outperforms the previously used generative model in 5 out of 6 cases. Additionally, on all but one of the datasets, the Forward-Backward alignment score outperforms the Viterbi alignment score.