Difference between revisions of "Bilenko and Mooney 2003 Adaptive duplicate detection using learnable string similarity measures"

From Cohen Courses
Jump to navigationJump to search
Line 15: Line 15:
 
# The first extends the [[AddressesProblem::String Edit Distance]] as suggested by [[RelatedPaper::Ristad_and_Yianilos_1997_Learning_String_Edit_Distance | Ristad and Yianilos]] to include affine gaps.
 
# The first extends the [[AddressesProblem::String Edit Distance]] as suggested by [[RelatedPaper::Ristad_and_Yianilos_1997_Learning_String_Edit_Distance | Ristad and Yianilos]] to include affine gaps.
 
# The second metric measures similarity based on unordered bags of words, using an [[UsesMethod::Support Vector Machines | SVM]] for training.
 
# The second metric measures similarity based on unordered bags of words, using an [[UsesMethod::Support Vector Machines | SVM]] for training.
 +
 +
== Methods used ==
 +
===String edit distance with affine gap ===
 +
The first method extends the [[AddressesProblem::String Edit Distance]] measure suggested by [[RelatedPaper::Ristad_and_Yianilos_1997_Learning_String_Edit_Distance | Ristad and Yianilos]] with an affine gap model, inspired by [[RelatedPaper::Needleman_and_Wunsch_1970_A_general_method_applicable_to_the_search_for_similarities_in_the_amino_acid_sequences_of_two_proteins | Needleman and Wunsch]]. In the affine model:
 +
:<math> cost(g) = s+e*l </math>
 +
where <math>s</math> is the cost of opening a gap, <math>e</math> is the cost of extending the gap and <math>l</math> is the gap length.
 +
 +
The formulation follows directly from [[RelatedPaper::Ristad_and_Yianilos_1997_Learning_String_Edit_Distance | Ristad and Yianilos]], modeling the string edit distance as an alignment problem. The affine-gap alignment is modeled as a stochastic transducer with four states: (1) <math>M</math>, represents a match (or substitution between characters), (2) <math>D</math>, represents a deletion, (3) <math>I</math>, represents an insertion, and (4) <math>\#</math>, represents the end of the alignment.
 +
 +
This model successfully represents pairs of short strings with minor local variations, but is not as useful for longer strings.
 +
 +
=== Vector space similarity ===
 +
The second measure attempts to capture the relative importance of tokens and so to capture more global variations between long strings. In this model, a pair of strings, <math>x</math> and <math>y</math> is represented by a vector, <math>p</math>, of unordered bags of words with TF-IDF based weights.
 +
:<math>p^{(x,y)}_i = \frac{x_i y_i}{||x|| \cdot ||y||}</math>

Revision as of 13:59, 30 September 2011

... Under construction by Dana Movshovitz-Attias

Citation

Bilenko, M. and Mooney, R.J., Adaptive duplicate detection using learnable string similarity measures.Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp.39--48, 2003.

Online

Summary

This paper addresses the problem of Duplicate Document Detection by using string similarity measures. In contrast to previous methods that used generic or manually tuned distance metrics, in this paper, the authors suggest using learnable (trainable) text distance functions and training a specific function for different data fields. Such specialized function can capture a unique notion of similarity as is relevant for the specific data represented by a specific field.

Two similarity metrics are suggested:

  1. The first extends the String Edit Distance as suggested by Ristad and Yianilos to include affine gaps.
  2. The second metric measures similarity based on unordered bags of words, using an SVM for training.

Methods used

String edit distance with affine gap

The first method extends the String Edit Distance measure suggested by Ristad and Yianilos with an affine gap model, inspired by Needleman and Wunsch. In the affine model:

where is the cost of opening a gap, is the cost of extending the gap and is the gap length.

The formulation follows directly from Ristad and Yianilos, modeling the string edit distance as an alignment problem. The affine-gap alignment is modeled as a stochastic transducer with four states: (1) , represents a match (or substitution between characters), (2) , represents a deletion, (3) , represents an insertion, and (4) , represents the end of the alignment.

This model successfully represents pairs of short strings with minor local variations, but is not as useful for longer strings.

Vector space similarity

The second measure attempts to capture the relative importance of tokens and so to capture more global variations between long strings. In this model, a pair of strings, and is represented by a vector, , of unordered bags of words with TF-IDF based weights.