Bilenko and Mooney 2003 Adaptive duplicate detection using learnable string similarity measures

From Cohen Courses

Citation

Bilenko, M. and Mooney, R.J. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 39-48, 2003.


Summary

This paper addresses the problem of Duplicate Document Detection using string similarity measures. In contrast to previous methods that used generic or manually tuned distance metrics, the authors suggest using learnable (trainable) text distance functions and training a specific function for each data field. Such a specialized function can capture the notion of similarity that is relevant for the data represented by that particular field.

Two similarity metrics are suggested:

  1. The first extends the String Edit Distance as suggested by Ristad and Yianilos to include affine gaps.
  2. The second metric measures similarity based on unordered bags of words, using an SVM for training.

Methods used

String edit distance with affine gap

The first method extends the String Edit Distance measure suggested by Ristad and Yianilos with an affine gap model, inspired by Needleman and Wunsch. In the affine model, the cost of a gap is:

cost(g) = s + e · l

where s is the cost of opening a gap, e is the cost of extending the gap, and l is the gap length.

The formulation follows directly from Ristad and Yianilos, modeling the string edit distance as an alignment problem. The affine-gap alignment is modeled as a stochastic transducer with four states: (1) M, which represents a match (or substitution between characters); (2) D, which represents a deletion; (3) I, which represents an insertion; and (4) #, which represents the end of the alignment.

This model successfully represents pairs of short strings with minor local variations, but is not as useful for longer strings.
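The affine-gap alignment above can be sketched as a Gotoh-style dynamic program with one cost matrix per transducer state. Note this is a minimal illustration with fixed, hand-picked costs (`sub_cost`, `gap_open`, `gap_extend` are assumptions here); in the paper the corresponding edit operation costs are learned from training pairs via EM rather than fixed.

```python
def affine_gap_distance(a, b, sub_cost=1.0, gap_open=2.0, gap_extend=0.5):
    """Edit distance with affine gaps: a gap of length l costs gap_open + gap_extend * l.

    Three matrices mirror the transducer states: M (match/substitution),
    D (deletion, gap in b), I (insertion, gap in a).
    """
    INF = float("inf")
    n, m = len(a), len(b)
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    I = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):  # leading gap in b
        D[i][0] = gap_open + gap_extend * i
    for j in range(1, m + 1):  # leading gap in a
        I[0][j] = gap_open + gap_extend * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            M[i][j] = c + min(M[i - 1][j - 1], D[i - 1][j - 1], I[i - 1][j - 1])
            # Opening a new gap pays gap_open + gap_extend;
            # continuing an existing gap pays only gap_extend.
            D[i][j] = min(M[i - 1][j] + gap_open + gap_extend,
                          D[i - 1][j] + gap_extend,
                          I[i - 1][j] + gap_open + gap_extend)
            I[i][j] = min(M[i][j - 1] + gap_open + gap_extend,
                          I[i][j - 1] + gap_extend,
                          D[i][j - 1] + gap_open + gap_extend)
    return min(M[n][m], D[n][m], I[n][m])
```

Because extending a gap is cheaper than opening one, a single contiguous gap (e.g. a dropped word) is penalized less than the same number of scattered single-character gaps.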

Vector space similarity

The second measure attempts to capture the relative importance of tokens, and thereby more global variations between long strings. In this model, a pair of strings, s and t, is represented by a vector, p(s, t), derived from the unordered bags of words of the two strings, with TF-IDF based weights.

These vectors are classified by an SVM into one of two classes: S, an equivalent (duplicate) string pair, or D, a different string pair.
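A minimal sketch of how such a pair representation might be built. The smoothed IDF formula and the componentwise-product construction for the pair vector are assumptions of this sketch, not details taken from the paper:

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Map each tokenized document to a dict of token -> smoothed TF-IDF weight."""
    n = len(corpus)
    df = Counter()  # document frequency of each token
    for doc in corpus:
        df.update(set(doc))
    return [{t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf}
            for tf in (Counter(doc) for doc in corpus)]

def pair_vector(v_s, v_t):
    """Represent a string pair by the componentwise product of its TF-IDF vectors
    (an assumed construction): only tokens shared by both strings get a nonzero entry."""
    return {t: v_s[t] * v_t[t] for t in v_s.keys() & v_t.keys()}
```

Pair vectors like these, labeled as duplicate or non-duplicate, would then serve as training examples for the SVM, with the classifier's decision value usable as a similarity score.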

Combining similarity across multiple fields

The similarity metrics above provide a score for the similarity of two strings from the same database field. In order to generate a single similarity score between records that contain several fields, some method is needed for combining the individual field scores. A pair of records in a database with k fields and m distance metrics can be represented by a (k · m)-dimensional vector of the field similarities computed by the different metrics. An SVM is then used to classify record pairs into two classes, equivalent or different records.
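The assembly of the per-record feature vector can be sketched as follows. The two toy field metrics (an exact-match indicator and token-set Jaccard overlap) and the record field names are illustrative assumptions; the paper instead plugs in its learned string distances:

```python
def exact_match(s, t):
    """Toy field metric: 1.0 if the two field values are identical, else 0.0."""
    return 1.0 if s == t else 0.0

def token_jaccard(s, t):
    """Toy field metric: Jaccard overlap between the token sets of the two values."""
    a, b = set(s.split()), set(t.split())
    return len(a & b) / len(a | b) if a | b else 1.0

def record_pair_features(rec_a, rec_b, metrics):
    """Build the (k * m)-dimensional similarity vector for a record pair:
    one entry per (field, metric) combination, in a fixed field order."""
    return [metric(rec_a[f], rec_b[f])
            for f in sorted(rec_a)
            for metric in metrics]
```

Feature vectors built this way, labeled with known duplicate/non-duplicate record pairs, are what the record-level SVM is trained on.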

Experimental results

The results indicate that the learned affine-gap string edit distance outperforms the other tested methods, both for detecting duplicates at the single-field level and for records with multiple fields. The SVM-based vector-space model, however, led to decreased performance relative to the baseline of a cosine-similarity based vector-space model (the baseline did not use an SVM but simply computed the cosine similarity between records).