Bilenko and Mooney 2003 Adaptive duplicate detection using learnable string similarity measures

From Cohen Courses

Revision as of 17:36, 29 September 2011

... Under construction by Dana Movshovitz-Attias

Citation

Bilenko, M. and Mooney, R.J., Adaptive duplicate detection using learnable string similarity measures. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 39--48, 2003.

Online

Summary

This paper addresses the problem of Duplicate Document Detection using string similarity measures. Whereas previous methods relied on generic or manually tuned distance metrics, the authors propose learnable (trainable) text distance functions, with a separate function trained for each data field. Such a specialized function can capture the notion of similarity that is relevant to the data stored in that particular field.

Two similarity metrics are suggested:

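  1. The first extends the learnable String Edit Distance of Ristad and Yianilos to include affine gaps, so that a contiguous run of inserted or deleted characters is penalized less than the same number of isolated edits.
  2. The second metric measures similarity over unordered bags of words, using an SVM to learn the similarity function.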
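
To make the affine-gap idea in the first metric concrete, here is a minimal sketch of an edit distance with affine gap penalties. It is not the authors' model: the paper learns its parameters from labeled pairs, while this sketch uses fixed, hand-picked costs. The function name affine_gap_distance and the parameters sub_cost, gap_open, and gap_extend are assumptions made for illustration. It uses the standard three-table dynamic program, in which a gap of length k costs gap_open + (k - 1) * gap_extend.

```python
import math


def affine_gap_distance(s, t, sub_cost=1.0, gap_open=1.0, gap_extend=0.5):
    """Edit distance with affine gap costs (illustrative, hand-picked costs).

    The cost values are assumptions for this sketch; in a learnable
    setting they would be estimated from training data instead.
    """
    n, m = len(s), len(t)
    inf = math.inf
    # M: alignment ends with s[i-1] aligned to t[j-1] (match or substitution)
    # D: alignment ends with s[i-1] aligned to a gap (deletion)
    # I: alignment ends with t[j-1] aligned to a gap (insertion)
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    I = [[inf] * (m + 1) for _ in range(n + 1)]

    M[0][0] = 0.0
    for i in range(1, n + 1):
        D[i][0] = gap_open + (i - 1) * gap_extend  # one leading gap of length i
    for j in range(1, m + 1):
        I[0][j] = gap_open + (j - 1) * gap_extend  # one leading gap of length j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0.0 if s[i - 1] == t[j - 1] else sub_cost
            M[i][j] = min(M[i - 1][j - 1], D[i - 1][j - 1], I[i - 1][j - 1]) + c
            # Opening a new gap costs gap_open; extending an open one costs gap_extend.
            D[i][j] = min(M[i - 1][j] + gap_open,
                          I[i - 1][j] + gap_open,
                          D[i - 1][j] + gap_extend)
            I[i][j] = min(M[i][j - 1] + gap_open,
                          D[i][j - 1] + gap_open,
                          I[i][j - 1] + gap_extend)

    return min(M[n][m], D[n][m], I[n][m])
```

With these default costs, affine_gap_distance("ac", "abbc") is 1.5, because the two inserted characters form a single gap (1.0 to open plus 0.5 to extend). A model that charged each inserted character a flat 1.0 would score the same pair at 2.0. This is why affine gaps suit fields where whole tokens are often dropped or abbreviated.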