Bbd writeup of comparison of string distance metrics

From Cohen Courses

This is a review of Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:Bbd.

This paper presents an experimental comparison of various string distance metrics, which are evaluated on matching and clustering tasks. The following methods were considered:

  • Edit-distance-like functions - Levenshtein and Monge-Elkan
  • Token-based distance functions - Jaccard similarity, Jensen-Shannon, SFS distance
  • Hybrid distance functions - soft TFIDF
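
To make the first two families concrete, here is a minimal Python sketch of one metric from each: unit-cost Levenshtein distance (character level) and Jaccard similarity (token level). The function names and the whitespace tokenizer are assumptions made for this illustration, not the paper's implementation. Monge-Elkan, which uses tuned affine gap costs, is not shown.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance, computed with the standard dynamic program."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(s: str, t: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens (assumed tokenizer)."""
    a, b = set(s.lower().split()), set(t.lower().split())
    if not a and not b:
        return 1.0  # convention chosen here: two empty strings are identical
    return len(a & b) / len(a | b)
```

For example, `levenshtein("Levenstein", "Levenshtein")` returns 1, since a single character is inserted, while `jaccard("William Cohen", "Cohen William W")` returns 2/3, because token order is ignored and only the token "w" is unshared.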

In their experiments, the authors found that

  • TFIDF performs best among several token-based metrics
  • Monge-Elkan performs best among several string edit-distance metrics

In addition, a combination of TFIDF and Jaro-Winkler (the soft TFIDF hybrid) performs better than either of them alone.
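
The combination can be sketched roughly as follows: tokens are weighted by TFIDF, but a token in one string may match a merely similar token in the other when their Jaro-Winkler score exceeds a threshold. This is a simplified illustration, not the authors' code. The helper names, the smoothed IDF `log(1 + N/df)`, the whitespace tokenizer, the threshold `theta=0.9`, and the use of the closest token's weight are all assumptions made for this sketch.

```python
import math
from collections import Counter


def jaro(s: str, t: str) -> float:
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    for i, c in enumerate(s):
        # look for an unmatched copy of c within the allowed window in t
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == c:
                s_hit[i] = t_hit[j] = True
                break
    m = sum(s_hit)
    if m == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3


def jaro_winkler(s: str, t: str, scale: float = 0.1) -> float:
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):  # common prefix, capped at 4 characters
        if a != b:
            break
        prefix += 1
    return j + prefix * scale * (1.0 - j)


def doc_frequencies(corpus):
    """Number of strings in the corpus that contain each token."""
    return Counter(w for doc in corpus for w in set(doc.lower().split()))


def tfidf_weights(tokens, doc_freq, n_docs):
    # Assumption: smoothed IDF so that tiny corpora do not yield all-zero vectors.
    raw = {w: math.log(c + 1) * math.log(1 + n_docs / doc_freq.get(w, 1))
           for w, c in Counter(tokens).items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: (v / norm if norm else 0.0) for w, v in raw.items()}


def soft_tfidf(s, t, doc_freq, n_docs, theta=0.9):
    ws = tfidf_weights(s.lower().split(), doc_freq, n_docs)
    wt = tfidf_weights(t.lower().split(), doc_freq, n_docs)
    if not ws or not wt:
        return 0.0
    score = 0.0
    for w, weight_s in ws.items():
        # closest token in t under Jaro-Winkler, together with its similarity
        best, sim = max(((v, jaro_winkler(w, v)) for v in wt), key=lambda p: p[1])
        if sim > theta:
            score += weight_s * wt[best] * sim
    return score
```

With `doc_freq = doc_frequencies(names)` and `n_docs = len(names)` for some list of names, `soft_tfidf("william cohen", "willliam cohen", doc_freq, n_docs)` still scores highly despite the misspelled first token, which an exact-match TFIDF would treat as a complete mismatch on that token.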