Sgopal1 writeup of string distances

From Cohen Courses
Jump to navigationJump to search

This is a review of the paper Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:sgopal1.


This paper presents a comparative evaluation of several string-matching algorithms. There are two different tasks considered - Matching and Clustering. The different algorithms considered :

  • Levenstein distance, Jaro , Jaro-winkler for letter based matching
  • TF-IDF , Jensen-Shannon ( making KL-divergence symmetric ) , a slightly modified token matching method due to Fellegi and Sunter for token based matching.
  • Two-level distance function due to Monge-Elkan, soft-TFIDF to allow matching nearly similar words.

The following conclusions were made

  • TF-IDF is the best among token based methods
  • There is no clear winner between Monge-Elkan and Jaro based methods.
  • SoftTF-IDF is best overall measure ( even on clustering )

Although the paper does a comprehensive evaluation, I think the strategy of selecting a particular distance measure heavily relies on the domain under consideration. It would be possible for someone to come up with a better distance measure given the domain knowledge ( which I think is most of the times known ).