Sgopal1 writeup of string distances
From Cohen Courses
Revision as of 01:48, 21 October 2009 by Sgopal1 (talk | contribs) (Created page with 'This is a review of the paper reviewed paper::Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by reviewer::user:sgopal1. This paper presents …')
This is a review of the paper Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:sgopal1.
This paper presents a comparative evaluation of several string-matching algorithms. There are two different tasks considered - Matching and Clustering. The different algorithms considered :
- Levenstein distance, Jaro , Jaro-winkler for letter based matching
- TF-IDF , Jensen-Shannon ( making KL-divergence symmetric ) , a slightly modified token matching method due to Fellegi and Sunter for token based matching.
- Two-level distance function due to Monge-Elkan, soft-TFIDF to allow matching nearly similar words.
The following conclusions were made
- TF-IDF is the best among token based methods
- There is no clear winner between Monge-Elkan and Jaro based methods.
- SoftTF-IDF is best overall measure ( even on clustering )
Although the paper does a comprehensive evaluation, I think the strategy of selecting a particular distance measure heavily relies on the domain under consideration. It would be possible for someone to come up with a better distance measure given the domain knowledge ( which I think is most of the times known ).