Difference between revisions of "Sgopal1 writeup of string distances"
From Cohen Courses
Jump to navigationJump to search (Created page with 'This is a review of the paper reviewed paper::Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by reviewer::user:sgopal1. This paper presents …') |
m (1 revision) |
(No difference)
|
Latest revision as of 10:42, 3 September 2010
This is a review of the paper Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:sgopal1.
This paper presents a comparative evaluation of several string-matching algorithms. There are two different tasks considered - Matching and Clustering. The different algorithms considered :
- Levenstein distance, Jaro , Jaro-winkler for letter based matching
- TF-IDF , Jensen-Shannon ( making KL-divergence symmetric ) , a slightly modified token matching method due to Fellegi and Sunter for token based matching.
- Two-level distance function due to Monge-Elkan, soft-TFIDF to allow matching nearly similar words.
The following conclusions were made
- TF-IDF is the best among token based methods
- There is no clear winner between Monge-Elkan and Jaro based methods.
- SoftTF-IDF is best overall measure ( even on clustering )
Although the paper does a comprehensive evaluation, I think the strategy of selecting a particular distance measure heavily relies on the domain under consideration. It would be possible for someone to come up with a better distance measure given the domain knowledge ( which I think is most of the times known ).