Sgopal1 writeup of string distances

This is a review of the paper Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:sgopal1.

This paper presents a comparative evaluation of several string-matching algorithms. There are two different tasks considered - Matching and Clustering. The different algorithms considered :

Levenstein distance, Jaro , Jaro-winkler for letter based matching
TF-IDF , Jensen-Shannon ( making KL-divergence symmetric ) , a slightly modified token matching method due to Fellegi and Sunter for token based matching.
Two-level distance function due to Monge-Elkan, soft-TFIDF to allow matching nearly similar words.

The following conclusions were made

TF-IDF is the best among token based methods
There is no clear winner between Monge-Elkan and Jaro based methods.
SoftTF-IDF is best overall measure ( even on clustering )

Although the paper does a comprehensive evaluation, I think the strategy of selecting a particular distance measure heavily relies on the domain under consideration. It would be possible for someone to come up with a better distance measure given the domain knowledge ( which I think is most of the times known ).

Sgopal1 writeup of string distances

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools