Difference between revisions of "Sgopal1 writeup of string distances"

Latest revision as of 10:42, 3 September 2010

This is a review of the paper Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:sgopal1.

This paper presents a comparative evaluation of several string-matching algorithms on two tasks: matching and clustering. The algorithms considered are:

Levenshtein distance, Jaro, and Jaro-Winkler for character-based matching.
TF-IDF, Jensen-Shannon (a symmetrized form of KL-divergence), and a slightly modified token-matching method due to Fellegi and Sunter for token-based matching.
A two-level distance function due to Monge-Elkan, and SoftTF-IDF, which allows nearly similar words to match.
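
To make the difference between the character-based and two-level measures concrete, here is a minimal Python sketch. It is not the paper's implementation: `char_similarity` (a length-normalized Levenshtein score) and its use as the inner measure for the two-level function are assumptions made for illustration, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def char_similarity(a: str, b: str) -> float:
    """Assumed normalization: map edit distance to a score in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def monge_elkan(s: str, t: str, inner=char_similarity) -> float:
    """Two-level score: average, over tokens of s, of the best inner
    similarity against any token of t. Note that it is not symmetric."""
    s_tokens = s.lower().split()
    t_tokens = t.lower().split()
    if not s_tokens or not t_tokens:
        return 0.0
    total = sum(max(inner(x, y) for y in t_tokens) for x in s_tokens)
    return total / len(s_tokens)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(monge_elkan("William Cohen", "Cohen, Wiliam W."))
```

The two-level score ignores token order, which is why it can handle reordered names that a plain edit distance over the whole string penalizes heavily.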

The following conclusions were made:

TF-IDF is the best among the token-based methods.
There is no clear winner between the Monge-Elkan and Jaro-based methods.
SoftTF-IDF is the best overall measure (even on clustering).
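
As a rough illustration of why a soft variant of TF-IDF can beat the plain one, the sketch below scores token pairs that are close under a secondary character-level similarity, not only exact matches. It is a simplified sketch under stated assumptions, not the paper's code: the log-based weighting, the threshold `theta=0.9`, the use of `difflib.SequenceMatcher` as the secondary similarity, and the tiny corpus of names are all chosen here for illustration.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tfidf_vectors(names):
    """Unit-length TF-IDF vector (token -> weight) for each name."""
    docs = [Counter(name.lower().split()) for name in names]
    n_docs = len(docs)
    # Document frequency: number of names containing each token.
    df = Counter(tok for doc in docs for tok in doc)
    vectors = []
    for doc in docs:
        raw = {t: math.log(tf + 1) * math.log(n_docs / df[t])
               for t, tf in doc.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({t: (v / norm if norm else 0.0)
                        for t, v in raw.items()})
    return vectors


def inner_sim(a: str, b: str) -> float:
    """Assumed secondary similarity (a stdlib stand-in)."""
    return SequenceMatcher(None, a, b).ratio()


def soft_tfidf(vs, vt, theta=0.9, inner=inner_sim):
    """Sum weight products over tokens of vs whose closest token in vt
    has inner similarity above theta, scaled by that similarity."""
    if not vt:
        return 0.0
    score = 0.0
    for w, weight_s in vs.items():
        best, best_sim = max(((v, inner(w, v)) for v in vt),
                             key=lambda pair: pair[1])
        if best_sim > theta:
            score += weight_s * vt[best] * best_sim
    return score


if __name__ == "__main__":
    # Hypothetical corpus; the second name contains a typo on purpose.
    names = ["william cohen", "willliam cohen",
             "pradeep ravikumar", "stephen fienberg"]
    vecs = tfidf_vectors(names)
    print(soft_tfidf(vecs[0], vecs[1]))
```

With only exact token matches allowed, the misspelled first name would contribute nothing to the score; here it still counts because the two spellings are close under the secondary similarity.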

Although the paper does a comprehensive evaluation, I think the choice of a particular distance measure relies heavily on the domain under consideration. Given domain knowledge (which I think is available most of the time), it should be possible to come up with a better distance measure.

Revision as of 01:48, 21 October 2009 by Sgopal1 (created page); latest revision as of 10:42, 3 September 2010 by WikiAdmin (minor edit, 1 revision).