Nschneid writeup of Cohen 2003

This is Nschneid's review of Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks

The authors describe the work as follows: "Recently, we have begun implementing an open-source, Java toolkit of name-matching methods (Cohen & Ravikumar 2003) that includes a variety of different techniques. In this paper we use this toolkit to conduct a comparison of several string distances on the tasks of matching and clustering lists of entity names. We also introduce and evaluate a number of novel string-distance methods. One of these novel distance metrics performs better, on average, than any previous string-distance metric on our benchmark problems. This new distance metric extends cosine similarity by using the Jaro-Winkler method (Winkler 1999) to exploit nearly-matching tokens."

The paper addresses two types of similarity measures: edit-distance-like functions (Levenshtein and related metrics) and token-based distance functions (TFIDF, Jaccard, Jensen-Shannon, etc.). It also presents a new hybrid approach, SoftTFIDF, which combines TFIDF token weighting with a character-level similarity (Jaro-Winkler) so that nearly matching tokens also contribute to the score; it outperforms the other measures, on average, on the matching and clustering tasks used for the evaluation. Finally, an SVM classifier is trained with features incorporating various types of similarity scores; unsurprisingly, this supervised approach surpasses the unsupervised similarity measures.
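To make the hybrid idea concrete, here is a minimal sketch of a SoftTFIDF-style score. It is not the toolkit's implementation. Several details are assumptions made for illustration: the function names, the smoothed IDF formula, the threshold `theta=0.8`, and the use of Python's `difflib.SequenceMatcher` ratio as a stand-in for the Jaro-Winkler similarity used in the paper. Each token in the first string is paired with its most similar token in the second, and the pair contributes only if that similarity exceeds the threshold.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tokenize(s):
    # Lowercase whitespace tokenization (an assumption of this sketch).
    return s.lower().split()


def tfidf_vector(tokens, doc_freq, n_docs):
    # Weight = log(tf + 1) * log(1 + N / df), then scaled to unit length.
    # The "1 +" smoothing is an assumption that keeps weights positive.
    tf = Counter(tokens)
    raw = {
        w: math.log(c + 1) * math.log(1 + n_docs / max(doc_freq.get(w, 0), 1))
        for w, c in tf.items()
    }
    norm = math.sqrt(sum(x * x for x in raw.values()))
    return {w: x / norm for w, x in raw.items()} if norm else {}


def token_sim(a, b):
    # Stand-in character-level similarity in [0, 1]; NOT Jaro-Winkler.
    return SequenceMatcher(None, a, b).ratio()


def soft_tfidf(s, t, corpus, theta=0.8):
    # corpus: list of strings used only to estimate document frequencies.
    docs = [tokenize(d) for d in corpus]
    doc_freq = Counter(w for d in docs for w in set(d))
    vs = tfidf_vector(tokenize(s), doc_freq, len(docs))
    vt = tfidf_vector(tokenize(t), doc_freq, len(docs))
    if not vs or not vt:
        return 0.0
    score = 0.0
    for w, weight_s in vs.items():
        # Closest token in t, and how close it is.
        best_v, best_sim = max(
            ((v, token_sim(w, v)) for v in vt), key=lambda pair: pair[1]
        )
        if best_sim > theta:
            score += weight_s * vt[best_v] * best_sim
    return score


names = ["William Cohen", "Willaim Cohen", "Pradeep Ravikumar", "Stephen Fienberg"]
soft = soft_tfidf("William Cohen", "Willaim Cohen", names)
exact_only = soft_tfidf("William Cohen", "Willaim Cohen", names, theta=0.99)
```

With a threshold close to 1, only exactly matching tokens contribute, so the score behaves like ordinary TFIDF cosine similarity; lowering the threshold lets the misspelled "Willaim" also count toward the match.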
