Suranah writeup for Cohen 2003
This is a review of Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:Suranah.
The paper presents a survey of different string similarity on the task of matching matching named entities. One advantage with standard string similarity algorithms is that they do not have to be trained, and can work on a new dataset without any prior knowledge. The paper classifies the algorithms into edit distance like methods which include Jaro-Winkler and SmithWaterman, and into token based distance functions like Jaccard or TFIDF. The paper also uses hybrid measures including one based on TFIDF which is dubbed as SoftTFIDF.
I liked the idea of pruning the search space, though I do wonder that whether there could be better methods than using a priori. For instance, it maybe possible to emulate edit distance and similar algorithms on a character or word level tree, and thus pruning the search space while matching. It is possible that in this approach we have to match still lesser entities (as the search is in a tree rather linear) and yet maintain similar results.
The paper presents also presents a wide range of experiments on 13 datasets for two tasks (matching and clustering), and some decent an error analysis. SoftTFIDF was the best measure for both matching and clustering problems. It is notable that only token based measures perform poorly for clustering tasks, as the set has census data, which chiefly comprises of misspellings.