Nschneid writeup of Cohen 2003

From Cohen Courses

This is Nschneid's review of Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks

Recently, we have begun implementing an open-source, Java toolkit of name-matching methods (Cohen & Ravikumar 2003) that includes a variety of different techniques. In this paper we use this toolkit to conduct a comparison of several string distances on the tasks of matching and clustering lists of entity names. We also introduce and evaluate a number of novel string-distance methods. One of these novel distance metrics performs better, on average, than any previous string-distance metric on our benchmark problems. This new distance metric extends cosine similarity by using the Jaro-Winkler method (Winkler 1999) to exploit nearly-matching tokens.

The paper addresses two types of similarity measures: edit-distance-like functions (Levenshtein and related metrics) and token-based distance functions (TFIDF, Jaccard, Jensen-Shannon, etc.). A new hybrid approach, SoftTFIDF, is presented; it combines TFIDF with an edit-distance-like metric (Jaro-Winkler) so that nearly-matching tokens also contribute to the score, and it performs better on average than the other metrics on the matching and clustering tasks used for the evaluation. Finally, an SVM classifier is trained with features incorporating various types of similarity scores; unsurprisingly, this supervised approach surpasses the unsupervised similarity measures.
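To make the hybrid idea concrete, here is a minimal Python sketch of a SoftTFIDF-style score. It is one reading of the method, not the toolkit's implementation. Several details are assumptions made for illustration: whitespace tokenization, the `log(tf + 1) * log(N / df)` weighting, the threshold `theta = 0.9`, the handling of unseen tokens, and the choice to weight each soft match by the TFIDF weight of its closest token in the second string. All function names are hypothetical.

```python
import math
from collections import Counter


def jaro(s, t):
    """Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                break
    m = sum(s_hit)
    if m == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3


def jaro_winkler(s, t, p=0.1, max_prefix=4):
    """Jaro score boosted for a shared prefix of up to max_prefix characters."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def tokenize(name):
    # Assumption: lowercase whitespace tokens.
    return name.lower().split()


def build_idf(corpus):
    """Map each token to log(N / df) over a list of name strings."""
    n = len(corpus)
    df = Counter(tok for name in corpus for tok in set(tokenize(name)))
    return {tok: math.log(n / d) for tok, d in df.items()}


def tfidf_vector(name, idf):
    """Unit-length vector of log(tf + 1) * idf weights."""
    tf = Counter(tokenize(name))
    unseen = max(idf.values(), default=1.0)  # assumption: unseen tokens are rare
    raw = {w: math.log(c + 1) * idf.get(w, unseen) for w, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in raw.values()))
    return {w: x / norm for w, x in raw.items()} if norm else raw


def soft_tfidf(s, t, idf, theta=0.9, sim=jaro_winkler):
    """Sum weight products over tokens of s whose closest token in t exceeds theta."""
    vs, vt = tfidf_vector(s, idf), tfidf_vector(t, idf)
    score = 0.0
    for w, weight_s in vs.items():
        best_v, best_sim = max(
            ((v, sim(w, v)) for v in vt), key=lambda pair: pair[1], default=(None, 0.0)
        )
        if best_sim > theta:
            score += weight_s * vt[best_v] * best_sim
    return score
```

Passing an exact-match function as `sim` (for example `lambda a, b: float(a == b)`) reduces the score to plain TFIDF cosine similarity. On a toy corpus of four names (`"william cohen"`, `"willliam cohen"`, `"pradeep ravikumar"`, `"stephen fienberg"`), plain TFIDF scores the misspelled pair `"william cohen"` / `"willliam cohen"` at 0.2 because only `cohen` matches, whereas the soft variant scores it at 0.98. Note that the score as sketched is not symmetric in its two arguments.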

More on the evaluation datasets:

The “coraATDV” dataset includes the fields author, title, date, and venue in a single string. The “census” dataset is a synthetic, census-like dataset, from which only textual fields were used (last name, first name, middle initial, house number, and street).
  • Is there a reason to believe that the results of the comparison would be similar or different for other types of data? The datasets used for the evaluation seem to consist largely of person and place names. Has SoftTFIDF been tested on company names, biological terms, or longer units (phrases or sentences)?
  • Would it make sense to adapt SoftTFIDF for names and loanwords in a multilingual setting, where a transliteration/phonological model is used for edit distance and TFIDF scores are obtained from monolingual corpora in two languages?
  • Has SoftTFIDF been incorporated in larger tasks/applications such as the one addressed in Cohen 2000?