Cohen 2003 a comparison of string distance metrics for name matching tasks

From Cohen Courses

Citation

A comparison of string distance metrics for name-matching tasks, by W. W. Cohen, P. Ravikumar, and S. E. Fienberg. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-03), 2003.

Summary

In this paper the authors compare a variety of string distance metrics on the task of matching entity names, framed as classifying pairs of entity names as matching or non-matching. The main types of string distance metrics considered are edit distances, token-based distance functions, and hybrid distance functions. The focus of the paper is on matching problems with little prior knowledge or ill-structured data, so the selected metrics are ones that remain useful in these settings.

The best-performing method was found to be a hybrid combination of TFIDF and the Jaro-Winkler string matching scheme, which on average performs slightly better than either of the component methods alone.

Methods

The paper compares the following metrics, divided into three main types of string distances.

  1. Edit-distance functions map a pair of strings s and t to a real number d(s, t), which represents either the distance or the similarity of the two strings. The methods that were considered are:
    • Levenshtein distance, which assigns unit cost to all edit operations.
    • Monge-Elkan distance (Monge and Elkan, 1996). This is an affine version of the Smith-Waterman distance function, in which opening a gap is more costly than continuing a gap.
    • The Jaro metric (Jaro, 1995; Jaro, 1989). This is not based on an edit distance model but on the number and order of the common characters between two strings.
    • Jaro-Winkler (Winkler, 1999), a variant of Jaro which also uses the length of the longest common prefix of s and t.
  2. Token-based distance functions, in which the strings s and t are treated as bags (multisets) of word tokens, giving word sets S and T:
    • The Jaccard similarity of the word sets S and T, given by |S ∩ T| / |S ∪ T|.
    • The widely used TFIDF (cosine) similarity.
    • Jensen-Shannon distance (following Dagan et al., 1999) between s and t, based on the assumption that a token set can be viewed as a sample from an unknown distribution of tokens. With P_s and P_t the token distributions estimated from s and t, the distance is given by
      JS(s, t) = (1/2) [ KL(P_s || Q) + KL(P_t || Q) ]
      where Q = (P_s + P_t) / 2 and KL denotes the Kullback-Leibler divergence.
  3. Hybrid distance functions.
    • Level two distance functions: two long strings s and t are compared recursively. First, the strings are broken into substrings s = a_1 ... a_K and t = b_1 ... b_L.
      Similarity is then defined as
      sim(s, t) = (1/K) * sum_{i=1..K} max_{j=1..L} sim'(a_i, b_j)
      where sim' is a secondary distance function. The experiments include Monge-Elkan, Jaro, and Jaro-Winkler as secondary functions.
    • SoftTFIDF: This is a "soft" version of TFIDF, in which tokens of s that are merely similar to (not only identical to) tokens of t also contribute to the score. "Similar" tokens are determined by a secondary distance function, in this case Jaro-Winkler.
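To make the basic metrics above concrete, here are minimal Python sketches of Levenshtein, Jaccard, Jaro, and Jaro-Winkler. These are short, unoptimized illustrations written for this summary, not the implementations used in the paper's experiments.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # delete from s
                           cur[j - 1] + 1,       # insert into s
                           prev[j - 1] + (cs != ct)))  # substitute
        prev = cur
    return prev[-1]

def jaccard(s: str, t: str) -> float:
    """Token-based Jaccard similarity |S ∩ T| / |S ∪ T| on word sets."""
    a, b = set(s.split()), set(t.split())
    return len(a & b) / len(a | b) if a | b else 1.0

def jaro(s: str, t: str) -> float:
    """Jaro similarity: based on common characters within a window
    and the number of transpositions among them."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(len(s), len(t)) // 2 - 1
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):                   # count matching characters
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, j = 0, 0
    for i in range(len(s)):                      # count out-of-order matches
        if s_hit[i]:
            while not t_hit[j]:
                j += 1
            if s[i] != t[j]:
                transpositions += 1
            j += 1
    transpositions //= 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3.0

def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Jaro boosted by the length (up to 4) of the common prefix."""
    base = jaro(s, t)
    prefix = 0
    for cs, ct in zip(s[:4], t[:4]):
        if cs != ct:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)
```

For example, jaro("MARTHA", "MARHTA") yields 6 matching characters and 1 transposition, giving 17/18, which the Winkler prefix boost raises further.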

Results

The evaluation of the different metrics on a dataset is done by taking candidate pairs of names and ranking them according to distance. Under this evaluation, TFIDF performed best among token-based metrics, and an affine-gap edit distance (Monge-Elkan) performed best among edit-distance metrics. The best hybrid method combines TFIDF and Jaro-Winkler by replacing the exact token matches used in TFIDF with approximate matches based on Jaro-Winkler (SoftTFIDF). This was found to be the best-performing method overall, with slightly better performance on average than the base methods and occasionally much better performance.
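The "soft" token matching behind SoftTFIDF can be illustrated with a deliberately simplified sketch: it drops the TFIDF weighting (all tokens weighted equally, so it is closer to a soft Jaccard) and uses Python's difflib.SequenceMatcher ratio as a stand-in secondary similarity instead of Jaro-Winkler. The threshold theta and the helper name soft_overlap are choices made for this illustration, not part of the paper.

```python
from difflib import SequenceMatcher

def soft_overlap(s: str, t: str, theta: float = 0.8) -> float:
    """Simplified soft token overlap: a token of s contributes its best
    similarity to any token of t, but only if that similarity >= theta.
    Stand-in for SoftTFIDF with uniform (non-TFIDF) token weights."""
    a, b = s.lower().split(), t.lower().split()
    if not a or not b:
        return 0.0
    score = 0.0
    for w in a:
        # Best secondary similarity of w against all tokens of t.
        best = max(SequenceMatcher(None, w, v).ratio() for v in b)
        if best >= theta:          # only "close" tokens count
            score += best
    return score / max(len(a), len(b))
```

With theta = 0.8, a misspelled pair like "jon smith" vs. "john smith" still scores highly because "jon" is close to "john", whereas exact token overlap would discard it entirely.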