Difference between revisions of "Cohen 2003 a comparison of string distance metrics for name matching tasks"

From Cohen Courses
Jump to navigationJump to search
Line 8: Line 8:
  
 
== Summary ==
 
== Summary ==
 +
In this [[Category::paper]] the authors compare a variety of string distance metrics on the task of matching entity names. In this work, this task is considered as a classification of entity pairs as ''matching'' or ''non-matching''. The main types of string distance metrics being considered are ''string edit distances'', ''token based distance functions'', and ''hybrid distance functions''. The focus of the paper is on evaluating the behavior of string matching tools for matching problems with little prior knowledge or ill-structured data, and so the selected string distance metrics are useful in these settings.
 +
 +
The best performing method was found to be hybrid combination of TFIDF and the Jaro-Winkler string matching scheme, which on average performs slightly better than either of the separate methods.
 +
 +
== Methods ==
 +
 +
The paper compares the following three main types of string distance metrics.
 +
# '''Edit-distance functions''' map a pair of strings <math>s</math> and <math>t</math>, to a value <math>r</math>, which represents either the distance or the similarity of the two strings. The methods that were considered are:
 +
#:* ''Levenshtein'' distance, which assigns unit cost to all edit operations.
 +
#:* ''Monger-Elkan'' distance. This is an affine version of the Smith-Waterman distance function, in which opening a gap is more costly than continuing a gap.
 +
#:* The ''Jaro'' metric. This is not based on an edit distance model but on the number and order of the common characters between two strings.
 +
#:* ''Jaro-Winkler'', which is a variant of Jaro which also uses the length of the longest common prefix of <math>s</math> and <math>t</math>.
 +
# '''Token-based distance functions''' such as:
 +
#:* The strings <math>s</math> and <math>t</math> were considered as a ''bag-of-words''.
 +
#:* The ''Jaccard similarity'' of the word sets <math>S</math> and <math>T</math>, given by <math>\frac{|S\cap T|}{|S\cup T|}</math>
 +
 +
 +
 +
 +
== Results ==
 +
The evaluation of different metrics on a dataset is done by grouping pairs of names and then ranking them according to distance. Under this evaluation, TFIDF performed best among token-based metrics, and an affine-gap edit-distance performed best among edit-distance metrics. The best hybrid method combines TFIDF and the Jaro-Winkler method by replacing the exact matches used in TFIDF with approximate matches based on Jaro-Winkler.  This was found to be the best performing method with slightly better performance on average than the base methods and with occasional much better performance.

Revision as of 13:02, 29 November 2011

Citation

A comparison of string distance metrics for name-matching tasks, by W. W Cohen, P. Ravikumar, S. E Fienberg. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web (IIWeb-03), 2003.

Online

Summary

In this paper the authors compare a variety of string distance metrics on the task of matching entity names. In this work, this task is considered as a classification of entity pairs as matching or non-matching. The main types of string distance metrics being considered are string edit distances, token based distance functions, and hybrid distance functions. The focus of the paper is on evaluating the behavior of string matching tools for matching problems with little prior knowledge or ill-structured data, and so the selected string distance metrics are useful in these settings.

The best performing method was found to be hybrid combination of TFIDF and the Jaro-Winkler string matching scheme, which on average performs slightly better than either of the separate methods.

Methods

The paper compares the following three main types of string distance metrics.

  1. Edit-distance functions map a pair of strings and , to a value , which represents either the distance or the similarity of the two strings. The methods that were considered are:
    • Levenshtein distance, which assigns unit cost to all edit operations.
    • Monger-Elkan distance. This is an affine version of the Smith-Waterman distance function, in which opening a gap is more costly than continuing a gap.
    • The Jaro metric. This is not based on an edit distance model but on the number and order of the common characters between two strings.
    • Jaro-Winkler, which is a variant of Jaro which also uses the length of the longest common prefix of and .
  2. Token-based distance functions such as:
    • The strings and were considered as a bag-of-words.
    • The Jaccard similarity of the word sets and , given by



Results

The evaluation of different metrics on a dataset is done by grouping pairs of names and then ranking them according to distance. Under this evaluation, TFIDF performed best among token-based metrics, and an affine-gap edit-distance performed best among edit-distance metrics. The best hybrid method combines TFIDF and the Jaro-Winkler method by replacing the exact matches used in TFIDF with approximate matches based on Jaro-Winkler. This was found to be the best performing method with slightly better performance on average than the base methods and with occasional much better performance.