KeisukeKamataki writeup of Cohen 2003
From Cohen Courses
Jump to navigationJump to searchThis is a review of Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:KeisukeKamataki.
Summary: This paper implements and compares several methods of string distance for named entity matching/clustering.
- Edit-distance like functions:
- Levenstein distance -> Well-known, unit cost to all edit operations
- Jaro similarity metrics -> based on the number and order of common characters between two strings
- Jaro-Winkler metrics -> Jero distance + using the length P of the longest common prefix of two terms
- Token-based distance functions (regarding two strings as bags of tokens):
- Jaccoard similarity -> simply the size(|S_and_T|/|S_or_T|) of word sets S and T
- TF-IDF -> cosine similarity in terms of term frequency and doc frequency
- Jensen-Shannon -> based on KL divergence
- SFS-distance -> based on probability distribution of word appearance
- Hybrid distance function:soft TF-IDF
- Level Two Distance function -> Also consider the result of secondary distance function
- soft TF-IDF -> consider Jaro-Winker as a secondary distance
For matching task, SoftTF-IDF,TF-IDF, SFS and Level2 Jaro-Winkler worked well (especially SoftTF-IDF). SoftTF-IDF worked also best for clustering task.
I like: The set of experimental settings seem to be reasonable. It was a nice finding that TF-IDF based methods which are based on simple ideas give us the best performance.