KeisukeKamataki writeup of Cohen 2003

From Cohen Courses
Revision as of 10:42, 3 September 2010 by WikiAdmin (talk | contribs) (1 revision)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search

This is a review of Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:KeisukeKamataki.

Summary: This paper implements and compares several methods of string distance for named entity matching/clustering.

  • Edit-distance like functions:
    • Levenstein distance -> Well-known, unit cost to all edit operations
    • Jaro similarity metrics -> based on the number and order of common characters between two strings
    • Jaro-Winkler metrics -> Jero distance + using the length P of the longest common prefix of two terms
  • Token-based distance functions (regarding two strings as bags of tokens):
    • Jaccoard similarity -> simply the size(|S_and_T|/|S_or_T|) of word sets S and T
    • TF-IDF -> cosine similarity in terms of term frequency and doc frequency
    • Jensen-Shannon -> based on KL divergence
    • SFS-distance -> based on probability distribution of word appearance
  • Hybrid distance function:soft TF-IDF
    • Level Two Distance function -> Also consider the result of secondary distance function
    • soft TF-IDF -> consider Jaro-Winker as a secondary distance

For matching task, SoftTF-IDF,TF-IDF, SFS and Level2 Jaro-Winkler worked well (especially SoftTF-IDF). SoftTF-IDF worked also best for clustering task.

I like: The set of experimental settings seem to be reasonable. It was a nice finding that TF-IDF based methods which are based on simple ideas give us the best performance.