Apappu writeup on Cohen et al. IJCAI03

From Cohen Courses
Jump to navigationJump to search

This is a review of Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:Apappu.

Edit distance like

Levenshtein -- Dynamic programming based on unit cost Monger-Elkan --- a variant of edit distance


common characters: Number and order based

Jaro metric -- makes note of common subsequence

Jaro-Winkler -- uses the length of P of the longest common prefix

Token based distance functions: jaccard similarity, TFIDF,

then Jennson Shannon distance based on unknown distributions of source and target tokens

variant/simplified Fellegi and Sunter seems to be a nice extension, since it is pretty obvious that disagreement on frequent terms is less important.

two level distance function could be tempting to attempt followed by the paradigm of stacked / two stage learning.

  • On a side note: KMP (Knuth Morris Pratt) is worth mentioning for its lazy memoization approach. [Knuth et al.'77]

blocking a.k.a branch and bound !

this method reminds me of the suffix tree algorithm .