Apappu writeup on Cohen et al. IJCAI03
This is a review of Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:Apappu.
Edit distance like
Levenshtein -- Dynamic programming based on unit cost Monger-Elkan --- a variant of edit distance
common characters: Number and order based
Jaro metric -- makes note of common subsequence
Jaro-Winkler -- uses the length of P of the longest common prefix
Token based distance functions: jaccard similarity, TFIDF,
then Jennson Shannon distance based on unknown distributions of source and target tokens
variant/simplified Fellegi and Sunter seems to be a nice extension, since it is pretty obvious that disagreement on frequent terms is less important.
two level distance function could be tempting to attempt followed by the paradigm of stacked / two stage learning.
- On a side note: KMP (Knuth Morris Pratt) is worth mentioning for its lazy memoization approach. [Knuth et al.'77]
blocking a.k.a branch and bound !
this method reminds me of the suffix tree algorithm .