Apappu writeup on Cohen et al. IJCAI03

Edit distance like

Levenshtein -- Dynamic programming based on unit cost Monger-Elkan --- a variant of edit distance

common characters: Number and order based

Jaro metric -- makes note of common subsequence

Jaro-Winkler -- uses the length of P of the longest common prefix

Token based distance functions: jaccard similarity, TFIDF,

then Jennson Shannon distance based on unknown distributions of source and target tokens

variant/simplified Fellegi and Sunter seems to be a nice extension, since it is pretty obvious that disagreement on frequent terms is less important.

two level distance function could be tempting to attempt followed by the paradigm of stacked / two stage learning.

On a side note: KMP (Knuth Morris Pratt) is worth mentioning for its lazy memoization approach. [Knuth et al.'77]

blocking a.k.a branch and bound !

this method reminds me of the suffix tree algorithm .

Navigation menu