Liuy writeup of string distances
This is a review of the paper Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:Liuy.
The paper studies a variety of technique and the combination technique :
the two edit-distance functions.
the token-based distance functions and the recursive matching scheme for comparing two long strings. Results for matching and clustering based on these metrics are also given.
I am particularly interested in the way of combining the distance metrics. The paper mentioned that training svm using the various scores (Monge-Elkan, Jaro-Winkler, TFIDF,SFS,andSoftTFIDF) as features. The learned metric uses much more information: trained on several thousand labeled pairs, whereas the other metrics considered require no training data. I think it can be casted as feature selection problem. Instead of using all the features, a sparse set of features can be selected by lasso types of approach, i.e. hybriding features regularized by sparsity constraints.
I have a question on the part that cobmbing the TFIDF method and the Jaro-Winkler. The paper replaces the exact token matched used in TFIDF with approximate token matches in Jaro-Winkler, and empirical performs slightly better than each individual without combining. There might be some interesting reasons for this to happen.