Liuy writeup of string distances

This is a review of the paper Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:Liuy.

The paper studies a variety of technique and the combination technique :

the two edit-distance functions.

the token-based distance functions and the recursive matching scheme for comparing two long strings. Results for matching and clustering based on these metrics are also given.

I am particularly interested in the way of combining the distance metrics. The paper mentioned that training svm using the various scores (Monge-Elkan, Jaro-Winkler, TFIDF,SFS,andSoftTFIDF) as features. The learned metric uses much more information: trained on several thousand labeled pairs, whereas the other metrics considered require no training data. I think it can be casted as feature selection problem. Instead of using all the features, a sparse set of features can be selected by lasso types of approach, i.e. hybriding features regularized by sparsity constraints.

I have a question on the part that cobmbing the TFIDF method and the Jaro-Winkler. The paper replaces the exact token matched used in TFIDF with approximate token matches in Jaro-Winkler, and empirical performs slightly better than each individual without combining. There might be some interesting reasons for this to happen.

Liuy writeup of string distances

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools