Mnduong writeup of Cohen et al. IJCAI '03

This is a review of Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:mnduong.

Questions I have:

  • In the definition of TFIDF, IDF is defined as the inverse of the fraction of names containing w. This ignores how often w occurs within those names, whereas TF does take frequency into account. Has the other way of computing IDF (the number of tokens in the corpus divided by the total number of occurrences of w in the corpus) been used, and does it make a difference?
  • In the definition of SoftTFIDF, CLOSE(theta, S, T) is defined as the set of words w in S such that there is some v in T with dist'(w, v) > theta. Shouldn't dist' here be sim' instead?

  • Also in this definition, D(w, T) is defined as the maximum of dist(w, v) over v in T. Is this dist a secondary function related to sim'?
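
To make the questions concrete, here is a minimal Python sketch. It is not the paper's implementation. The toy corpus is made up, and the function names (idf_document, idf_token, sim_secondary, close, d) are my own. The paper uses Jaro-Winkler as sim'; difflib's SequenceMatcher ratio stands in for it here purely as an assumption, to keep the sketch dependency-free. The last two functions also assume that dist' and dist both denote the similarity sim', which is exactly what the second and third questions ask about.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical toy corpus of tokenized names (not taken from the paper).
corpus = [
    ["john", "smith"],
    ["john", "john", "doe"],
    ["jane", "doe"],
    ["jon", "smyth"],
]


def idf_document(word, names):
    """Inverse of the fraction of names that contain `word`."""
    containing = sum(1 for name in names if word in name)
    return len(names) / containing


def idf_token(word, names):
    """Alternative: total tokens in the corpus / total occurrences of `word`."""
    counts = Counter(tok for name in names for tok in name)
    return sum(counts.values()) / counts[word]


def sim_secondary(w, v):
    """Stand-in for sim' (assumption: the paper uses Jaro-Winkler instead)."""
    return SequenceMatcher(None, w, v).ratio()


def close(theta, S, T):
    """Words w in S with some v in T whose similarity to w exceeds theta."""
    return {w for w in S if any(sim_secondary(w, v) > theta for v in T)}


def d(w, T):
    """Best similarity between w and any token v in T."""
    return max(sim_secondary(w, v) for v in T)


if __name__ == "__main__":
    for word in ("john", "doe", "smith"):
        print(word, idf_document(word, corpus), idf_token(word, corpus))
    S, T = ["john", "smith"], ["jon", "smyth"]
    print(close(0.75, S, T), d("john", T))
```

On this toy corpus the two IDF variants disagree: "john" and "doe" each appear in two of the four names, so the name-based IDF treats them identically, but "john" is repeated inside one name, so the token-based IDF ranks it as less informative than "doe". Whether that difference matters on real name-matching data is still the open question.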