Philgoo Han writeup of Cohen, Ravikumar and Fienberg


This is a review of Cohen_2003_a_comparison_of_string_distance_metrics_for_name_matching_tasks by user:Ironfoot.

  • Comparison of string distance metrics
    • Open-source Java toolkit for name matching
    • Little prior knowledge, ill-structured data
  • Edit-distance-like functions
  • Token-based distance functions
  • Hybrid distance functions
  • Blocking methods: not practical to match all pairs
  • Results
    • Matching: SoftTFIDF is generally the best
    • Clustering: Token-based metrics are good on average but bad when there are many misspellings
    • Combination of distance metrics: better results, but with training overhead
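
The metric families and the blocking idea in the outline above can be illustrated with a minimal Python sketch. It is not code from the paper's Java toolkit: Levenshtein distance stands in for the edit-distance-like family, Jaccard similarity over whitespace tokens stands in for the token-based family, and `candidate_pairs` is a hypothetical shared-token blocking helper. All function names and the sample names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(s, t):
    """Edit-distance example: minimum number of single-character
    insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard(s, t):
    """Token-based example: overlap of the two lowercase token sets,
    ignoring token order (1.0 means identical sets)."""
    a, b = set(s.lower().split()), set(t.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def candidate_pairs(names):
    """Blocking example (hypothetical helper): return only the index
    pairs whose names share at least one token, instead of all pairs."""
    index = defaultdict(set)  # token -> indices of names containing it
    for i, name in enumerate(names):
        for tok in set(name.lower().split()):
            index[tok].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

With these definitions, `levenshtein("William Cohen", "Willliam Cohen")` is 1, while `jaccard` for the same pair drops to 1/3 because the misspelled token no longer matches exactly. Word order has the opposite effect: `jaccard("William Cohen", "Cohen William")` is 1.0. This is one way to see why token-based metrics can suffer when there are many misspellings, and why hybrid functions compare tokens approximately rather than exactly.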