Sgopal1 writeup of hardening soft constraints


This is a review of the paper Cohen_2000_hardening_soft_information_sources by user:sgopal1.

This paper focuses on removing duplicate records in a database, where duplication is defined in a softer sense than exact match. The problem is formulated graph-theoretically: an optimization problem is set up and shown to be NP-hard. A probabilistic model is proposed, under certain assumptions about how the data is generated. A greedy algorithm that adds interpretation arcs at each time step is then proposed as a solution to the optimization problem. Some speed-up is obtained by using union-find methods.
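
A minimal sketch of the kind of greedy, union-find-based merging described above (this is not the paper's exact algorithm; the input format, scores, and threshold are hypothetical):

```python
# Sketch only: greedily merge soft records into clusters, using union-find
# so that repeated merge/lookup operations stay cheap.

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Path compression keeps later lookups near-constant time.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def greedy_harden(num_records, candidate_arcs, threshold=0.5):
    """candidate_arcs: list of (score, i, j) tuples suggesting that soft
    records i and j refer to the same hard tuple (hypothetical input)."""
    uf = UnionFind(num_records)
    accepted = []
    # Greedily accept the most promising interpretation arcs first.
    for score, i, j in sorted(candidate_arcs, reverse=True):
        if score < threshold:
            break
        if uf.find(i) != uf.find(j):
            uf.union(i, j)
            accepted.append((i, j))
    # Each union-find root stands in for one hard tuple.
    clusters = {}
    for r in range(num_records):
        clusters.setdefault(uf.find(r), []).append(r)
    return accepted, list(clusters.values())

# Example: records 0 and 1 look like duplicates, record 2 stands alone.
arcs = [(0.9, 0, 1), (0.1, 1, 2)]
print(greedy_harden(3, arcs))
```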

Questions


  • Isn't there a relation between |I|, |S| and |H|, namely |H| = |S| - |I|? That is, for each arc I add to the interpretation I, I remove one candidate from H. If this is correct, we could rewrite the optimization as maximizing the sum of weights, which is a standard problem and can be solved by duplicating the vertices and running a max-flow (see the sketch after this list).
  • Not sure about the reason for the acyclicity property? I think it might be possible to reformulate the problem as a minimum k-way cut: each component of the cut maps all of its nodes (soft tuples) to a single hard tuple (any tuple from the component). This also seems intuitive to me.
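
For the first question, a toy sketch of the "maximize the sum of weights" reformulation being suggested, cast here as a one-to-one max-weight assignment (scipy's linear_sum_assignment stands in for the duplicated-vertex max-flow construction; in the actual hardening problem several soft tuples may map to the same hard tuple, so this is purely illustrative and the weights are made up):

```python
# Toy illustration of the speculated reformulation: assign each soft tuple
# to one hard-tuple candidate so that the total weight of the chosen
# interpretation arcs is maximized.
import numpy as np
from scipy.optimize import linear_sum_assignment

# weights[s][h] = benefit of interpreting soft tuple s as hard tuple h
# (hypothetical numbers, not from the paper).
weights = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.1],
    [0.2, 0.1, 0.7],
])

# linear_sum_assignment minimizes cost, so negate to maximize total weight.
rows, cols = linear_sum_assignment(-weights)
for s, h in zip(rows, cols):
    print(f"soft tuple {s} -> hard tuple {h} (weight {weights[s, h]})")
print("total weight:", weights[rows, cols].sum())
```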


Criticism


  • No approximation factor? Generally, a greedy solution in theoretical computer science should come with an approximation guarantee, which is completely missing here.
  • No evaluation. Probably there was no other significant method to compare with, but some empirical results would still have been better.