Sgopal1 writeup of hardening soft information sources
This is a review of the paper Cohen_2000_hardening_soft_information_sources by user:sgopal1.
This paper focuses on removing duplicate records from a database, where duplication is defined in a softer sense than exact match. The problem is formulated graph-theoretically: an optimization problem is set up and shown to be NP-hard. A probabilistic model is proposed, along with certain assumptions about how the data is generated. A greedy algorithm, which adds interpretation arcs at each time step, is then proposed as a solution to the optimization problem. Some speed-ups are obtained by using union-find methods.
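To make the union-find idea concrete, here is a minimal sketch of greedily merging soft-duplicate records with a union-find structure. This is my own illustration, not the paper's algorithm: the string-similarity measure, the threshold, and the choice of cluster representative are all assumptions made purely for the example.

    # Illustrative sketch only: merge "soft" duplicates with union-find.
    # The similarity function and threshold are assumptions, not from the paper.
    from difflib import SequenceMatcher

    def similarity(a, b):
        """Rough string similarity in [0, 1]."""
        return SequenceMatcher(None, a, b).ratio()

    class UnionFind:
        def __init__(self, n):
            self.parent = list(range(n))

        def find(self, x):
            # Path halving keeps later lookups near-constant time.
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]
                x = self.parent[x]
            return x

        def union(self, x, y):
            self.parent[self.find(x)] = self.find(y)

    def harden(records, threshold=0.8):
        """Merge records whose pairwise similarity exceeds the threshold;
        return one representative ('hard') record per cluster."""
        uf = UnionFind(len(records))
        for i in range(len(records)):
            for j in range(i + 1, len(records)):
                if similarity(records[i], records[j]) >= threshold:
                    uf.union(i, j)
        clusters = {}
        for i, r in enumerate(records):
            clusters.setdefault(uf.find(i), []).append(r)
        # Pick an arbitrary member of each cluster as its hard tuple.
        return [members[0] for members in clusters.values()]

    # Example: the first two records are merged, the third stays separate.
    print(harden(["William W. Cohen", "William Cohen", "D. McAllester"]))

The paper's greedy algorithm instead chooses interpretation arcs to minimize a global cost, but the union-find bookkeeping for tracking which soft tuples have been merged is the same kind of speed-up alluded to above.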
Questions
- Isn't there a relation between |I|, |S| and |H|, namely |H| = |S| - |I|? For each arc added to the interpretation I, one candidate is removed from H. If this is correct, then we could rewrite the optimization as maximizing a sum of weights (a standard problem, solvable by duplicating the vertices and running max-flow).
- Not sure about the reason for the acyclicity requirement. I think it might be possible to reformulate the problem as a min k-way cut: each component of the cut maps all of its nodes (soft tuples) to a single hard tuple (any tuple chosen from the component). This also seems intuitive to me.
Criticism
- No approximation factor? A greedy algorithm in theoretical computer science should generally be reported with an approximation factor, which is completely missing here.
- No evaluation. Probably there was no other significant method to compare against, but some empirical results would still have been better.