Wka writeup of Cohen and Carvalho 2005

From Cohen Courses
Revision as of 11:42, 3 September 2010 by WikiAdmin

This is a review of cohen_2000_hardening_soft_information_sources by user:wka.

In a soft database, distinct identifiers may refer to the same entity. To harden the database is to determine which pairs of identifiers refer to the same real-world object. The paper casts hardening as an optimization problem: minimize the sum of the number of tuples in the hard database and the cost of the co-reference assumptions made. Finding an optimal hardening is NP-hard, but the paper presents a greedy algorithm that gives a good hardening in almost linear time. The cost function being minimized is derived probabilistically.
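To make the objective concrete, here is a minimal sketch of a greedy hardening loop. It is not the paper's algorithm: the names `hard_tuples` and `greedy_harden`, the per-pair costs, and the cheapest-first ordering are all assumptions made for illustration, and this naive version rescans every tuple for each candidate pair, so it does not achieve the almost-linear running time the paper reports.

```python
def hard_tuples(soft_tuples, rep):
    """Rewrite every identifier to its representative and drop duplicates."""
    return {tuple(rep[x] for x in t) for t in soft_tuples}


def greedy_harden(soft_tuples, candidates):
    """Greedily accept co-reference pairs that lower the objective.

    soft_tuples: list of tuples of identifiers (hypothetical input format).
    candidates:  list of (id_a, id_b, cost) co-reference assumptions; the
                 costs are assumed given here, whereas the paper derives
                 them probabilistically.
    Objective:   number of hard tuples + total cost of accepted pairs.
    """
    ids = {x for t in soft_tuples for x in t}
    ids |= {x for a, b, _ in candidates for x in (a, b)}
    rep = {x: x for x in ids}  # identifier -> current representative
    total_cost = 0.0

    # Assumption: try the cheapest co-reference assumptions first.
    for a, b, cost in sorted(candidates, key=lambda c: c[2]):
        if rep[a] == rep[b]:
            continue  # already treated as the same entity
        # Tentatively fold b's group into a's group.
        trial = {x: (rep[a] if r == rep[b] else r) for x, r in rep.items()}
        before = len(hard_tuples(soft_tuples, rep))
        after = len(hard_tuples(soft_tuples, trial))
        # Accept only if the tuples saved outweigh the cost of the assumption.
        if after + cost < before:
            rep = trial
            total_cost += cost

    return hard_tuples(soft_tuples, rep), total_cost


soft = [
    ("author", "W. Cohen", "paper1"),
    ("author", "William Cohen", "paper1"),
    ("author", "William Cohen", "paper2"),
    ("author", "A. Smith", "paper3"),
]
pairs = [
    ("W. Cohen", "William Cohen", 0.5),
    ("William Cohen", "A. Smith", 0.5),
]
hard, cost = greedy_harden(soft, pairs)
```

In this toy input, merging "W. Cohen" with "William Cohen" collapses two tuples into one, so the saving of one tuple outweighs the 0.5 cost and the pair is accepted. Merging in "A. Smith" saves no tuples, so that pair is rejected.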

  • It would have been good to include some experimental results (is there a standard dataset for this kind of problem?).