Liuy writeup of hardening soft information sources
This is a review of the paper Cohen_2000_hardening_soft_information_sources by user:Liuy.
When we say a database is soft, we mean it contains many inconsistencies and duplications. It tends to be more expensive to reason directly with a soft database than with a hard database, which has unique and consistent object identifiers.
The paper infers a likely hard database from a given soft database. It leverages record linkage -- deciding which entity descriptions refer to the same object -- and explores a structure indicating possible name co-reference relations.
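As a concrete illustration (my own toy example, not taken from the paper), consider soft bibliographic tuples in which the same conference appears under several strings:

```python
# Toy soft database (illustrative only, not from the paper): the same
# venue appears under three different strings, so naive joins or counts
# over this table would treat them as three distinct objects.
soft_db = [
    ("Hardening Soft Information Sources", "KDD-2000"),
    ("Hardening Soft Information Sources", "Knowledge Discovery and Data Mining 2000"),
    ("Hardening Soft Info. Sources", "KDD 2000"),
]

# Hardening maps all co-referent strings to one canonical identifier,
# collapsing the three soft tuples into a single hard tuple.
hard_db = [
    ("Hardening Soft Information Sources", "KDD-2000"),
]
```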
I like their approach to a probabilistic model: they cast the problem as selecting an interpretation I that minimizes a cost function, ignoring the items that do not depend on I, and further approximating the last term of the cost function by a linear function.
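Schematically, and in my own notation rather than the paper's (the exact terms and weights are defined there), the objective after dropping the I-independent items has roughly this shape:

```latex
% Schematic objective (my notation; see the paper for the exact terms).
% I maps soft references to hard references; I(S) is the hardened image of S.
\min_{I} \;\; \sum_{t \in I(S)} c_{\mathrm{hard}}(t)
  \;+\; \sum_{(r \mapsto h) \in I} c_{\mathrm{ref}}(r \mapsto h)
```

With uniform tuple costs the first term is proportional to |I(S)|, the number of distinct hard tuples, which is exactly the quantity whose minimization is shown to be hard next.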
The paper then shows that no polynomial-time algorithm minimizes |I(S)| (unless P = NP), even if we assume I_pot is bipartite and each soft reference occurs at most once.
This NP-hardness of optimal hardening motivates the search for an efficient approximate implementation.
The time-complexity results for the greedy method seem quite positive. It incrementally updates priorities across iterations. The union-find data structure seems interesting to me: it efficiently maintains the equivalence relation on references.
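A minimal sketch of the union-find structure the greedy method relies on (a standard path-compression / union-by-rank implementation; the paper's version and its integration with the priority updates may differ):

```python
class UnionFind:
    """Maintains an equivalence relation on references via disjoint sets."""

    def __init__(self):
        self.parent = {}  # reference -> parent reference
        self.rank = {}    # upper bound on tree height, for union by rank

    def find(self, x):
        # Register unseen references lazily as singleton classes.
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0
            return x
        # Locate the root of x's class.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        """Merge the equivalence classes of references a and b."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1


# Usage: merging co-referent name strings into one equivalence class.
uf = UnionFind()
uf.union("KDD-2000", "KDD 2000")
uf.union("KDD 2000", "Knowledge Discovery and Data Mining 2000")
assert uf.find("KDD-2000") == uf.find("Knowledge Discovery and Data Mining 2000")
```

Path compression plus union by rank gives near-constant amortized time per operation, which is what keeps the repeated merge-and-reprioritize loop of the greedy procedure cheap.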