Liuy writeup of hardening soft information sources


This is a review of the paper Cohen_2000_hardening_soft_information_sources by user:Liuy.

When we say a database is soft, we mean it contains many inconsistencies and duplications. It tends to be more expensive to reason directly with a soft database than with a hard database, which has unique and consistent object identifiers.
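To make the distinction concrete, here is a toy illustration of my own (the names and tuples are hypothetical, not from the paper):

```python
# Toy "soft" database (hypothetical example): the same real-world people
# appear under inconsistent names, so the first two tuples likely
# describe the same underlying fact.
soft_db = [
    ("advises", "W. Cohen", "J. Smith"),
    ("advises", "William Cohen", "John Smith"),
    ("teaches", "William Cohen", "10-601"),
]

# A corresponding "hard" database uses one canonical identifier per object:
hard_db = [
    ("advises", "william_cohen", "john_smith"),
    ("teaches", "william_cohen", "10-601"),
]
```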

The paper presents a method for inferring a likely hard database from a given soft database.

It leverages record linkage -- deciding which entity descriptions refer to the same object. It also exploits a graph structure, I_pot, that encodes the possible name co-reference relations.
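As a rough illustration of what such a candidate structure might look like (a sketch under my own assumptions, not the paper's construction of I_pot), one could pair up names that share a token:

```python
# Minimal sketch (hypothetical): propose candidate co-reference pairs by
# linking any two names that share at least one token. The paper's I_pot
# relates soft and hard references; this is only a loose analogy.
from itertools import combinations

def candidate_pairs(names):
    """Yield pairs of names sharing a token, as crude co-reference candidates."""
    tokens = {name: set(name.lower().replace(".", "").split()) for name in names}
    for a, b in combinations(names, 2):
        if tokens[a] & tokens[b]:
            yield (a, b)

names = ["W. Cohen", "William Cohen", "John Smith"]
print(list(candidate_pairs(names)))
# [('W. Cohen', 'William Cohen')]
```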

I like their probabilistic approach: they cast the problem as choosing an interpretation I that minimizes a cost function, drop the terms that do not depend on I, and approximate the last remaining term by a linear function.
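To sketch the general shape of this objective (my paraphrase, with placeholder weights rather than the paper's exact constants): I maps soft references to hard references, I(S) is the hard database it induces, and the cost trades off the size of that database against the plausibility of the individual links:

```latex
% Hedged reconstruction of the objective's shape; the weights w_1, w_2
% and the per-link costs c(r \mapsto h) are placeholders, not the
% paper's exact constants.
\min_{I} \; w_1 \, |I(S)| \;+\; w_2 \sum_{(r \mapsto h) \in I} c(r \mapsto h)
```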

The paper then shows that no polynomial-time algorithm can minimize |I(S)| (the size of the hard database induced by the interpretation I) unless P = NP, even when I_pot is bipartite and each soft reference occurs at most once.

The NP-hardness of optimal hardening motivates the search for an efficient approximation, which the authors provide in the form of a greedy algorithm.

The time complexity results for the greedy method seem quite positive. It incrementally updates priorities across iterations. The union-find data structure seems interesting to me: it efficiently maintains the equivalence relation on references.
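As a rough sketch of how these two ingredients might fit together (my own simplified rendering under assumed inputs, not the paper's exact algorithm): a priority queue orders candidate merges by cost, and union-find records which references have been declared co-referent.

```python
# Simplified sketch (my assumptions, not the paper's exact algorithm):
# greedily merge candidate co-referent references, cheapest merge first,
# tracking equivalence classes with union-find.
import heapq

class UnionFind:
    """Maintains an equivalence relation over references."""
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        # Path halving: point nodes closer to their root as we walk up.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx

def greedy_harden(references, candidate_merges):
    """candidate_merges: iterable of (cost, ref_a, ref_b) tuples."""
    uf = UnionFind(references)
    heap = list(candidate_merges)
    heapq.heapify(heap)                  # cheapest merges come out first
    while heap:
        cost, a, b = heapq.heappop(heap)
        if uf.find(a) != uf.find(b):
            uf.union(a, b)               # declare a and b co-referent
            # The paper's method would also update the priorities of the
            # remaining candidates here; this sketch keeps them fixed.
    # Group references by their equivalence-class representative.
    classes = {}
    for r in references:
        classes.setdefault(uf.find(r), []).append(r)
    return list(classes.values())

refs = ["W. Cohen", "William Cohen", "John Smith"]
merges = [(0.2, "W. Cohen", "William Cohen")]
print(greedy_harden(refs, merges))
# [['W. Cohen', 'William Cohen'], ['John Smith']]
```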