Sgardine writesup Cohen 2000 Hardening Soft Information Sources
This is a review of the paper Cohen_2000_hardening_soft_information_sources by user:sgardine.
Summary
Among other things, the web presents a collection of unstructured data that can be viewed as a collection of "soft" databases, in which relations are asserted between entities but the entities generally lack a unique, persistent identifier. If we could construct from these a "hard" database in which we knew which references corefer, we could pose sophisticated queries to the database without regard for where the individual relations were asserted. Here the goal is to find an interpretation of the references in the soft databases that minimizes the cost of the interpretation; the cost is constructed to trade off the weights associated (a priori) with the interpretation against the size of the recovered hard database. Finding the optimal interpretation is NP-hard, since a polynomial-time optimizer could be used to solve the vertex cover problem in polynomial time. A greedy algorithm for finding a good interpretation is presented.
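To make the trade-off concrete, here is a minimal sketch of a greedy hardening procedure in the spirit described above. It is an illustration of the cost trade-off (a-priori merge weights versus size of the recovered hard database), not the paper's exact cost function or algorithm; the names harden, merge_costs, and fact_cost are hypothetical.

def harden(soft_facts, merge_costs, fact_cost=1.0):
    # Greedy hardening sketch (hypothetical interface).
    #   soft_facts  : iterable of (relation, arg1, arg2) tuples over soft references
    #   merge_costs : dict mapping frozenset({ref_a, ref_b}) -> a-priori cost of
    #                 asserting that the two references corefer (lower = more plausible)
    #   fact_cost   : cost charged per distinct fact in the recovered hard database
    soft_facts = set(soft_facts)

    # Union-find over references; initially every soft reference names its own entity.
    parent = {}
    def find(r):
        parent.setdefault(r, r)
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    def hardened():
        return {(rel, find(a), find(b)) for rel, a, b in soft_facts}

    def total_cost(applied):
        # Trade-off: a-priori cost of the asserted coreferences
        # plus the size of the resulting hard database.
        return sum(merge_costs[m] for m in applied) + fact_cost * len(hardened())

    applied = set()
    current = total_cost(applied)
    while True:
        best, best_cost = None, current
        for merge in merge_costs:
            if merge in applied:
                continue
            a, b = tuple(merge)
            if find(a) == find(b):
                continue
            snapshot = dict(parent)          # so the tentative merge can be undone
            parent[find(a)] = find(b)        # tentatively merge the two entities
            cost = total_cost(applied | {merge})
            parent = snapshot                # undo the tentative merge
            if cost < best_cost:
                best, best_cost = merge, cost
        if best is None:                     # no remaining merge reduces the cost
            break
        a, b = tuple(best)
        parent[find(a)] = find(b)            # commit the best merge
        applied.add(best)
        current = best_cost
    return hardened(), {r: find(r) for r in parent}

The fact_cost knob plays the role of the Occam-style pressure toward a small hard database: merging references collapses duplicate facts, but each merge pays its a-priori weight.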
Commentary
Where do the weights of the potential interpretation set come from? I guess that's a separate subproblem, and could use some kind of string-similarity metric. It seems like sometimes we may want more than that, e.g. "Cassius Clay" and "Mohammed Ali" are more likely to corefer than their string divergence suggests.
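As a hypothetical instance of the string-similarity route (not something from the paper), a merge cost could be derived from a standard similarity ratio; the example also shows where it breaks down for aliases:

from difflib import SequenceMatcher

def similarity_cost(a, b, scale=10.0):
    # Higher string similarity -> lower cost of asserting coreference.
    sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()   # in [0, 1]
    return scale * (1.0 - sim)

# "Cassius Clay" and "Mohammed Ali" are very dissimilar as strings, so a purely
# string-based weight would heavily penalize merging them; an alias list or other
# external knowledge would be needed to lower that cost.
print(similarity_cost("Cassius Clay", "Mohammed Ali"))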
The paper is missing an evaluation; an evaluation seems ungainly but certainly possible: run queries with known truth values over a suitable document set and measure how well the hardened database answers them.
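A minimal sketch of that evaluation, assuming we can obtain the hardened database's answer set for each query along with a gold answer set (names are hypothetical):

def query_pr(predicted_answers, gold_answers):
    # Precision/recall of one query's answer set against known truth.
    predicted, gold = set(predicted_answers), set(gold_answers)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall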