Sgardine writesup Cohen 2000 Hardening Soft Information Sources

This is a review of the paper Cohen_2000_hardening_soft_information_sources by user:sgardine.

Summary

Among other things, the web presents a collection of unstructured data that can be viewed as a collection of "soft" databases, in which relations are asserted between entities but the entities generally lack unique, persistent identifiers. If we could construct from these a "hard" database in which we know which references corefer, we could pose sophisticated queries to it without regard for where individual relations were asserted. The paper seeks an interpretation of the references in the soft databases that minimizes a cost; the cost is constructed to trade off the a priori weights associated with the interpretation's merges against the size of the recovered hard database. Finding the optimal interpretation is NP-complete, since a polynomial-time optimizer could be used to solve the vertex cover problem in polynomial time. A greedy algorithm for finding a good interpretation is presented.
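
To make the cost tradeoff concrete, here is a minimal sketch (not the paper's exact algorithm or cost function): soft tuples are assumed to be (relation, ref_a, ref_b) triples, coreference candidates come with a priori weights, each distinct tuple in the recovered hard database is charged tuple_cost, and a merge is kept only when the shrinkage it buys pays for its weight. The names in the toy example at the bottom are invented for illustration.

<syntaxhighlight lang="python">
def greedy_harden(tuples, candidate_merges, tuple_cost=1.0):
    """Greedy hardening sketch.

    tuples: iterable of (relation, ref_a, ref_b) soft assertions.
    candidate_merges: iterable of (weight, ref_x, ref_y) coreference
        candidates; lower weight = more plausible a priori.
    tuple_cost: cost charged per distinct tuple in the hardened database.
    """
    parent = {}

    def find(par, r):
        # Walk up the union-find forest (no path compression, so a trial
        # copy of `par` can simply be discarded if a merge is rejected).
        par.setdefault(r, r)
        while par[r] != r:
            r = par[r]
        return r

    def hard_size(par):
        # Size of the hardened database under the current interpretation.
        return len({(rel, find(par, a), find(par, b)) for rel, a, b in tuples})

    accepted_weight = 0.0
    current_size = hard_size(parent)

    # Consider the most plausible (cheapest) merges first.
    for weight, x, y in sorted(candidate_merges):
        rx, ry = find(parent, x), find(parent, y)
        if rx == ry:
            continue
        trial = dict(parent)
        trial[ry] = rx
        new_size = hard_size(trial)
        # Keep the merge only if the resulting shrinkage of the hardened
        # database pays for the merge's a priori weight.
        if weight <= tuple_cost * (current_size - new_size):
            parent, current_size = trial, new_size
            accepted_weight += weight

    hardened = {(rel, find(parent, a), find(parent, b)) for rel, a, b in tuples}
    return hardened, accepted_weight + tuple_cost * current_size


# Toy example: two soft tuples collapse into one hard tuple.
soft = [("advises", "W. Cohen", "S. Gardine"),
        ("advises", "William Cohen", "S. Gardine")]
merges = [(0.3, "W. Cohen", "William Cohen")]
print(greedy_harden(soft, merges))
</syntaxhighlight>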

Commentary

Where do the weights of the potential interpretation set come from? I guess that is a separate subproblem, and one could use some kind of string-similarity metric. It seems like we may sometimes want more than that, e.g. to capture that "Cassius Clay" and "Mohammed Ali" are more likely to corefer than their string divergence suggests.
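
As one hypothetical choice of string-similarity weight (not taken from the paper), the Jaccard distance over character trigrams could be used; it also illustrates the limitation just noted, since the two aliases share no trigrams at all:

<syntaxhighlight lang="python">
def trigrams(s):
    s = "  " + s.lower() + " "   # pad so short strings still yield trigrams
    return {s[i:i + 3] for i in range(len(s) - 2)}

def string_weight(a, b):
    """Jaccard distance over character trigrams: 0.0 = identical, 1.0 = disjoint.
    Lower weight = more plausible that the two references corefer."""
    ta, tb = trigrams(a), trigrams(b)
    return 1.0 - len(ta & tb) / len(ta | tb)

print(string_weight("William Cohen", "W. Cohen"))     # well below 1.0: many shared trigrams
print(string_weight("Cassius Clay", "Mohammed Ali"))  # 1.0: no shared trigrams, despite coreferring
</syntaxhighlight>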

The evaluation is missing; an evaluation seems ungainly but certainly possible: run queries with known answers over a suitable document set and measure how well the hardened database answers them.
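
A sketch of what such an evaluation harness might look like (hypothetical, not from the paper), assuming the hard-tuple representation from the sketch above: queries with known gold answers are run against the hardened tuple set and scored with micro-averaged precision and recall.

<syntaxhighlight lang="python">
def evaluate(hardened, queries):
    """Micro-averaged precision/recall of query answers against gold answers.

    hardened: set of (relation, entity_a, entity_b) hard tuples.
    queries: list of (query_fn, gold_answers), where query_fn maps the
        hardened database to a set of predicted answers.
    """
    tp = fp = fn = 0
    for query_fn, gold in queries:
        predicted = query_fn(hardened)
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy check with a single hard tuple and one query with a known answer.
hardened = {("advises", "W. Cohen", "S. Gardine")}
queries = [
    (lambda db: {b for rel, a, b in db if rel == "advises" and a == "W. Cohen"},
     {"S. Gardine"}),
]
print(evaluate(hardened, queries))   # (1.0, 1.0)
</syntaxhighlight>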