Apappu writeup on Banko et al.

From Cohen Courses

This is a review of Banko_2007_open_information_extraction_from_the_web by user:Apappu.

  • This paper presents an open-domain IE system and compares it with KnowItAll, a state-of-the-art closed IE system.
  • The authors describe a scalable and computationally efficient way to extract relations from the Web. Their system has three essential components:
  • Self-Supervised Learner: uses a dependency parser to identify trustworthy relations, labels them as positive examples, and labels the rest as negative (co-training?). The authors employ certain heuristics to decide what a trustworthy relation looks like.
  • Single-Pass Extractor: tags words with POS and filters out non-essential phrases (such as prepositional phrases). Each candidate tuple is then passed on to the classifier.
  • Redundancy-Based Assessor: puts similar tuples into equivalence (normalized) bins based on their arguments and predicates.
  • To estimate the correctness of the facts, the authors manually inspected the extracted tuples and classified them by well-formedness.
  • They then discuss how to estimate the number of distinct facts among this huge set of relations. This seems a rather improbable task, given that they have little information about co-reference, spelling variants, or metonyms of the arguments, or about the various senses of the predicate phrases.
  • Overall, this paper is interesting from the standpoint of scalability.
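
The binning step of the Redundancy-Based Assessor can be pictured with a minimal sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `DROPPABLE`, `normalize`, `bin_tuples`, and `support_counts` are hypothetical, the word list is invented for the example, and the raw bin size is used as a crude stand-in for whatever confidence estimate the paper actually computes.

```python
from collections import defaultdict

# Hypothetical list of words dropped during normalization (an assumption
# made for this example, not taken from the paper).
DROPPABLE = {"the", "a", "an", "very", "also", "originally"}


def normalize(phrase):
    """Lowercase a phrase and drop the non-essential words listed above."""
    tokens = [t for t in phrase.lower().split() if t not in DROPPABLE]
    return " ".join(tokens)


def bin_tuples(tuples):
    """Group (arg1, relation, arg2) tuples whose normalized forms match."""
    bins = defaultdict(list)
    for arg1, rel, arg2 in tuples:
        key = (normalize(arg1), normalize(rel), normalize(arg2))
        bins[key].append((arg1, rel, arg2))
    return dict(bins)


def support_counts(bins):
    """Count the extractions in each bin (a crude redundancy signal)."""
    return {key: len(members) for key, members in bins.items()}


# Toy extractions, invented for illustration.
extractions = [
    ("Edison", "invented", "the light bulb"),
    ("edison", "also invented", "light bulb"),
    ("Tesla", "invented", "the induction motor"),
    ("Thomas Edison", "invented", "the light bulb"),
]

bins = bin_tuples(extractions)
counts = support_counts(bins)
for key, n in counts.items():
    print(key, n)
```

In this toy run the first two extractions share one bin, while "Thomas Edison" lands in a separate bin. That is the concern raised above: exact-match normalization does not resolve co-reference or name variants, so a count of bins can overstate the number of distinct facts.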