Sgopal1 writeup open Information Extraction from Web

From Cohen Courses

This is a review of Banko_2007_open_information_extraction_from_the_web by user:sgopal1.

This paper introduces TextRunner, a scalable, domain-independent way to extract information. It consists of three parts:

  • Self-supervised learner: Given a small corpus, it labels its own positive and negative instances using heuristics based on, among other things, the parse of each sentence. It maps a subsequence of words from the sentence to a feature vector and labels that feature vector as either positive or negative. A naive Bayes classifier (NBC) is then trained on this self-labeled dataset.
  • Single-pass extractor: Another heuristic extracts the subsequences that lie between two noun phrases. Non-essential words in each subsequence are eliminated, and the result is classified as positive or negative by the previously trained NBC.
  • Query processing: Queries can be run over the extracted relations and their results viewed quickly.

Comments:

  • I do not understand the self-labeling part. The paper is somewhat vague about what exactly is performed there.
  • In the query processing module, each relation is allocated to a particular machine. I don't think this is a good general assignment scheme, because a given query will be computed on only one particular machine, and this is (although I'm not 100% sure) considered bad: there is no parallelism within a given query. This module is not very impressive.
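
The learner and extractor steps summarized above can be made concrete with a toy sketch. This is not the paper's implementation: the `path_len` field standing in for parser output, the threshold `MAX_PATH_LEN`, the two features, the coarse `NP`/`V`/`O` tags, and the `NON_ESSENTIAL` word list are all assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict

# All names and values below are hypothetical, chosen only for illustration.
NON_ESSENTIAL = {"really", "very", "quickly"}  # toy list of droppable words
MAX_PATH_LEN = 4  # assumed cutoff on a parser-derived path length


def features(rel):
    """Map a relation span, a list of (word, tag) pairs, to cheap features."""
    return [
        "len:short" if len(rel) <= 3 else "len:long",
        "verb:yes" if any(tag == "V" for _, tag in rel) else "verb:no",
    ]


def self_label(candidates):
    """Self-labeling: the label comes from a parse-based heuristic
    (here, a short path between the two noun phrases), not from a human."""
    return [(features(rel), path_len <= MAX_PATH_LEN) for rel, path_len in candidates]


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        for feats, label in examples:
            self.class_counts[label] += 1
            for f in feats:
                self.feat_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for f in feats:
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


def extract(tagged, clf):
    """Single pass: take the words between adjacent noun phrases, drop
    non-essential words, and keep the span if the classifier accepts it."""
    np_idx = [i for i, (_, tag) in enumerate(tagged) if tag == "NP"]
    out = []
    for i, j in zip(np_idx, np_idx[1:]):
        rel = [(w, t) for w, t in tagged[i + 1:j] if w not in NON_ESSENTIAL]
        if rel and clf.predict(features(rel)):
            out.append((tagged[i][0], " ".join(w for w, _ in rel), tagged[j][0]))
    return out


# Small "parsed" corpus: (relation span, hypothetical parse path length).
TRAIN = [
    ([("invented", "V")], 2),
    ([("was", "V"), ("acquired", "V"), ("by", "O")], 3),
    ([("is", "V"), ("located", "V"), ("in", "O")], 3),
    ([("and", "O")], 6),
    ([("which", "O"), ("according", "O"), ("to", "O"), ("reports", "O"), ("from", "O")], 9),
    ([("said", "V"), ("that", "O"), ("after", "O"), ("meeting", "O")], 8),
]
clf = NaiveBayes().fit(self_label(TRAIN))

SENTENCE = [
    ("Edison", "NP"), ("really", "O"), ("invented", "V"),
    ("the phonograph", "NP"), ("and", "O"), ("the light bulb", "NP"),
]
EXTRACTIONS = extract(SENTENCE, clf)
print(EXTRACTIONS)
```

On the toy sentence, the sketch keeps the tuple (Edison, invented, the phonograph) and rejects the span "and" between the second pair of noun phrases. Note that the parser-like signal is used only for labeling; the classifier itself sees only the cheap features, so no parse is needed at extraction time.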
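
The concern about per-query parallelism can be illustrated with a small sketch. The machine count, the CRC32 hash, and the round-robin alternative are hypothetical choices made for illustration; the paper's actual index layout may differ.

```python
import zlib
from collections import defaultdict

NUM_MACHINES = 4  # hypothetical cluster size


def shard_by_relation(tuples, n=NUM_MACHINES):
    """Assign every tuple of a relation to the single machine chosen by
    hashing the relation name (the scheme described in the review)."""
    shards = defaultdict(list)
    for t in tuples:
        shards[zlib.crc32(t[1].encode("utf-8")) % n].append(t)
    return shards


def shard_round_robin(tuples, n=NUM_MACHINES):
    """Alternative, assumed for comparison: spread tuples across all
    machines regardless of their relation."""
    shards = defaultdict(list)
    for i, t in enumerate(tuples):
        shards[i % n].append(t)
    return shards


def machines_for_query(shards, relation):
    """Return the machines that must do work to answer a query on `relation`."""
    return sorted(
        m for m, ts in shards.items() if any(rel == relation for _, rel, _ in ts)
    )


TUPLES = [(f"person{i}", "invented", f"thing{i}") for i in range(8)]
TUPLES.append(("Google", "acquired", "YouTube"))

by_relation = shard_by_relation(TUPLES)
by_tuple = shard_round_robin(TUPLES)

# One machine answers the whole query under relation-based assignment.
print(len(machines_for_query(by_relation, "invented")))
# Every machine shares the work under round-robin assignment.
print(machines_for_query(by_tuple, "invented"))
```

With relation-based assignment, a query on one relation is answered by exactly one machine, so the cluster only helps when many different relations are queried at once. Spreading tuples across machines lets a single query use all of them, at the cost of merging partial results.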