Sgopal1 writeup: Open Information Extraction from the Web

This paper introduces TextRunner, a scalable, domain-independent way to extract information. It consists of three parts:

Self-supervised learner: Given a small corpus, it labels its own positive and negative instances on that corpus, using heuristics over the parse of each sentence. It maps a subsequence of words from the sentence to a feature vector and labels that vector as either positive or negative. A Naive Bayes classifier (NBC) is trained on this self-labeled dataset.
Single-pass extractor: Another heuristic extracts the subsequence of words between two noun phrases. Non-essential words in the subsequence are eliminated. The result is then classified as positive or negative by the previously trained NBC.
Query processing: Queries can be run over the extracted tuples, and the results can be viewed quickly.
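
To make the first two parts concrete for myself, here is a minimal toy sketch of the pipeline as I understand it. It is not the paper's implementation: the tiny POS lexicon, the `self_label` heuristic (a span is positive if it contains a verb and no conjunction), the feature set, and all function names are my own assumptions, standing in for the real parser and heuristics.

```python
import math
from collections import Counter

# Hypothetical toy POS lexicon standing in for a real tagger or parser.
POS = {
    "edison": "N", "phonograph": "N", "paris": "N", "france": "N",
    "cat": "N", "table": "N", "einstein": "N", "ulm": "N",
    "invented": "V", "is": "V", "was": "V", "born": "V",
    "the": "D", "a": "D", "in": "P", "of": "P", "and": "C",
}

def tag(word):
    return POS.get(word.lower(), "X")

def candidates(sentence):
    """Return (arg1, between_words, arg2) for each pair of adjacent nouns."""
    words = sentence.split()
    noun_idx = [i for i, w in enumerate(words) if tag(w) == "N"]
    return [(words[i], words[i + 1:j], words[j])
            for i, j in zip(noun_idx, noun_idx[1:]) if j > i + 1]

def features(between):
    """Cheap features: POS tags in the span plus a length bucket."""
    feats = ["tag=" + tag(w) for w in between]
    feats.append("len=" + ("short" if len(between) <= 3 else "long"))
    return feats

def self_label(between):
    """Assumed heuristic: trust spans with a verb and no conjunction."""
    tags = [tag(w) for w in between]
    return "V" in tags and "C" not in tags

class NaiveBayes:
    def fit(self, rows):
        self.class_counts = Counter(label for _, label in rows)
        self.feat_counts = {True: Counter(), False: Counter()}
        for feats, label in rows:
            self.feat_counts[label].update(feats)
        self.vocab = {f for feats, _ in rows for f in feats}
        return self

    def log_score(self, feats, label):
        n = sum(self.class_counts.values())
        total = sum(self.feat_counts[label].values())
        # Laplace smoothing on both the prior and the feature likelihoods.
        score = math.log((self.class_counts[label] + 1) / (n + 2))
        for f in feats:
            count = self.feat_counts[label][f]
            score += math.log((count + 1) / (total + len(self.vocab)))
        return score

    def predict(self, feats):
        return self.log_score(feats, True) > self.log_score(feats, False)

def extract(sentence, model):
    """Single pass: propose spans, classify them, drop determiners."""
    out = []
    for arg1, between, arg2 in candidates(sentence):
        if model.predict(features(between)):
            rel = " ".join(w for w in between if tag(w) != "D")
            out.append((arg1, rel, arg2))
    return out

# Self-supervised training: the labels come from the heuristic, not a human.
corpus = [
    "Edison invented the phonograph",
    "Paris is in France",
    "the cat and the table",
    "the phonograph of Edison",
]
rows = [(features(b), self_label(b)) for s in corpus for _, b, _ in candidates(s)]
model = NaiveBayes().fit(rows)

print(extract("Einstein was born in Ulm", model))
print(extract("the cat and the phonograph", model))
```

The point of the split is that the labeling heuristic may be expensive (it relies on a parse), while the trained classifier only needs cheap features, so the extractor can make a single fast pass over a large corpus.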

I do not understand the self-labeling part. The paper seems a little vague about what exactly is performed.
In the query processing module, each relation is allocated to a particular machine. I don't think this is a good way to do the assignment in general, because a given query will be computed on only one particular machine, and this is (although I'm not 100% sure) considered bad: there is no parallelism for a given query. This module is not very impressive.
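
A small sketch of what I mean, under my own assumptions: the cluster size, the `crc32`-based placement, and the function names are all hypothetical and not taken from the paper. It contrasts placing tuples by relation with placing them by whole tuple, and counts how many machines a query for one relation has to touch.

```python
import zlib
from collections import defaultdict

NUM_MACHINES = 4  # assumed cluster size

def bucket(key):
    # crc32 is stable across runs, unlike the built-in hash() on strings.
    return zlib.crc32(key.encode("utf-8")) % NUM_MACHINES

def place_by_relation(tuples):
    """All tuples of one relation land on the same machine."""
    machines = defaultdict(list)
    for t in tuples:
        machines[bucket(t[1])].append(t)
    return machines

def place_by_tuple(tuples):
    """Tuples of one relation may be spread over several machines."""
    machines = defaultdict(list)
    for t in tuples:
        machines[bucket("|".join(t))].append(t)
    return machines

def machines_touched(machines, rel):
    """Machines that hold at least one tuple of the given relation."""
    return {m for m, ts in machines.items() if any(t[1] == rel for t in ts)}

tuples = [("person%d" % i, "invented", "thing%d" % i) for i in range(20)]
tuples += [("city%d" % i, "is in", "country%d" % i) for i in range(20)]

by_rel = place_by_relation(tuples)
by_tup = place_by_tuple(tuples)
print(len(machines_touched(by_rel, "invented")))
print(len(machines_touched(by_tup, "invented")))
```

With placement by relation, a query on "invented" is always answered by exactly one machine, so the rest of the cluster sits idle for that query. Placement by tuple lets several machines scan in parallel, at the cost of merging partial results.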
