Sgardine writesup Banko 2007
This is a review of Banko_2007_open_information_extraction_from_the_web by user:Sgardine
Summary
Extracting a large number of relations from Web-scale corpora presents several challenges for "traditional" methods: specifying each relation is labor-intensive; the documents are drawn from many different distributions, so models trained on one distribution do not transfer to the others; and relations must be named in advance before they can be learned. TextRunner meets these challenges with Open IE, in which the only input is a corpus and the system attempts to identify every relation asserted therein.
The Self-Supervised Learner takes a small corpus sample and runs a parser over it; the parser's output is used as training data for a Naive Bayes classifier that judges whether a candidate tuple actually asserts the stated relation between its entities. The Single-Pass Extractor then uses POS tagging, NP chunking, and rule-based simplifications to find candidate tuples with normalized relation strings across the full corpus. Counts of each canonical tuple are used by the Redundancy-Based Assessor to estimate the probability that the tuple is true. The extracted tuples are indexed for later queries.
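To make the pipeline concrete, here is a minimal, self-contained sketch of the extract/normalize/assess idea. All names, the toy sentences, the crude NP-chunking rules, and the noisy-or scoring are my own illustrative assumptions; TextRunner's actual components (a Naive Bayes classifier over parse-derived features, and the urns-style redundancy model) are more involved.

```python
from collections import Counter

# Toy stand-in for the Single-Pass Extractor's input: sentences that have
# already been POS-tagged (TextRunner uses a lightweight tagger/chunker
# rather than a full parser at extraction time).
tagged_sentences = [
    [("Edison", "NNP"), ("invented", "VBD"), ("the", "DT"), ("phonograph", "NN")],
    [("Edison", "NNP"), ("also", "RB"), ("invented", "VBD"), ("the", "DT"), ("phonograph", "NN")],
    [("Tesla", "NNP"), ("developed", "VBD"), ("the", "DT"), ("induction", "NN"), ("motor", "NN")],
]

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def noun_phrase_spans(tagged):
    """Very crude NP chunking: maximal runs of DT/JJ/NN* that contain a noun."""
    spans, start = [], None
    for i, (_, tag) in enumerate(tagged):
        if tag in NOUN_TAGS or tag in {"DT", "JJ"}:
            start = i if start is None else start
        else:
            if start is not None and any(t in NOUN_TAGS for _, t in tagged[start:i]):
                spans.append((start, i))
            start = None
    if start is not None and any(t in NOUN_TAGS for _, t in tagged[start:]):
        spans.append((start, len(tagged)))
    return spans

def normalize(relation_tokens):
    """Heuristic normalization of the relation phrase: drop adverbs, modals,
    and determiners, then lowercase (a stand-in for the rule-based
    simplifications described in the paper)."""
    kept = [w.lower() for w, tag in relation_tokens if tag not in {"RB", "MD", "DT"}]
    return " ".join(kept)

def extract_tuples(tagged):
    """Emit (arg1, relation, arg2) for adjacent NP pairs in a single pass."""
    spans = noun_phrase_spans(tagged)
    for (s1, e1), (s2, e2) in zip(spans, spans[1:]):
        rel = normalize(tagged[e1:s2])
        if rel:  # require some relation phrase between the two arguments
            arg1 = " ".join(w for w, _ in tagged[s1:e1])
            arg2 = " ".join(w for w, _ in tagged[s2:e2])
            yield (arg1, rel, arg2)

# Schematic Redundancy-Based Assessor: more independent sightings of the same
# canonical tuple -> higher probability it is correct. A noisy-or over an
# assumed per-sighting precision is only an illustrative substitute for the
# paper's actual probability model.
counts = Counter(t for sent in tagged_sentences for t in extract_tuples(sent))
P_SIGHTING = 0.8  # assumed probability that any single extraction is correct
for tup, k in counts.items():
    prob = 1 - (1 - P_SIGHTING) ** k
    print(tup, k, round(prob, 3))
```

On the toy input, normalization maps "also invented" and "invented" to the same canonical relation, so the Edison tuple is counted twice and receives a higher probability than the single-sighting Tesla tuple, which is the redundancy effect the assessor relies on.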
The bottleneck seems to be visiting and processing a huge number of web pages; TextRunner gets more value out of that work by gathering every tuple it encounters rather than only the ones it was looking for (as KnowItAll did). Restricting attention to a subset of relations (given to KnowItAll as input, for the sake of comparison), TextRunner is found to retrieve about as many correct tuples at slightly higher precision, largely because of the learned classifier.
Commentary
I liked the analysis of the system's behavior on a sample of tuples, though it seems they could have gotten more out of this sample statistically, e.g. standard deviations for their estimates.
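For instance (with made-up numbers, not figures from the paper), even a simple binomial standard error would put a rough error bar on a precision estimate obtained from a hand-judged sample:

```python
from math import sqrt

# Hypothetical illustration: suppose n sampled tuples were hand-judged and a
# fraction p_hat were found correct. The binomial standard error gives a
# cheap uncertainty estimate for that precision figure.
n, p_hat = 400, 0.80          # assumed sample size and observed precision
se = sqrt(p_hat * (1 - p_hat) / n)
print(f"precision = {p_hat:.2f} +/- {1.96 * se:.3f} (95% CI half-width)")
```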
The relation-normalization phase seems like it might be interesting to examine separately: can we estimate how well their heuristic system is doing? Can we learn a model that does better? Can we estimate how much perfect performance on that subtask would help the later parts of the system? A tiny sketch of what such a measurement could look like follows.
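Everything below (the stand-in normalizer and the gold pairs) is hypothetical and only meant to show the shape of the evaluation, i.e. scoring a heuristic normalizer against a small hand-built set of canonical relation phrases:

```python
# Hypothetical evaluation of a relation normalizer against a tiny gold set.
def normalize_relation(phrase: str) -> str:
    # stand-in for the rule-based simplification step
    drop = {"also", "originally", "has", "had"}
    return " ".join(w for w in phrase.lower().split() if w not in drop)

gold = [
    ("also invented", "invented"),
    ("was originally developed by", "was developed by"),
    ("is the capital city of", "is the capital of"),
]

correct = sum(normalize_relation(raw) == canon for raw, canon in gold)
print(f"normalizer accuracy on gold sample: {correct}/{len(gold)}")
```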