KeisukeKamataki writeup of Banko 2007


This is a review of Banko_2007_open_information_extraction_from_the_web by user:KeisukeKamataki.

Summary: This paper introduces an open IE system called TEXTRUNNER, which, like KNOWITALL, extracts facts from the web. The system consists of three key components that distinguish it from KNOWITALL.

Self-supervised learner: This component automatically labels each tuple t=(e_i, r_i, e_j) in the training data as "trustworthy" or not. The labels come from a heuristic analysis of the syntactic structure shared by the extracted noun phrases, and they are then used to train a Naive Bayes classifier.
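
As a rough illustration of this step, here is a minimal sketch that labels tuples with hand-written heuristics and then trains a small Naive Bayes classifier on those labels. The specific heuristics (pronoun check, path-length threshold), the feature strings, and the names heuristic_label and NaiveBayes are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical pronoun list, for illustration only.
PRONOUNS = {"he", "she", "it", "they", "this", "that"}


def heuristic_label(e_i, e_j, path_length, max_path=4):
    """Assumed stand-in for the syntactic heuristics: reject tuples whose
    arguments are bare pronouns or whose syntactic path is too long."""
    if e_i.lower() in PRONOUNS or e_j.lower() in PRONOUNS:
        return False
    return path_length <= max_path


class NaiveBayes:
    """Multinomial Naive Bayes over string features, add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        # examples: iterable of (list_of_feature_strings, bool_label)
        for features, label in examples:
            self.class_counts[label] += 1
            for f in features:
                self.feature_counts[label][f] += 1
                self.vocab.add(f)

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for f in features:
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In this sketch the heuristic labeler would only be run on a small parsed sample, while the trained classifier works from cheap features and can be applied to every candidate tuple later.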

Single-Pass Extractor: This component assigns POS tags and finds relations by examining the text between noun phrases. It then stores a candidate tuple only if the trained Naive Bayes classifier labels it as "trustworthy".
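
The extraction pass could be sketched roughly as follows, assuming the sentence has already been POS-tagged. The tag set NP_TAGS, the max_gap limit, and the function names are hypothetical choices for this sketch, and the keep argument stands in for the trained classifier.

```python
# Penn Treebank tags treated as part of a noun phrase (an assumption).
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}


def chunk_noun_phrases(tagged):
    """Return (start, end) index spans of maximal runs of NP-like tags."""
    spans, start = [], None
    for i, (_, tag) in enumerate(tagged):
        if tag in NP_TAGS:
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tagged)))
    return spans


def candidate_tuples(tagged, keep=lambda t: True, max_gap=5):
    """Pair adjacent noun phrases and use the words between them as the
    relation string; keep only tuples accepted by the `keep` predicate."""
    words = [w for w, _ in tagged]
    spans = chunk_noun_phrases(tagged)
    tuples = []
    for (s1, e1), (s2, e2) in zip(spans, spans[1:]):
        between = words[e1:s2]
        if len(between) <= max_gap:
            t = (" ".join(words[s1:e1]), " ".join(between), " ".join(words[s2:e2]))
            if keep(t):
                tuples.append(t)
    return tuples


sentence = [("Tesla", "NNP"), ("invented", "VBD"), ("the", "DT"),
            ("coil", "NN"), ("transformer", "NN"), (".", ".")]
print(candidate_tuples(sentence))
```

Because each sentence is visited once and no parser is needed at this stage, this kind of pass stays cheap enough to run over a large corpus.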

Redundancy-based Assessor: This component calculates the probability that the tuple t is correct based on the number of sentences from which it was extracted. The role of the Assessor here is quite similar to that of the Assessor in KNOWITALL.
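
To make the idea of redundancy-based scoring concrete, here is a minimal sketch that counts the distinct sentences supporting each normalized tuple and turns the count into a probability with a simple noisy-or formula. The noisy-or is a stand-in chosen for brevity and is not claimed to be the model the paper uses; the per-sentence precision p=0.7 and the names normalize and assess are likewise assumptions.

```python
from collections import defaultdict


def normalize(t):
    """Lowercase each part and collapse whitespace so variants merge."""
    return tuple(" ".join(part.lower().split()) for part in t)


def assess(extractions, p=0.7):
    """extractions: iterable of (tuple, sentence_id) pairs.
    Returns {normalized_tuple: probability}, where the probability is
    1 - (1 - p)^k and k is the number of distinct supporting sentences."""
    support = defaultdict(set)
    for t, sentence_id in extractions:
        support[normalize(t)].add(sentence_id)
    return {t: 1 - (1 - p) ** len(ids) for t, ids in support.items()}
```

Under this simplification, a tuple seen in two distinct sentences scores higher than one seen in a single sentence, which captures the intuition that repeated extractions are more likely to be correct.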

Other things to note: Since this method doesn't use a search engine to extract relations, it is much faster than KNOWITALL. The paper also tries to estimate the correctness of the extracted facts. According to their analysis, 14% of the extracted relations are "concrete", meaning the tuple is grounded in particular entities, like t=(Tesla, invented, coil transformer). Such tuples could be useful for question answering or information extraction.

I like: Like KNOWITALL, they tackle an interesting and challenging problem. Their modular learning process sounds reasonable since it makes it easier to test combinations of multiple methods.

I didn't understand well: The evaluation criteria were unclear to me. If they were clearer about how they used the data for evaluation, it would also be helpful for other researchers.