Liuy writeup of Banko et al.

From Cohen Courses
Revision as of 10:42, 3 September 2010 by WikiAdmin (talk | contribs) (1 revision)

A review of Banko_2007_open_information_extraction_from_the_web by user:Liuy

The paper addresses the problem that, when moving to a brand new domain, users must manually write another set of extraction rules or name the target relations in advance. The authors propose an unsupervised extraction method to solve this problem, in which relations are extracted and recorded almost automatically.

It is an interactive system in the sense that it allows users to explore relations, and each tuple is assigned a probability and indexed. It first constructs a labeled training set using a parser and then trains a Naive Bayes classifier for extraction. The parser traverses the parse structure, labels a set of candidate tuples, and maps each one to a feature vector. These labeled feature vectors are then used to train the Naive Bayes classifier.
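The training step described above can be sketched as follows. This is a minimal Bernoulli Naive Bayes over binary features, not the authors' implementation. The feature names (`has_verb`, `short_relation`, and so on) and the four training examples are hypothetical, chosen only to illustrate how labeled feature vectors yield a probability for a new tuple.

```python
import math
from collections import defaultdict

def train_naive_bayes(examples):
    """examples: list of (set_of_feature_names, label) pairs; label is True or False."""
    label_counts = defaultdict(int)
    feature_counts = {True: defaultdict(int), False: defaultdict(int)}
    vocabulary = set()
    for features, label in examples:
        label_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
            vocabulary.add(f)
    return label_counts, feature_counts, vocabulary

def probability_trustworthy(model, features):
    """Return P(label=True | features) under a Bernoulli Naive Bayes model."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    log_scores = {}
    for label in (True, False):
        # Laplace-smoothed class prior.
        score = math.log((label_counts[label] + 1) / (total + 2))
        for f in vocabulary:
            # Laplace-smoothed probability that feature f is present in this class.
            p = (feature_counts[label][f] + 1) / (label_counts[label] + 2)
            score += math.log(p if f in features else 1 - p)
        log_scores[label] = score
    # Normalize in log space to avoid underflow.
    m = max(log_scores.values())
    exp_true = math.exp(log_scores[True] - m)
    exp_false = math.exp(log_scores[False] - m)
    return exp_true / (exp_true + exp_false)

# Hypothetical labeled feature vectors, standing in for the parser-labeled tuples.
training = [
    ({"has_verb", "short_relation"}, True),
    ({"has_verb", "short_relation", "proper_noun_args"}, True),
    ({"crosses_clause", "long_relation"}, False),
    ({"long_relation", "pronoun_arg"}, False),
]
model = train_naive_bayes(training)
good = probability_trustworthy(model, {"has_verb", "short_relation"})
bad = probability_trustworthy(model, {"crosses_clause", "long_relation"})
```

Here `good` comes out above 0.5 and `bad` below it, which is the kind of per-tuple probability the review refers to.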

This paper also describes TEXTRUNNER, an open IE system for the web that matches the recall achieved by KNOWITALL. It is highly scalable and can extract a vast amount of information from a large web corpus. TEXTRUNNER seems to have a lower error rate than KNOWITALL while finding a similar number of correct extractions. But it still suffers a lot from errors in noun phrase analysis, which add stray words to some arguments and truncate others.

I like the work, but I think it could be improved by unifying the whole thing into a graph structure, since a graph is a better way to describe and mine relations.
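As a rough illustration of the graph suggestion, extracted tuples can be stored as a directed graph with entities as nodes and relation strings as edge labels. This is only a sketch of the reviewer's idea, not something the paper does. The function names and the example tuples are hypothetical.

```python
from collections import defaultdict

def build_relation_graph(tuples):
    """Map each first argument to a list of (relation, second argument) edges."""
    graph = defaultdict(list)
    for arg1, relation, arg2 in tuples:
        graph[arg1].append((relation, arg2))
    return graph

def reachable_entities(graph, start):
    """Return every entity reachable from start by following relation edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        # graph.get avoids inserting empty entries for leaf entities.
        for _relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    seen.discard(start)
    return seen

# Hypothetical extractions, used only for illustration.
extractions = [
    ("Edison", "invented", "phonograph"),
    ("phonograph", "records", "sound"),
    ("Tesla", "worked for", "Edison"),
]
relation_graph = build_relation_graph(extractions)
linked = reachable_entities(relation_graph, "Tesla")
```

With the tuples in this form, multi-step connections between entities (here, everything reachable from "Tesla") fall out of a plain graph traversal, which a flat list of tuples does not offer directly.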

I have the following concerns about the statistics they report for TEXTRUNNER. In estimating the correctness of the facts, they take a random selection of 400 tuples from the filtered set. However, they do not mention exactly how the filtering is done. The concrete-versus-abstract judgment is rather subjective. As a result, the true-or-false judgment built on top of it is dubious.

In estimating the number of distinct facts, they find it very hard to decide how many relation strings are synonymous, since many relations have multiple senses. Two relations may be synonymous in one sense but not in another. So they reduce the problem to similarity at the tuple level. Though the reduction makes the problem a lot easier, it also trivializes it and results in an overestimation of the number of unique facts.
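The overestimation concern can be made concrete with a small sketch. It groups tuples by their argument pair and brackets the number of distinct facts between two extremes: every relation string between the same arguments counts as one fact (lower bound), or every distinct relation string counts as its own fact (upper bound). This is one possible reading of the tuple-level reduction, not the authors' procedure, and the example tuples are hypothetical.

```python
from collections import defaultdict

def group_by_arguments(tuples):
    """Collect the set of relation strings seen for each (arg1, arg2) pair."""
    groups = defaultdict(set)
    for arg1, relation, arg2 in tuples:
        groups[(arg1, arg2)].add(relation)
    return groups

def fact_count_bounds(tuples):
    """Return (lower, upper) bounds on the number of distinct facts."""
    groups = group_by_arguments(tuples)
    # Lower bound: all relations sharing an argument pair are synonymous.
    lower = len(groups)
    # Upper bound: no two distinct relation strings are synonymous.
    upper = sum(len(relations) for relations in groups.values())
    return lower, upper

# Hypothetical tuples: the first two plausibly express the same fact.
sample_tuples = [
    ("Edison", "invented", "phonograph"),
    ("Edison", "was inventor of", "phonograph"),
    ("Edison", "founded", "General Electric"),
]
bounds = fact_count_bounds(sample_tuples)
```

The gap between the two numbers is the room in which a count of unique facts can be inflated when synonymous relation strings are not merged.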