Liuy writesup Etzioni 2004


This is a review of etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:Liuy.

This paper introduces the KNOWITALL system, which extracts large collections of facts (on the order of 50,000) from the web with reasonable precision. The authors discuss three techniques added to KNOWITALL: rule learning (RL), subclass extraction (SE), and list extraction (LE), and show how these techniques improve recall and extraction rate while keeping roughly the same precision. They incorporate the Nutch open-source search engine into KNOWITALL because they find that the rate at which KNOWITALL can issue search-engine queries is the main factor limiting its performance. However, adding RL, SE, and LE does not affect KNOWITALL's linear computational complexity. I think the PMI-based assessment of extracted instances could be improved with machine learning techniques, for example co-training. The paper's main contribution is showing a possible way to scale up IE, collecting a large number of facts to support AI systems.
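To make the assessment step concrete, here is a minimal Python sketch of PMI-style scoring with search-engine hit counts. The hit_count helper, the example discriminator phrases, and the simple majority-vote acceptance rule are my own illustrative assumptions; the paper itself combines thresholded PMI features probabilistically.

 # Sketch of PMI assessment: PMI(I, D) = Hits(D instantiated with I) / Hits(I).
 # hit_count(query) is a hypothetical function returning search-engine hit counts.
 def pmi_score(instance, discriminator, hit_count):
     hits_instance = hit_count(instance)
     if hits_instance == 0:
         return 0.0
     hits_both = hit_count(discriminator.format(instance))
     return hits_both / hits_instance

 def assess(instance, discriminators, hit_count, threshold=1e-5):
     # Accept the extracted instance if most discriminators give a high PMI
     # (a simplified stand-in for the paper's probabilistic combination).
     votes = sum(pmi_score(instance, d, hit_count) > threshold
                 for d in discriminators)
     return votes >= len(discriminators) / 2

 # Illustrative discriminators for the class City (not taken from the paper):
 city_discriminators = ["city of {}", "{} and other cities", "lived in {}"]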

Rule Learning learns domain-specific extraction rules, motivated by the observation that many of the best domain-specific rules do not match any generic pattern. It begins with seed instances produced by bootstrapping. Search-engine queries are then issued for each seed instance, and a context string is recorded from each page the search engine returns. The best substrings of these context strings, those able to extract new class instances with high accuracy, are transformed into extraction rules and incorporated into the system.
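A rough sketch of this idea in Python: collect the left and right contexts around seed instances in retrieved snippets and keep the most frequent contexts as candidate rules. The search_snippets function is a hypothetical stand-in for the search-engine queries, and matching seeds by a single token is a simplification; the real system also verifies each candidate rule's precision before adopting it.

 from collections import Counter

 def candidate_patterns(seeds, search_snippets, window=3, top_k=5):
     contexts = Counter()
     for seed in seeds:
         for snippet in search_snippets(seed):
             tokens = snippet.split()
             for i, tok in enumerate(tokens):
                 if tok.lower() == seed.lower():
                     left = " ".join(tokens[max(0, i - window):i])
                     right = " ".join(tokens[i + 1:i + 1 + window])
                     contexts[(left, right)] += 1
     # The most frequent contexts become candidate rules of the form
     # "<left context> <NounPhrase> <right context>".
     return contexts.most_common(top_k)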

Subclass Extraction feeds the class of interest to the Extractor and lets it instantiate the generic patterns. Extracting subclasses is similar to extracting class instances, except that the rules are modified to check whether the extracted noun is a common noun rather than a proper noun. Candidate subclasses are then ranked by probability, and an extraction-and-verification step is added to improve recall for subclasses.
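A hedged sketch of what instantiating the generic patterns for subclasses might look like. The pattern list and the lower-case test for common nouns are my own simplifications, not the paper's exact rules.

 # Generic Hearst-style patterns, instantiated with the class of interest.
 GENERIC_PATTERNS = ["{cls} such as <NP>", "such {cls} as <NP>",
                     "<NP> and other {cls}", "<NP> or other {cls}"]

 def subclass_queries(class_name):
     return [p.format(cls=class_name) for p in GENERIC_PATTERNS]

 def looks_like_common_noun(phrase):
     # Subclass names (e.g. "physicist" for Scientist) should be common nouns,
     # unlike instance names, which are proper nouns.
     return phrase.islower()

 # subclass_queries("scientists") ->
 #   ["scientists such as <NP>", "such scientists as <NP>", ...]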

List Extraction exploits regularly formatted lists on the web. An assessment algorithm is proposed based on the intuition that the more valid instances a list contains, the more accurate the list is. Several approaches are tried: a naive Bayes classifier, EM, and simply sorting the instances by the number of lists in which they appear.
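The simplest of these variants, ranking candidates by how many distinct lists they appear in, is easy to sketch. Here 'lists' is assumed to be an iterable of extracted lists, each a sequence of candidate instance strings.

 from collections import Counter

 def rank_by_list_count(lists):
     counts = Counter()
     for items in lists:
         for instance in set(items):   # count each list at most once
             counts[instance] += 1
     return counts.most_common()

 # rank_by_list_count([["Paris", "London"], ["Paris", "Rome"], ["Paris"]])
 # -> [("Paris", 3), ("London", 1), ("Rome", 1)]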


I have some questions regarding their experiments:

1. The three classes, City, Scientist, and Film, seem to me highly biased toward information with a lot of regularity. I am not sure these classes can represent the general performance of their system.

2. As they point out, search-engine queries are the bottleneck, so extraction rate is measured by the number of unique instances extracted per search-engine query.

3. They claim to run the experiments with the best parameter setting, but it is not clear how this setting was obtained, possibly on a validation set? If so, the question is how that validation set was sampled, and whether it is highly biased.