Philgoo Han writeup of Etzioni, Cafarella, Downey, Popescu, Shaked, Soderland, Weld and Yates
From Cohen Courses
This is a review of Etzioni_2004_methods_for_domain_independent_information_extraction_from_the_web_an_experimental_comparison by user:Ironfoot.
- KnowItAll, extended with Rule Learning, Subclass Extraction, and List Extraction to improve recall
- KnowItAll
- Bootstrap phase: generic extraction patterns instantiated with 'focus' keywords
- Can generic patterns be precise? Would the approach scale with many generic patterns?
- Search for the instantiated patterns using general-purpose search engines
- Assess candidate extractions with PMI (pointwise mutual information)
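The PMI assessment step listed under KnowItAll can be sketched as follows. This is a minimal illustration, assuming PMI is approximated as the ratio of search hits for a discriminator phrase containing the instance to hits for the instance alone. The names `pmi_score`, `fake_hit_count` and `FAKE_HITS` are hypothetical, and the hit counts are made up rather than taken from any search engine.

```python
from typing import Callable


def pmi_score(instance: str, discriminator: str,
              hit_count: Callable[[str], int]) -> float:
    """Return hits(discriminator filled with instance) / hits(instance)."""
    instance_hits = hit_count(f'"{instance}"')
    if instance_hits == 0:
        return 0.0  # no evidence for the instance at all
    phrase = discriminator.format(instance)
    return hit_count(f'"{phrase}"') / instance_hits


# Made-up hit counts standing in for a search engine (illustrative only).
FAKE_HITS = {
    '"Paris"': 1000, '"city of Paris"': 50,
    '"Banana"': 800, '"city of Banana"': 0,
}


def fake_hit_count(query: str) -> int:
    """Hypothetical stand-in for a web search hit-count lookup."""
    return FAKE_HITS.get(query, 0)


paris = pmi_score("Paris", "city of {}", fake_hit_count)
banana = pmi_score("Banana", "city of {}", fake_hit_count)
```

In this sketch every scored candidate costs two hit-count lookups per discriminator, which is one way to see why the approach is query-hungry.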
- Rule Learning
- Find domain-specific patterns
- From text extracted with the generic patterns, find new patterns by applying a context window around the extractions.
- H1: select only patterns that occur in multiple contexts (favors precision over recall)
- H2: select rules with higher 'only if' precision
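The two Rule Learning heuristics can be illustrated with a small sketch. It assumes single-word instances, a fixed window of words preceding each instance, and a crude precision estimate that treats every word outside the known-instance set as a negative. These simplifications, and the name `learn_rules`, are assumptions of this example rather than the paper's procedure.

```python
from collections import Counter
from typing import List, Set, Tuple


def learn_rules(sentences: List[str], positives: Set[str],
                window: int = 2,
                min_contexts: int = 2) -> List[Tuple[str, float]]:
    """Learn prefix patterns from the words that precede known instances."""
    pos_counts = Counter()  # prefix -> times followed by a known instance
    all_counts = Counter()  # prefix -> times followed by any word
    for sentence in sentences:
        words = sentence.split()
        for i in range(window, len(words)):
            prefix = tuple(words[i - window:i])
            all_counts[prefix] += 1
            if words[i] in positives:
                pos_counts[prefix] += 1

    rules = []
    for prefix, pos in pos_counts.items():
        if pos < min_contexts:
            continue  # H1: drop patterns seen in only one context
        precision = pos / all_counts[prefix]  # H2: estimated precision
        rules.append((" ".join(prefix), precision))
    # Highest estimated precision first; ties broken alphabetically.
    rules.sort(key=lambda rule: (-rule[1], rule[0]))
    return rules


sentences = [
    "flights to Paris today",
    "flights to Tokyo daily",
    "flights to nowhere fast",
    "headquartered in Berlin since",
    "went to Paris again",
]
rules = learn_rules(sentences, {"Paris", "Tokyo", "Berlin"})
```

Here only "flights to" survives H1, since the other prefixes precede a known instance just once, and its estimated precision is 2/3 because one of its three matches is not a known instance.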
- Subclass Extraction
- WordNet
- prefix, suffix
- Other rules (A, B and C)
- List Extraction
- Search with a random subset of high-probability instances
- Find HTML lists in the results
- Naive Bayes classifier to identify high-probability lists
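A rough sketch of the list-finding step, using only Python's standard `html.parser`. It assumes flat (non-nested) `<ul>`/`<ol>` lists, and it swaps the Naive Bayes classifier for a simple seed-overlap threshold: a list is kept only if it contains at least `min_seeds` known instances. `ListCollector` and `extract_candidates` are hypothetical names introduced for this illustration.

```python
from html.parser import HTMLParser
from typing import List, Set


class ListCollector(HTMLParser):
    """Collect the text of <li> items, grouped by enclosing list."""

    def __init__(self) -> None:
        super().__init__()
        self.lists: List[List[str]] = []   # completed lists
        self._open: List[List[str]] = []   # lists currently being read
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._open.append([])
        elif tag == "li" and self._open:
            self._in_item = True
            self._open[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self._open:
            self.lists.append(self._open.pop())
            self._in_item = False
        elif tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item and self._open and self._open[-1]:
            self._open[-1][-1] += data


def extract_candidates(html: str, seeds: Set[str],
                       min_seeds: int = 2) -> Set[str]:
    """Return non-seed items from lists containing enough seed instances."""
    parser = ListCollector()
    parser.feed(html)
    candidates: Set[str] = set()
    for items in parser.lists:
        cleaned = {item.strip() for item in items if item.strip()}
        if len(cleaned & seeds) >= min_seeds:  # stand-in for the classifier
            candidates |= cleaned - seeds
    return candidates


page = ("<ul><li>Paris</li><li>Tokyo</li><li>Lima</li></ul>"
        "<ul><li>Home</li><li>About</li><li>Paris</li></ul>")
found = extract_candidates(page, {"Paris", "Tokyo"})
```

The second list on the sample page looks like a navigation menu and contains only one seed, so it is skipped and only "Lima" is proposed as a new candidate.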
- Result
- Great improvement in recall (with fixed precision)
- Very high extraction rate with List Extraction (LE)
- This method issues a large number of web queries.
- The model needs a full run for each new focus; can it be used for real-time queries?