Bootstrapping Information Extraction from Semi-structured Web Pages
Citation
Carlson, A., C. Schafer. 2008. Bootstrapping Information Extraction from Semi-structured Web Pages. ECML PKDD '08: Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I, 195-210, Berlin, Heidelberg.
Online version
[www.cs.cmu.edu/~acarlson/papers/carlson-ecml08.pdf Carlson-ECML08]
Summary
Extracting structured records from semi-structured web pages is a problem that has recently received attention in machine learning and information extraction. Many existing techniques either require significant human effort to annotate data for each website, or rely on heuristics to extract the data types present in a page. This paper introduces a novel approach that extracts records from semi-structured web pages while requiring annotations for only a few pages from a small number of websites.
The method first requires a set of web pages annotated by a human. The annotator decides which schema columns of interest are present in the input pages and annotates a very small number of pages from four to six websites. Given this training data, the program trains four different classifiers (each using a different type of feature) to classify data for each annotated field. Using these trained classifiers, it then selects the extraction that maximizes the classifiers' confidence values.
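The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the four scoring functions (`token_clf`, `label_clf`, `tag_clf`, `position_clf`), the candidate node format, and the rule of averaging log-probabilities are all assumptions made for illustration, and the hand-written scorers stand in for classifiers that would actually be trained on the annotated pages.

```python
import math

# Hypothetical stand-ins for four trained classifiers, each looking at a
# different type of feature. Each returns an assumed confidence in (0, 1)
# that `node` holds the value of `field`.
def token_clf(node, field):
    has_digit = any(ch.isdigit() for ch in node["text"])
    if field == "price":
        return 0.9 if has_digit else 0.1
    return 0.2 if has_digit else 0.8

def label_clf(node, field):
    return 0.9 if node["label"].lower().startswith(field) else 0.3

def tag_clf(node, field):
    if field == "title":
        return 0.9 if node["tag"] == "h1" else 0.2
    return 0.6 if node["tag"] == "span" else 0.2

def position_clf(node, field):
    if field == "title":
        return 0.8 if node["position"] == 0 else 0.3
    return 0.5

def combined_confidence(probs):
    """Average log-probability across classifiers (an assumed combination rule)."""
    return sum(math.log(max(p, 1e-9)) for p in probs) / len(probs)

def extract_record(candidates, classifiers, fields):
    """For each field, keep the candidate node with the highest combined confidence."""
    record = {}
    for field in fields:
        best = max(
            candidates,
            key=lambda node: combined_confidence(
                [clf(node, field) for clf in classifiers]
            ),
        )
        record[field] = best["text"]
    return record

# Toy candidate text nodes from a single (made-up) vacation rental page.
candidates = [
    {"text": "Cozy lake cabin", "tag": "h1", "label": "", "position": 0},
    {"text": "$120 per night", "tag": "span", "label": "Price:", "position": 1},
    {"text": "Sleeps 4", "tag": "span", "label": "Capacity:", "position": 2},
]
classifiers = [token_clf, label_clf, tag_clf, position_clf]

record = extract_record(candidates, classifiers, ["title", "price"])
print(record)
```

Here each field is chosen independently per page; the summary does not say how the paper combines the classifiers or whether it scores whole extractions jointly, so that part of the sketch should be read as a placeholder.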
For evaluation, the authors use a regularized logistic regression classifier as the baseline. The technique is tested on two domains, vacation rental sites and job sites. They show that, by annotating 2-5 pages for 4-6 websites, the technique achieves an accuracy of 84% on job offer sites and 91% on vacation rental sites.