Blei et al, 2002

From Cohen Courses
Revision as of 22:21, 31 October 2010 by PastStudents (talk | contribs)

Citation

Carlson, A., S. Schafer. 2008. Bootstrapping Information Extraction from Semi-structured Web Pages. ECML PKDD '08: Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I, 2008, 195-210, Berlin, Heidelberg.

Online version

Carlson-ECML08

Summary

This paper introduces a novel hierarchical probabilistic model that combines global and local features during learning. The authors apply the model to extracting structured data from web pages. In this problem, word counts can be treated as a traditional i.i.d. (global) feature, while the formatting of words on a web page can be treated as a local feature. The paper also refers to these local features as scope-limited features.

The intuition behind the technique is to use global features to infer rules about the local features. For example, suppose we know the titles of a set of books. By searching for those titles on Amazon.com web pages, we can infer that the book title appears in the same position and font on most of the pages. We can then use these two features (the position and font of the book title on the page) to extract new book titles.
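The book-title example can be sketched in a few lines of Python. This is only an illustration of the intuition, not the paper's probabilistic model: the page representation (a list of nodes with `text`, `position`, and `font` keys) and the helper names `infer_local_rule` and `extract_with_rule` are assumptions made for this sketch, and a simple majority vote stands in for inference.

```python
from collections import Counter

def infer_local_rule(pages, known_values):
    """Vote for the (position, font) pair shared by nodes whose text is already known."""
    votes = Counter(
        (node["position"], node["font"])
        for page in pages
        for node in page
        if node["text"] in known_values
    )
    if not votes:
        return None  # no known value appeared on these pages
    return votes.most_common(1)[0][0]

def extract_with_rule(pages, rule):
    """Return the text of every node whose formatting matches the inferred rule."""
    return [
        node["text"]
        for page in pages
        for node in page
        if (node["position"], node["font"]) == rule
    ]

# Toy pages from one hypothetical site; the field names are illustrative only.
pages = [
    [{"text": "Dune", "position": "h1", "font": "bold"},
     {"text": "Frank Herbert", "position": "h2", "font": "regular"}],
    [{"text": "Emma", "position": "h1", "font": "bold"},
     {"text": "Jane Austen", "position": "h2", "font": "regular"}],
    [{"text": "Ulysses", "position": "h1", "font": "bold"},
     {"text": "James Joyce", "position": "h2", "font": "regular"}],
]
known_titles = {"Dune", "Emma"}

rule = infer_local_rule(pages, known_titles)
new_titles = [t for t in extract_with_rule(pages, rule) if t not in known_titles]
```

Here the two known titles agree on one formatting pattern, which is then applied to the third page to recover a title that was not in the seed set. The inferred rule is only meaningful within the site it was learned from, which is what makes the feature scope-limited.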


The method first requires a set of web pages annotated by a human. The annotator decides which schema columns are present in the input web pages and annotates a very small number of pages for four to six different websites. Given this training data, the program trains four different classifiers (each using a different type of feature) to classify data for each of the annotated fields. Using these trained classifiers, it then extracts the data that maximizes the classifiers' confidence values.
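The final extraction step can be sketched as follows. The summary does not say how the classifiers' confidence values are combined, so averaging them is an assumption made here; the function `best_candidate` and the toy scoring functions are likewise hypothetical stand-ins for trained classifiers.

```python
from statistics import mean

def best_candidate(candidates, classifiers):
    """Return (score, candidate) for the candidate with the highest mean confidence.

    Averaging is an assumed way to combine confidences, not the paper's stated rule.
    """
    scored = [(mean(clf(c) for clf in classifiers), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])

# Toy stand-ins for trained classifiers scoring a "price" field, each returning
# a confidence in [0, 1] from a different type of feature.
price_classifiers = [
    lambda s: 1.0 if s.startswith("$") else 0.0,         # formatting cue
    lambda s: sum(ch.isdigit() for ch in s) / len(s),    # content cue
]

candidates = ["Beach house", "$1200", "Sleeps 6"]
score, value = best_candidate(candidates, price_classifiers)
```

With these toy scorers, the candidate that both starts with a currency symbol and consists mostly of digits receives the highest combined confidence and is selected for the field.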

To evaluate the method, the authors use a logistic regression classifier as the baseline. The technique is tested on two domains: vacation rentals and job sites. They show that by annotating 2-5 pages for 4-6 websites, the technique achieves an accuracy of 84% on job offer sites and 91% on vacation rental sites.

Related papers