Blei et al, 2002

From Cohen Courses
== Citation ==
  
Blei, D., Bagnell, J., & McCallum, A. (2002). Learning with scope, with application to information extraction and classification. In Proceedings of the 2002 Conference on Uncertainty in Artificial Intelligence.
  
 
== Online version ==
  
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.1015&rep=rep1&type=pdf Scope_learning]
  
 
== Summary ==
  
This [[Category::paper]] introduces a hierarchical probabilistic model that combines global and local features in the learning process. The authors apply the technique to [[AddressesProblem::extracting structured data from webpages]]. In this problem, word counts can be treated as traditional i.i.d. features (global features), while the formatting of words on a web page can be treated as local features. The paper also calls these local features scope-limited features.
  
The intuition behind the technique is to use global features to infer rules about the local features. For example, suppose that we already know the titles of a set of books. By searching for those known titles on Amazon.com web pages, we can infer the position and font in which book titles appear. We can then use these two features (the position and font of book titles on the page) to extract new book titles from other web pages.
  
The paper describes both generative and discriminative approaches for classification and extraction tasks. Global features are governed by parameters that are shared by all the data, whereas local features are governed by parameters shared only by a subset of the data. For example, in an information extraction task, the words in a web page (ignoring formatting) can be treated as global features, while features such as the position or color of a piece of text are local features.
  
In the generative model, each document is modeled with a random variable <math> \phi </math> that governs its local features. The model uses the following notation:

- The N words of a document are denoted by <math> w=\{w_1,w_2,...,w_N\}</math>

- The formatting features are denoted by <math> f=\{f_1,f_2,...,f_N\} </math>

- The class labels are denoted by <math> c=\{c_1,c_2,...,c_N\} </math>

The model is defined by the following joint distribution over local parameters, class labels, words, and formatting features:

<math> p(\phi,c,w,f)=p(\phi)\prod_{n=1}^N p(c_n)p(w_n|c_n)p(f_n|c_n,\phi)</math>
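
As a rough illustration of this factorization, the sketch below evaluates the log of the joint distribution for one toy document. It assumes discrete words and formatting features and represents <math> \phi </math> as a per-document table of formatting probabilities for each class. The table layout, the names (<code>log_joint</code>, <code>prior_c</code>, <code>p_w_given_c</code>, <code>phi</code>) and all numbers are illustrative assumptions, not taken from the paper; the prior term <math> p(\phi) </math> is simply passed in as a precomputed log value.

```python
import numpy as np

def log_joint(c, w, f, prior_c, p_w_given_c, phi, log_p_phi=0.0):
    """log p(phi, c, w, f) for one document (hypothetical helper).

    c, w, f      : integer sequences of length N (class, word id, formatting id)
    prior_c      : shape (C,), p(c)
    p_w_given_c  : shape (C, V), global word model shared by all documents
    phi          : shape (C, F), local formatting model for this document
    log_p_phi    : log p(phi), assumed to be supplied by the caller
    """
    c, w, f = np.asarray(c), np.asarray(w), np.asarray(f)
    # One term per position n: log p(c_n) + log p(w_n|c_n) + log p(f_n|c_n, phi)
    terms = (np.log(prior_c[c])
             + np.log(p_w_given_c[c, w])
             + np.log(phi[c, f]))
    return log_p_phi + terms.sum()

# Toy setup: 2 classes, 3 word types, 2 formatting types (made-up numbers).
prior_c = np.array([0.5, 0.5])
p_w_given_c = np.array([[0.5, 0.25, 0.25],
                        [0.25, 0.25, 0.5]])
phi = np.array([[0.5, 0.5],
                [0.25, 0.75]])

value = log_joint(c=[0, 1], w=[0, 2], f=[0, 1],
                  prior_c=prior_c, p_w_given_c=p_w_given_c, phi=phi)
```

Here only <code>phi</code> would change from one document (or web site) to the next, while <code>prior_c</code> and <code>p_w_given_c</code> are shared, which is the global/local split described above.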
 
The parameters are estimated by maximum likelihood on a set of training documents. For inference, one approach is to approximate the local parameter <math> \phi </math> with a point estimate <math> \hat{\phi} </math> and infer the class labels by MAP estimation. Each word/formatting pair <math> (w_n, f_n) </math> is then labeled by:

<math> \hat{c}_n=\arg\max_{c_n}p(w_n|c_n)p(f_n|c_n,\hat{\phi})p(c_n) </math>
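
The labeling rule can be sketched in a few lines, assuming discrete features and probability tables that are already available. The function name <code>map_labels</code>, the two classes ("not a title" = 0, "title" = 1), the two formatting values (plain = 0, bold = 1) and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def map_labels(w, f, prior_c, p_w_given_c, phi_hat):
    """Return argmax_c p(w_n|c) p(f_n|c, phi_hat) p(c) for every position n."""
    w, f = np.asarray(w), np.asarray(f)
    # scores[n, c] = log p(c) + log p(w_n|c) + log p(f_n|c, phi_hat)
    scores = (np.log(prior_c)[None, :]
              + np.log(p_w_given_c[:, w]).T
              + np.log(phi_hat[:, f]).T)
    return scores.argmax(axis=1)

# Made-up tables: class 0 = not a title, class 1 = title.
prior_c = np.array([0.8, 0.2])
p_w_given_c = np.array([[0.6, 0.3, 0.1],    # p(w | c=0)
                        [0.1, 0.3, 0.6]])   # p(w | c=1)
phi_hat = np.array([[0.9, 0.1],             # p(f | c=0): mostly plain
                    [0.2, 0.8]])            # p(f | c=1): mostly bold

# Word 1 is equally likely under both classes; its bold formatting decides.
labels = map_labels(w=[0, 1, 2], f=[0, 1, 1],
                    prior_c=prior_c, p_w_given_c=p_w_given_c, phi_hat=phi_hat)
```

In this toy case the middle word is ambiguous under the global word model alone, and the local formatting term is what moves it to the title class.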
 
The point estimate can be taken as the posterior mode, <math> \hat{\phi}=\arg\max_{\phi}p(\phi|f,w) </math>. The authors use the EM algorithm to find it, treating the unknown class labels as hidden variables and maximizing the expected log likelihood of the formatting features.
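
A minimal EM sketch for this estimate is given below. It is not the authors' implementation: it assumes that formatting features are discrete, that <math> \phi </math> is a per-document table of formatting probabilities for each class, and that a small additive pseudo-count stands in for the prior <math> p(\phi) </math>. The name <code>em_local_params</code> and the toy data are hypothetical.

```python
import numpy as np

def em_local_params(w, f, prior_c, p_w_given_c, n_formats,
                    n_iter=50, smoothing=1e-2):
    """Estimate a per-document formatting table phi[c, f] with EM.

    Class labels are hidden and the global word model stays fixed.
    `smoothing` is an additive pseudo-count (an assumption of this sketch).
    Returns phi and the class posteriors q from the final E-step.
    """
    w, f = np.asarray(w), np.asarray(f)
    n_classes = len(prior_c)
    phi = np.full((n_classes, n_formats), 1.0 / n_formats)  # uniform start
    one_hot_f = np.eye(n_formats)[f]                        # shape (N, F)
    for _ in range(n_iter):
        # E-step: q[n, c] proportional to p(c) p(w_n|c) p(f_n|c, phi)
        q = prior_c[None, :] * p_w_given_c[:, w].T * phi[:, f].T
        q /= q.sum(axis=1, keepdims=True)
        # M-step: expected formatting counts per class, then normalize rows
        counts = q.T @ one_hot_f + smoothing                # shape (C, F)
        phi = counts / counts.sum(axis=1, keepdims=True)
    return phi, q

# Made-up global tables: class 0 = not a title, class 1 = title.
prior_c = np.array([0.8, 0.2])
p_w_given_c = np.array([[0.6, 0.3, 0.1],
                        [0.1, 0.3, 0.6]])

# One toy document: formatting 0 = plain, 1 = bold.
w = [0, 0, 0, 2, 2, 1, 1]
f = [0, 0, 0, 1, 1, 1, 0]
phi_hat, q = em_local_params(w, f, prior_c, p_w_given_c, n_formats=2)
```

Starting from a uniform table, the first E-step relies on the words alone; later iterations pick up that bold text co-occurs with title-like words, which mirrors the idea of using global features to infer rules about local ones.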
 
The method was tested on two datasets. The first contains 1000 HTML documents. Each document is automatically divided into sets of words with similar layout characteristics, and each set is hand-labeled as containing or not containing a job title. The local and global features for this domain are the same as those described above. The second dataset contains 42,548 web pages from 330 web sites, with each page hand-labeled as being a press release or not. The global features are the words in each web page, and the local feature is the URL of the page. The experimental results show that this approach can obtain high precision with low to moderate recall.

Latest revision as of 15:55, 24 November 2010
