Difference between revisions of "Blei et al, 2002"

From Cohen Courses
Jump to navigationJump to search
Line 16: Line 16:
 
They have described both generative and discriminative approaches for classification and extraction tasks. Global features are govern by parameters that are shared by all data and local features are shared only by a subset of data. For example in information extraction task, all the words in a webpage (without considering formatting) can be considered as global features. On the other hand, features such as position of a text box or color of a text are local features.  
 
They have described both generative and discriminative approaches for classification and extraction tasks. Global features are govern by parameters that are shared by all data and local features are shared only by a subset of data. For example in information extraction task, all the words in a webpage (without considering formatting) can be considered as global features. On the other hand, features such as position of a text box or color of a text are local features.  
  
They have tested their method on two different datasets. The first dataset contains 1000 HTML documents. Each document is automatically divided into a set of words with similar layout characteristics and then are hand-labeled as containing or not containing a job title. The local and global features for this domain are the same as what we discussed above. The second dataset contain 42,548 web pages from 330 web sites which each web page is hand-labeled as if it is a press release or not press release. The global feature is a set of word in each webpage and local feature is the URL of the webpage.  
+
They have tested their method on two different datasets. The first dataset contains 1000 HTML documents. Each document is automatically divided into a set of words with similar layout characteristics and then are hand-labeled as containing or not containing a job title. The local and global features for this domain are the same as what we discussed above. The second dataset contain 42,548 web pages from 330 web sites which each web page is hand-labeled as if it is a press release or not press release. The global feature is a set of word in each webpage and local feature is the URL of the webpage. Their experimental result have shown that this approach can obtain high precision and low/moderate recall.  
  
 
== Related papers ==
 
== Related papers ==

Revision as of 22:56, 31 October 2010

Citation

Carlson, A., S. Schafer. 2008. Bootstrapping Information Extraction from Semi-structured Web Pages. ECML PKDD '08: Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I, 2008, 195-210, Berlin, Heidelberg.

Online version

Carlson-ECML08

Summary

This paper introduces a novel hierarchical probabilistic model that combines both global and local features in the learning process. They have applied their technique for extracting structured data from webpages. In this problem, word count can be considered as a traditional iid feature (global feature) and word formatting in the web page can also be considered as local features. These local features are call in this paper also as scope limited features.

The intuition behind their technique is to use global features to infer rules about the local features. For example suppose that we know the name of a set of books. Then by looking at webpages of Amazon.com and by searching the name of books we can infer that the position and font of the book title is the same in most the webpages. We can then use these two features (position and font of book title in web pages) to extract new book titles.

They have described both generative and discriminative approaches for classification and extraction tasks. Global features are govern by parameters that are shared by all data and local features are shared only by a subset of data. For example in information extraction task, all the words in a webpage (without considering formatting) can be considered as global features. On the other hand, features such as position of a text box or color of a text are local features.

They have tested their method on two different datasets. The first dataset contains 1000 HTML documents. Each document is automatically divided into a set of words with similar layout characteristics and then are hand-labeled as containing or not containing a job title. The local and global features for this domain are the same as what we discussed above. The second dataset contain 42,548 web pages from 330 web sites which each web page is hand-labeled as if it is a press release or not press release. The global feature is a set of word in each webpage and local feature is the URL of the webpage. Their experimental result have shown that this approach can obtain high precision and low/moderate recall.

Related papers