Difference between revisions of "Cohen and Carvalho, 2005"

Latest revision as of 17:33, 1 February 2011

Citation

Cohen, W. W., & Carvalho, V. (2005). Stacked sequential learning. In Proceedings of the international joint conference on artificial intelligence (IJCAI).

Online version

[[1]]

Summary

This paper introduces a novel schema for stacked sequential learning. Stacked sequential learning is a meta-learning technique that takes an arbitrary base learner and improves its performance by making it aware of nearby examples. In the experiments they have shown that stacked sequential technique improves performance of both sequential (e.g. conditional random field) and non-sequential algorithms. This technique can be applied on any base learner and imposes only a constant overhead in training time.

This technique is evaluated on the problem of recognizing the signature section of an email. The dataset contains 617 emails. Each email is represented by a vector of feature vectors. Each element of this vector is a feature vector for each line of the email. A line is labeled as positive if it is part of the email signature, otherwise it is labeled as negative. Hand-crafted features (e.g. "line is blank") are used to represent each line of the email. The dataset totally contains 33,013 lines which about 10% are labeled as positive. They have also tested their technique on additional segmentation tasks such as classifying lines from FAQ documents, video segmentation, etc.

Given a sample S, the algorithm first segments S into K equal-sized disjoint subsets ( $S_{1},...,S_{k}$ ) and learn K functions $f_{1},...,f_{k}$ . Each function $f_{i}$ is trained on all the data except the i'th subset $S_{i}$ . We then construct set $S'=\{(x_{t},y'_{t}):y'_{t}=f_{j}(x_{t})\ and\ x_{t}\in S_{j}\}$ . This is the basis of sequential stacking algorithm. Set $S'$ is then used to create a new dataset of extended instances. An extended instance in a simplest case is a vector composed of an instance $(x_{i},y'_{i-1})$ where $y'_{i-1}$ is the (i-1)-th label in $y'$ .

In the initial results they have shown that stacked sequential learning can reduce the error of Maximum Entropy (ME) technique from 3.20% to 2.63%. They have also shown that they can achieve error rate of 0.71% by choosing window size of 20 in s-ME technique. Comparing to error rate of CRF which is 1.17% this is a statistically significant improvment.

@@ Line 1: / Line 1: @@
 == Citation ==
-Carlson, A., S. Schafer. 2008. Bootstrapping Information Extraction from Semi-structured Web Pages. ECML PKDD '08: Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I, 2008, 195-210, Berlin, Heidelberg.
+Cohen, W. W., & Carvalho, V. (2005). Stacked sequential learning. In Proceedings of the international joint
+conference on artificial intelligence (IJCAI).
 == Online version ==
-[[www.cs.cmu.edu/~acarlson/papers/carlson-ecml08.pdf|Carlson-ECML08]]
+[[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.8637&rep=rep1&type=pdf|Cohen-Carvalho]]
 == Summary ==
-This [[Category::paper]] introduces a novel schema for [[method::stacked sequential learning]]. [[method::Stacked sequential learning]] is a meta-learning technique that takes an arbitrary base learner and improves its performance by making it aware of nearby examples. In the experiments they have shown the stacked sequential technique improves performance of both sequential (e.g. [[method::conditional random field]]) and non-sequential algorithms. Their technique can be applied on any base learner and imposes only a constant overhead in training time.
+This [[Category::paper]] introduces a novel schema for [[UsesMethod::stacked sequential learning]]. [[UsesMethod::Stacked sequential learning]] is a meta-learning technique that takes an arbitrary base learner and improves its performance by making it aware of nearby examples. In the experiments they have shown that stacked sequential technique improves performance of both sequential (e.g. [[conditional random field]]) and non-sequential algorithms. This technique can be applied on any base learner and imposes only a constant overhead in training time.
-This technique is evaluated on the problem of recognizing the signature section of an email. The dataset contains 617 emails. Each email is represented by a vector of feature vectors. Each element of this vector is a feature vector for each line of the email. A line is labeled as positive if it is part of the email signature, otherwise it is labeled as negative. Hand-crafted features (e.g. "line is blank") are used to represent each line of the email. The dataset totally contains 33,013 lines which about 10% are labeled as positive. They have also tested their technique on additional segmentation tasks such as classifying lines from FAQ documents, video segmentation etc.
+This technique is evaluated on the problem of recognizing the signature section of an email. The dataset contains 617 emails. Each email is represented by a vector of feature vectors. Each element of this vector is a feature vector for each line of the email. A line is labeled as positive if it is part of the email signature, otherwise it is labeled as negative. Hand-crafted features (e.g. "line is blank") are used to represent each line of the email. The dataset totally contains 33,013 lines which about 10% are labeled as positive. They have also tested their technique on additional segmentation tasks such as classifying lines from FAQ documents, video segmentation, etc.
-Given a sample S, the algorithm first segments S into K equal-sized disjoint subsets (<math>S_1,...,S_k</math>) and learn K functions f_1,...,f_k. Each function f_i is trained on all data except the i'th (i.e. trained on subset S_i).
+Given a sample S, the algorithm first segments S into K equal-sized disjoint subsets (<math>S_1,...,S_k</math>) and learn K functions <math>f_1,...,f_k</math>. Each function <math>f_i</math> is trained on all the data except the i'th subset <math>S_i</math>. We then construct set <math>S'=\{(x_t,y'_t) : y'_t=f_j(x_t)\ and\ x_t \in S_j\}</math>. This is the basis of sequential stacking algorithm. Set <math>S'</math> is then used to create a new dataset of extended instances. An extended instance in a simplest case is a vector composed of an instance <math>(x_i,y'_{i-1})</math> where <math> y'_{i-1} </math> is the (i-1)-th label in <math>y'</math>.
+In the initial results they have shown that stacked sequential learning can reduce the error of [[Maximum Entropy]] (ME) technique from 3.20% to 2.63%. They have also shown that they can achieve error rate of 0.71% by choosing window size of 20 in [[s-ME]] technique. Comparing to error rate of CRF which is 1.17% this is a statistically significant improvment.
-In the first experimental results that they have done in they paper it is shown that [[method::maximum entropy markov model]] (MEMM) performs poorly on signature-detection problem.
-The intuition behind their technique is to use global features to infer rules about the local features. For example suppose that we know the name of a set of books. Then by looking at webpages of Amazon.com and by searching the name of books we can infer that the position and font of the book title
-is the same in most the webpages. We can then use these two features (position and font of book title in web pages) to extract new book titles.
-They have described both generative and discriminative approaches for classification and extraction tasks. Global features are govern by parameters that are shared by all data and local features are shared only by a subset of data. For example in information extraction task, all the words in a webpage (without considering formatting) can be considered as global features. On the other hand, features such as position of a text box or color of a text are local features.
-They have tested their method on two different datasets. The first dataset contains 1000 HTML documents. Each document is automatically divided into a set of words with similar layout characteristics and then are hand-labeled as containing or not containing a job title. The local and global features for this domain are the same as what we discussed above. The second dataset contain 42,548 web pages from 330 web sites which each web page is hand-labeled as if it is a press release or not press release. The global feature is a set of word in each webpage and local feature is the URL of the webpage. Their experimental result have shown that this approach can obtain high precision and low/moderate recall.
 == Related papers ==

Difference between revisions of "Cohen and Carvalho, 2005"

Latest revision as of 17:33, 1 February 2011

Contents

Citation

Online version

Summary

Related papers

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools