Bootstrapping

From Cohen Courses
Revision as of 22:24, 29 November 2011 by Emayfiel

Bootstrapping refers to a general class of algorithms that learn from a small set of labeled examples together with a large amount of unlabeled data.

Motivation

By far the largest cost in applied machine learning, in terms of human labor, is data annotation. Training a supervised classifier requires a large number of fully labeled examples. As a result, most research is limited to popular public corpora, such as the Penn Treebank, or to data whose labels can be derived easily from metadata, such as crawled movie reviews in sentiment analysis.

For novel research problems or real-world applications, however, no such labeled data exists and annotation is unavoidable. Bootstrapping reduces this cost by labeling features rather than instances. An example from the domain of topic recognition is the word puck: in a classifier deciding whether a document is about baseball or hockey, puck is a very high-precision indicator of hockey, so labeling that one feature does the work of labeling every document that contains it.
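
The idea of labeling features rather than instances can be illustrated with a minimal sketch. This is a simplified toy loop, not a specific published algorithm: the function names (label_by_seeds, grow_seeds, bootstrap), the min_count threshold, and the six example documents are all hypothetical and chosen only for illustration. Seed words label the documents that contain them, and words that then occur only in documents of a single class are promoted to new seeds.

```python
from collections import Counter, defaultdict


def label_by_seeds(docs, seeds):
    """Label each document whose seed words all point to the same class."""
    labels = {}
    for i, doc in enumerate(docs):
        classes = {seeds[w] for w in doc.split() if w in seeds}
        if len(classes) == 1:  # skip unlabeled and conflicting documents
            labels[i] = classes.pop()
    return labels


def grow_seeds(docs, seeds, labels, min_count=2):
    """Promote words seen in at least min_count labeled documents of one class only."""
    counts = defaultdict(Counter)
    for i, label in labels.items():
        for w in set(docs[i].split()):
            counts[w][label] += 1
    new_seeds = dict(seeds)
    for w, by_class in counts.items():
        if w not in seeds and len(by_class) == 1:
            label, n = next(iter(by_class.items()))
            if n >= min_count:
                new_seeds[w] = label
    return new_seeds


def bootstrap(docs, seeds, rounds=3):
    """Alternate between labeling documents and growing the seed set."""
    for _ in range(rounds):
        labels = label_by_seeds(docs, seeds)
        seeds = grow_seeds(docs, seeds, labels)
    return seeds, label_by_seeds(docs, seeds)


# Toy unlabeled corpus (hypothetical data).
docs = [
    "the puck slid across the ice",
    "he shot the puck over the ice",
    "the pitcher allowed a home run that inning",
    "our pitcher lasted one more inning",
    "skating on fresh ice all night",
    "the pitcher walked two batters",
]

# Two hand-labeled features instead of six hand-labeled documents.
seeds = {"puck": "hockey", "inning": "baseball"}

final_seeds, final_labels = bootstrap(docs, seeds)
print(final_seeds)
print(final_labels)
```

In this toy run, the last two documents contain neither puck nor inning, yet they receive labels once ice and pitcher have been promoted from the documents labeled in the first round. A realistic system would need a stricter promotion criterion than this one, since with so few documents a common word could be promoted by chance.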

Relevant Papers