Bootstrapping is a term used to define a general class of algorithms which benefit from a small set of labeled examples and a large amount of unlabeled data. Here, we specifically use text data as an example domain, though the process can be generalized to non-text domains.
By far the largest cost in applied machine learning, in terms of human labor, is annotation of data. In order to train supervised classifiers, a large number of fully labeled examples are necessary. This means that most research is limited to popular public corpora, such as the Penn Treebank, or data which can be easily labeled from metadata, such as crawled movie reviews in sentiment analysis. For novel research problems or real-world applications, however, annotation is necessary.
To avoid the slowest aspects of data annotation, a small number of high-precision seed features can be used. From this small number of carefully selected words, data can be quickly annotated and a model can be trained using this larger set of training information. Iteratively, new seeds can be identified from the output of this model; new documents can then be labelled based on the larger seed set; new models can then be trained on the newly labelled documents.
The following algorithm generically describes the process of bootstrapping:
* Initialization - A small number of hand-chosen seed keywords for each class are applied to the entire set of unlabeled examples, resulting in labels for those examples matching the keywords. * Iterate Bootstrapping: * Using Labels to Improve the Model: The information from the labeled examples is used to generate a new, more reliable, model. * Relabeling: The model is used to generate new labels for the unlabeled data.
This algorithm, much like expectation-maximization, can be instantiated in innumerable different ways. Cases are given below which exemplify some applications of bootstrapping.
Example from Topic Classification
An example seed from the domain of topic recognition is the word puck. In a classifier identifying whether a document is about baseball or hockey, the word puck is going to be a very high-precision indicator. Using these seeds, labels can be assigned to documents containing those seeds. If the seeds are balanced across classes, this will result in a small set of labelled examples from hockey and a small set from baseball, all of which are very likely to be correctly labelled.
From these instances, the most indicative words not currently in the seed set may be added. In this case, it may be slightly less definite words, such as base. The process then repeats, progressively finding lower precision seed words but annotating more and more training examples.
Bootstrapping with Patterns
This unstructured approach to bootstrapping, which focuses on classification tasks, generalizes well to structured problems. Information extraction is the most frequent structured domain where bootstrapping is used. In this context, seed patterns are chosen, instead of seed terms as before. Instead of labeling documents, they are instead used to find the words which occur in those high-precision contexts, essentially generating dictionaries of words which fall into a predefined category. These word lists can then be used to find new patterns, and again an iterative process can be implied. A key example here is named person extraction - a seed phrase such as [X] was hired will be very high-precision (few things can be hired which are not people, and the structure is relatively syntactically unambiguous). The tokens which fill in slot [X] can then be assumed to be people, and accordingly, contexts in which those people occur can then be used as new patterns.
Ellen Riloff has done a large amount of work on pattern extraction bootstrapping, such as Riloff et al., 2003.
Bootstrapping has close ties to active learning and co-training. Alternating between feature labeling and instance labeling, for instance, has been a major thrust of very recent work such as Settles, 2011.