Machine Learning with Large Datasets 10-605
Description
Large datasets are difficult to work with for several reasons. They are difficult to visualize, and it is difficult to understand what sorts of errors and biases they contain. They are computationally expensive to process, and the cost of learning is often hard to predict: for instance, an algorithm that runs quickly on a dataset that fits in memory may become exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.
This course is intended to provide students with practical knowledge of, and experience with, the issues involved in working with large datasets. Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements of learning methods, such as feature hashing and Bloom filters; and techniques for analyzing programs in terms of memory, disk usage, and (for parallel methods) communication complexity.
The class will include programming assignments and a one-month short project chosen by the student. The project will be designed to compare the scalability of variant learning algorithms on large datasets.
Prerequisites
An introductory course in machine learning is a prerequisite or a co-requisite. Undergraduates need the permission of the instructor to enroll. The course assumes programming knowledge, in particular some knowledge of Java.
Planned Topics
- Week 1. Overview of course, and overview lecture on probabilities.
- Week 2. Streaming learning algorithms.
- Naive Bayes for discrete data.
- A streaming-data implementation of Naive Bayes.
- A streaming-data implementation of Naive Bayes assuming a larger-than-memory feature set, by using the 'stream and sort' pattern.
- Discussion of other streaming learning methods.
- Assignment: two implementations of Naive Bayes, one with feature weights in memory and one purely streaming (a counting-pass sketch follows below).
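For concreteness, here is a minimal sketch of the counting pass for the in-memory variant, assuming input lines of the form `label<TAB>word word ...`; class and field names are illustrative, not a reference solution for the assignment:

```java
import java.io.*;
import java.util.*;

// Sketch of the in-memory counting pass for streaming Naive Bayes:
// one pass over labeled examples, keeping event counters in hash maps.
public class StreamingNB {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> labelCounts = new HashMap<>();
        Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("\t", 2);    // label TAB tokenized text
            String label = parts[0];
            labelCounts.merge(label, 1, Integer::sum);            // #(Y=label)
            Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
            for (String w : parts[1].split("\\s+"))
                counts.merge(w, 1, Integer::sum);                 // #(Y=label, W=w)
        }
        // A test document is then scored with log P(y) + sum_w log P(w|y),
        // using Laplace smoothing for words never seen with label y.
    }
}
```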
- Week 3. Examples of more complex programs using stream-and-sort.
- Lecture topics:
- Finding informative phrases in a corpus, and finding polar phrases in a corpus.
- Using records and messages to manage a complex dataflow.
- Key ideas: asymptotic time and disk-usage analysis of programs; using unlabeled data and distantly-labeled data.
- Assignment: phrase-finding and sentiment classification (a stream-and-sort sketch follows below).
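As a concrete example of the pattern, the sketch below shows the aggregation half of a stream-and-sort counting job. It assumes an upstream step (a hypothetical `EmitCounts` program) has printed `key<TAB>count` lines that Unix `sort` has grouped by key, so each key's records arrive adjacent and only constant memory is needed:

```java
import java.io.*;

// Sketch of the aggregation half of 'stream and sort': input on stdin is
// "key<TAB>count" lines already sorted by key, so a run of equal keys can be
// summed with O(1) state instead of a large in-memory hash table.
public class SumCounts {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String prevKey = null;
        long total = 0;
        String line;
        while ((line = in.readLine()) != null) {
            String[] kv = line.split("\t");
            if (prevKey != null && !kv[0].equals(prevKey)) {
                System.out.println(prevKey + "\t" + total);  // flush finished run
                total = 0;
            }
            prevKey = kv[0];
            total += Long.parseLong(kv[1]);
        }
        if (prevKey != null) System.out.println(prevKey + "\t" + total);
    }
}
```

A typical pipeline would then look like `java EmitCounts < docs.txt | sort -k1,1 | java SumCounts`.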
- Week 4. The map-reduce paradigm and Hadoop.
- Assignment: Hadoop re-implementation of assignments 1 and 2 (a minimal mapper/reducer sketch follows).
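The same counting job translates almost line-for-line into Hadoop's map-reduce API: the mapper plays the role of the 'emit' step and the shuffle replaces the Unix sort. A minimal sketch, with job/driver setup omitted and illustrative class names:

```java
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

// Word counting as map-reduce: mappers emit (word, 1) records, the framework
// sorts and groups them by key, and reducers sum each group.
public class CountWords {
    public static class EmitMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        public void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String w : value.toString().split("\\s+")) {
                word.set(w);
                ctx.write(word, ONE);       // same records as stream-and-sort
            }
        }
    }
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : vals) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
}
```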
- Week 5. Reducing memory usage with randomized methods.
- Feature hashing and Vowpal Wabbit.
- Bloom filters for counting events.
- Locality-sensitive hashing.
- Assignment: memory-efficient Naive Bayes (a feature-hashing sketch follows).
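The core of the hashing trick fits in a few lines: feature strings are mapped straight to indices in a fixed-size weight vector, so no feature dictionary is ever held in memory. The sketch below is illustrative; the dimension and the sign-hash scheme are assumptions, not Vowpal Wabbit's exact implementation:

```java
// Feature hashing: hash each feature name to a bucket in a fixed-size array.
// A second hash chooses a sign, which reduces the bias from collisions.
public class HashedVector {
    static final int DIM = 1 << 20;                  // 2^20 buckets (illustrative)
    final double[] weights = new double[DIM];

    static int index(String feature) {
        return Math.floorMod(feature.hashCode(), DIM);
    }
    static int sign(String feature) {
        return ((("s:" + feature).hashCode() & 1) == 0) ? 1 : -1;
    }
    double dot(String[] features) {                  // score a binary-feature example
        double s = 0;
        for (String f : features) s += sign(f) * weights[index(f)];
        return s;
    }
    void add(String[] features, double scale) {      // e.g. a gradient update
        for (String f : features) weights[index(f)] += sign(f) * scale;
    }
}
```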
- Weeks 6-7. Nearest-neighbor finding and bulk classification.
- Using a search engine to find approximate nearest neighbors.
- Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
- Implementing soft joins using map-reduce and nearest-neighbor methods.
- The local k-NN graph for a dataset.
- Assignment: a tool for building an approximate k-NN graph for a large dataset (an inverted-index sketch follows).
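A sketch of the inverted-index idea: only documents that share at least one term with the query are ever scored, which is what makes bulk k-NN search feasible. A realistic version would weight matches by TF-IDF and prune very frequent terms; the version below simply counts shared terms:

```java
import java.util.*;

// Inverted index for approximate nearest neighbors under term overlap:
// postings map each term to the documents containing it, so candidate
// neighbors are found without comparing the query to every document.
public class InvertedIndexKNN {
    final Map<String, List<Integer>> postings = new HashMap<>();

    void add(int docId, Set<String> terms) {
        for (String t : terms)
            postings.computeIfAbsent(t, k -> new ArrayList<>()).add(docId);
    }
    Map<Integer, Integer> candidates(Set<String> queryTerms) {
        Map<Integer, Integer> overlap = new HashMap<>();
        for (String t : queryTerms)
            for (int d : postings.getOrDefault(t, Collections.emptyList()))
                overlap.merge(d, 1, Integer::sum);   // number of shared terms
        return overlap;   // the caller keeps the k best-scoring docIds
    }
}
```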
- Weeks 8-10. Working with large graphs.
- PageRank and random walk with restart / personalized PageRank (RWR/PPR); a power-iteration sketch follows this block of topics.
- Special issues involved with iterative processing on graphs in Map-Reduce: the schimmy pattern.
- Formalisms/environments for iterative processing on graphs: GraphLab, Spark, Pregel.
- Extracting small graphs from a large one:
- LocalSpectral: finding the meaningful neighborhood of a query node in a large graph.
- Visualizing graphs.
- Semi-supervised classification on graphs.
- Clustering and community-finding in graphs.
- Assignment: snowball sampling a graph with LocalSpectral and visualizing the results.
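For reference, the power-iteration step that the map-reduce and schimmy-pattern discussions distribute can be sketched in memory as follows; the damping factor 0.85 is the conventional choice, and dangling-node mass is ignored here for brevity:

```java
import java.util.*;

// In-memory PageRank power iteration over an adjacency list. Each iteration
// spreads 85% of a node's rank along its out-edges and mixes in 15%
// teleportation mass; the distributed versions parallelize this same step.
public class PageRank {
    public static double[] run(List<int[]> outLinks, int iters) {
        int n = outLinks.size();
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n);                     // uniform start
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            Arrays.fill(next, 0.15 / n);              // teleport mass
            for (int u = 0; u < n; u++) {
                int[] out = outLinks.get(u);
                if (out.length == 0) continue;        // dangling mass dropped here
                double share = 0.85 * pr[u] / out.length;
                for (int v : out) next[v] += share;
            }
            pr = next;
        }
        return pr;
    }
}
```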
- Week 11. Stochastic gradient descent and other streaming learning algorithms.
- SGD for logistic regression.
- SGD with large feature sets: delayed regularization-based updates; projection onto the L1 ball; truncated gradients (a lazy-regularization sketch follows).
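A sketch of the delayed (lazy) regularization idea for sparse SGD: rather than shrinking every weight on every example, each weight is shrunk only when its feature next appears, by the factor accumulated over the steps it missed. The hyperparameters and the integer feature encoding are illustrative assumptions:

```java
// Sparse SGD for logistic regression with lazy L2 regularization.
// lastTouched[j] records the step at which weight j was last updated, so the
// missed shrinkage can be applied all at once when feature j reappears.
public class LazySGD {
    final double lr = 0.1, lambda = 1e-4;     // illustrative hyperparameters
    final double[] w;
    final int[] lastTouched;
    int t = 0;

    LazySGD(int dim) { w = new double[dim]; lastTouched = new int[dim]; }

    void update(int[] feats, int label) {     // label in {0,1}, binary features
        t++;
        double score = 0;
        for (int j : feats) {
            // catch-up shrinkage for the steps this weight skipped
            w[j] *= Math.pow(1 - lr * lambda, t - lastTouched[j]);
            lastTouched[j] = t;
            score += w[j];
        }
        double p = 1.0 / (1.0 + Math.exp(-score));
        for (int j : feats) w[j] += lr * (label - p);  // log-loss gradient step
    }
}
```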
- Weeks 12-15. Additional topics.
- Scalable k-means clustering (a streaming sketch follows this list).
- Gibbs sampling and streaming LDA.
- Stacking and cascaded learning approaches.
- Decision tree learning for large datasets.
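As one example from this block, scalable k-means is often presented in a streaming or mini-batch form, in the spirit of Sculley's web-scale k-means: each point nudges its nearest center with a per-center shrinking step size. A minimal sketch, with initialization and convergence checks omitted:

```java
// Streaming k-means: assign each arriving point to its nearest center, then
// move that center toward the point with step size 1/(points seen by center).
public class StreamingKMeans {
    final double[][] centers;    // assumed pre-initialized, e.g. random points
    final int[] counts;

    StreamingKMeans(double[][] init) {
        centers = init;
        counts = new int[init.length];
    }
    void observe(double[] x) {
        int best = 0;
        double bestD = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double d = 0;
            for (int j = 0; j < x.length; j++) {
                double diff = centers[c][j] - x[j];
                d += diff * diff;                    // squared Euclidean distance
            }
            if (d < bestD) { bestD = d; best = c; }
        }
        counts[best]++;
        double eta = 1.0 / counts[best];             // shrinking step size
        for (int j = 0; j < x.length; j++)
            centers[best][j] += eta * (x[j] - centers[best][j]);
    }
}
```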