Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012
Schedule
- Overviews [1 week]
- Lecture: Overview of course, cost of various operations
- Lecture: Review of probabilities
- Streaming Learning algorithms [1.5 weeks]
- Lecture: Naive Bayes, and a streaming implementation of it (features in memory).
- Lecture: Naive Bayes and logistic regression.
- Lecture: SGD implementation of logistic regression, with lazy regularization.
- Stream-and-sort [1 week]
- Lecture: Naive Bayes when data's not in memory.
- Lecture: finding informative phrases (with vocab counts in memory).
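As a rough illustration of the stream-and-sort pattern named above, the following minimal sketch counts Naive Bayes events without keeping a count table in memory. The message format (`Y=label,W=word`), the function names, and the toy documents are all assumptions made for this example, not course materials. The in-memory `sorted()` call stands in for an external sort (such as the Unix `sort` command), which is what would actually allow the data to exceed memory.

```python
import itertools

def emit_messages(docs):
    """Pass 1: stream over (label, text) pairs, emitting one 'key<TAB>1' message per event."""
    for label, text in docs:
        yield f"Y={label}\t1"                    # one message per labeled document
        for word in text.split():
            yield f"Y={label},W={word}\t1"       # one message per (label, word) occurrence

def sum_sorted(messages):
    """Pass 2: after sorting, equal keys are adjacent, so one running total suffices."""
    parsed = (m.rsplit("\t", 1) for m in messages)
    for key, group in itertools.groupby(parsed, key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)

def stream_and_sort_counts(docs):
    # sorted() is a stand-in for an external, disk-based sort (assumption for this sketch).
    return list(sum_sorted(sorted(emit_messages(docs))))

# Hypothetical toy data, for illustration only.
docs = [("sports", "ball game ball"), ("politics", "vote game")]
counts = dict(stream_and_sort_counts(docs))
```

The design point is that only the sort step needs to see all the data; both streaming passes use constant memory regardless of vocabulary size.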
Planned Topics
Draft - subject to change!
- Week 1. Overview of the course, and a review lecture on probabilities.
- Week 2. Streaming learning algorithms.
- Naive Bayes for discrete data.
- A streaming-data implementation of Naive Bayes.
- A streaming-data implementation of Naive Bayes assuming a larger-than-memory feature set, by using the 'stream and sort' pattern.
- Discussion of other streaming learning methods.
- Rocchio
- Perceptron-style algorithms
- Streaming regression?
- Assignment: two implementations of Naive Bayes, one with feature-weights in memory, one purely streaming.
- Week 3. Examples of more complex programs using stream-and-sort.
- Lecture topics:
- Finding informative phrases in a corpus, and finding polar phrases in a corpus.
- Using records and messages to manage a complex dataflow.
- Assignment: phrase-finding and sentiment classification
- Week 4. The map-reduce paradigm and Hadoop.
- Assignment: Hadoop re-implementation of assignments 1/2.
- Week 5. Reducing memory usage with randomized methods.
- Feature hashing and Vowpal Wabbit.
- Bloom filters for counting events.
- Locality-sensitive hashing.
- Assignment: memory-efficient Naive Bayes.
- Weeks 6-7. Nearest-neighbor finding and bulk classification.
- Using a search engine to find approximate nearest neighbors.
- Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
- Implementing soft joins using map-reduce and nearest-neighbor methods.
- The local k-NN graph for a dataset.
- Assignment: Tool for approximate k-NN graph for a large dataset.
- Weeks 8-10. Working with large graphs.
- PageRank and RWR/PPR.
- Special issues involved with iterative processing on graphs in Map-Reduce: the schimmy pattern.
- Formalisms/environments for iterative processing on graphs: GraphLab, Spark, Pregel.
- Extracting small graphs from a large one:
- LocalSpectral - finding the meaningful neighborhood of a query node in a large graph.
- Visualizing graphs.
- Semi-supervised classification on graphs.
- Clustering and community-finding in graphs.
- Assignment: Snowball sampling a graph with LocalSpectral and visualizing the results.
- Week 11. Stochastic gradient descent and other streaming learning algorithms.
- SGD for logistic regression.
- SGD for large feature sets: delayed (lazy) regularization updates; projection onto L1; truncated gradients.
- Assignment: Proposal for a one-month project.
- Weeks 12-15. Additional topics.
- Scalable k-means clustering.
- Gibbs sampling and streaming LDA.
- Stacking and cascaded learning approaches.
- Decision tree learning for large datasets.
- Assignment: Writeup of project results.
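To make the Week 5 feature-hashing topic a little more concrete, here is a minimal sketch of the hashing trick: tokens are mapped directly to indices in a fixed-length vector, so no vocabulary dictionary has to be held in memory. The function name, the bucket count, and the choice of `zlib.crc32` as the hash are assumptions made for this illustration; this is not Vowpal Wabbit's actual hashing scheme.

```python
import zlib

def hash_features(tokens, num_buckets=16):
    """Map a token stream to a fixed-length count vector via hashing."""
    vec = [0] * num_buckets
    for tok in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] += 1          # distinct tokens may collide in the same bucket
    return vec

# Hypothetical toy input, for illustration only.
tokens = "the cat sat on the mat".split()
vec = hash_features(tokens, num_buckets=16)
```

Memory use is fixed by `num_buckets` rather than by the number of distinct features; the price is that colliding tokens share a weight.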