Machine Learning with Large Datasets 10-605
Description
Planned Topics
- Week 1. Overview of course, and overview lecture on probabilities.
- Week 2. Streaming learning algorithms.
- Naive Bayes for discrete data.
- A streaming-data implementation of Naive Bayes.
- A streaming-data implementation of Naive Bayes assuming a larger-than-memory feature set, by using the 'stream and sort' pattern. (Generate a stream of updates to an event counter, and then sort to collect all updates to a counter locally.)
- Discussion of other streaming learning methods.
- Week 3. Examples of more complex programs using stream-and-sort.
- Finding informative phrases in a corpus and finding polar phrases in a corpus.
- Using records and messages to manage a complex dataflow.
- Time and disk-access analysis of programs.
- Week 4. The map-reduce paradigm and Hadoop.
- Week 5. Reducing memory usage with randomized methods.
- Feature hashing and Vowpal Wabbit.
- Bloom filters for counting events.
- Locality-sensitive hashing.
- Weeks 6-7. Nearest-neighbor finding and bulk classification.
- Using a search engine to find approximate nearest neighbors.
- Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
- Implementing soft joins using map-reduce and nearest-neighbor methods.
- The local k-NN graph for a dataset.
- Weeks 8-10. Working with large graphs.
- PageRank and RWR/PPR.
- Special issues involved with iterative processing on graphs in Map-Reduce: the schimmy pattern.
- Formalisms/environments for iterative processing on graphs: GraphLab, Spark, Pregel.
- Extracting small graphs from a large one:
- LocalSpectral - finding the meaningful neighborhood of a query node in a large graph.
- Visualizing graphs.
- Semi-supervised classification on graphs.
- Clustering and community-finding in graphs.
- Week 11. Stochastic gradient descent and other streaming learning algorithms.
- Weeks 13-15. Additional topics.
- Scalable k-means clustering.
- Gibbs sampling and streaming LDA.
- Stacking and cascaded learning approaches.
- Decision tree learning for large datasets.
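
As an illustration of the 'stream and sort' pattern from Week 2, here is a minimal sketch in Python, not course material. It emits one update per event, sorts the updates so that all updates to the same counter are adjacent, and then sums each run in a single pass. The counter key format (`Y=label,W=word`), the function names, and the toy documents are assumptions made for this example. The in-memory `sorted()` call stands in for an external disk-based sort, which is what lets the pattern handle a feature set larger than memory.

```python
from itertools import groupby


def emit_updates(docs):
    """Yield one (counter_key, 1) update per event from a stream of (label, tokens) pairs."""
    for label, tokens in docs:
        yield (f"Y={label}", 1)  # one update for the class counter
        for tok in tokens:
            yield (f"Y={label},W={tok}", 1)  # one update for the (class, word) counter


def sum_sorted(updates):
    """Sum adjacent updates that share a key; the input must already be sorted by key."""
    for key, group in groupby(updates, key=lambda kv: kv[0]):
        yield key, sum(delta for _, delta in group)


def stream_and_sort_counts(docs):
    # sorted() is a stand-in (assumption) for an external sort over a file of updates.
    return dict(sum_sorted(sorted(emit_updates(docs))))


# Hypothetical toy data, for illustration only.
docs = [
    ("sports", ["ball", "game"]),
    ("politics", ["vote"]),
    ("sports", ["game"]),
]
counts = stream_and_sort_counts(docs)
```

Only one counter is held in memory at a time during the summing pass, since `groupby` consumes each run of equal keys before moving to the next.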