Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012
This is the syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012.
January
- Tues Jan 17. Overview of course, cost of various operations, asymptotic analysis.
- Thurs Jan 19. Review of probabilities.
- Tues Jan 24. Streaming algorithms and Naive Bayes.
- Assignment: streaming Naive Bayes 1 (with feature counts in memory)
- Thurs Jan 26. Naive Bayes and logistic regression.
- Tues Jan 31. Streaming stochastic gradient descent.
- Assignment: streaming logistic regression (with feature weights in memory)
February
- Thurs Feb 2. The stream-and-sort design pattern; Naive Bayes.
- Tues Feb 7. Messages and records; finding informative phrases.
- Assignment: streaming Naive Bayes 2 (with feature counts on disk) with stream-and-sort
- Thurs Feb 9. Map-reduce and Hadoop 1 (Alona lecture).
- Tues Feb 14. Map-reduce and Hadoop 2 (Alona lecture).
- Assignment: finding informative phrases
- Thurs Feb 16.
- Tues Feb 21.
- Thurs Feb 23.
- Tues Feb 28.
March
- Thurs Mar 1.
- Tues Mar 6.
- Thurs Mar 8.
- Tues Mar 13. no class - spring break.
- Thurs Mar 15. no class - spring break.
- Tues Mar 20.
- Thurs Mar 22.
- Tues Mar 27.
- Thurs Mar 29.
April
- Tues Apr 3.
- Thurs Apr 5.
- Tues Apr 10.
- Thurs Apr 12.
- Tues Apr 17.
- Thurs Apr 19. no class - Carnival
- Tues Apr 24.
- Thurs Apr 26.
May
- Tues May 1.
- Thurs May 3.
Draft
- Overviews [1 week]
- Lecture: Overview of course, cost of various operations, asymptotic analysis
- Lecture: Review of probabilities
- Streaming Learning algorithms [2 weeks]
- Lecture: Naive Bayes, and a streaming implementation of it (features in memory).
- Assignment: streaming Naive Bayes w/ features in memory
- Lecture: Naive Bayes and logistic regression.
- Lecture: SGD implementation of LogReg, with lazy regularization
- Assignment: streaming LogReg w/ features in memory
- Lecture: other streaming methods - the perceptron algorithm and Rocchio's algorithm.
- Stream-and-sort [1.5 weeks]
- Lecture: Naive Bayes when data's not in memory.
- Assignment: stream-and-sort Naive Bayes (Twitter emoticon data?)
- Lecture: finding informative phrases (with vocab counts in memory).
- Lecture: messages and records; revisit finding informative phrases.
- Assignment: finding informative phrases (Google books data)
- Map-reduce and Hadoop [1 week]
- Lecture: Alona, Map-reduce
- Lecture: Alona, Hadoop and map-reduce
- Assignment: finding informative phrases
- Reducing memory usage with randomized methods [1.5 weeks]
- Lecture: Locality-sensitive hashing.
- Lecture: Bloom filters for counting events.
- Assignment: LSH transformation of datasets
- Lecture: Vowpal Wabbit and the hashing trick.
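As a rough illustration of the hashing trick named in the last lecture above, the sketch below maps string features into a fixed number of buckets, so the weight vector has a fixed size no matter how large the vocabulary grows. This is not Vowpal Wabbit's implementation: the function names, the bucket count, and the use of CRC32 as the hash function are all assumptions made for the example.

```python
import zlib


def hash_features(tokens, num_buckets=2 ** 18):
    """Map a bag of string features to a sparse {bucket_index: count} dict.

    Hypothetical helper for illustration; collisions are simply summed.
    """
    vec = {}
    for tok in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in
        # hash() on strings, which is salted per process.
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] = vec.get(idx, 0) + 1
    return vec


def dot(weights, vec):
    """Inner product of a dense weight list with a sparse hashed vector."""
    return sum(weights[i] * v for i, v in vec.items())


# The model is a fixed-size list, independent of vocabulary size.
weights = [0.0] * 2 ** 18
x = hash_features("the quick brown fox jumps over the lazy dog".split())
score = dot(weights, x)
```

The memory cost is set by `num_buckets` rather than by the number of distinct features; the price is that colliding features share a weight.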
Planned Topics
- Weeks 6-7. Nearest-neighbor finding and bulk classification.
- Using a search engine to find approximate nearest neighbors.
- Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
- Implementing soft joins using map-reduce and nearest-neighbor methods.
- The local k-NN graph for a dataset.
- Assignment: Tool for approximate k-NN graph for a large dataset.
- Weeks 8-10. Working with large graphs.
- PageRank and RWR/PPR.
- Special issues involved with iterative processing on graphs in Map-Reduce: the schimmy pattern.
- Formalisms/environments for iterative processing on graphs: GraphLab, Spark, Pregel.
- Extracting small graphs from a large one:
- LocalSpectral - finding the meaningful neighborhood of a query node in a large graph.
- Visualizing graphs.
- Semi-supervised classification on graphs.
- Clustering and community-finding in graphs.
- Assignments: Snowball sampling a graph with LocalSpectral and visualizing the results.
- Week 11. Stochastic gradient descent and other streaming learning algorithms.
- SGD for logistic regression.
- SGD for large feature sets: delayed regularization-based updates; projection onto L1; truncated gradients.
- Assignment: Proposal for a one-month project.
- Weeks 12-15. Additional topics.
- Scalable k-means clustering.
- Gibbs sampling and streaming LDA.
- Stacking and cascaded learning approaches.
- Decision tree learning for large datasets.
- Assignment: Writeup of project results.
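For readers unfamiliar with the PageRank item listed under the graph weeks above, here is a minimal in-memory sketch of the power-iteration form of the algorithm. It is only an illustration, not course material: the function name, the damping value of 0.85, the fixed iteration count, and the toy graph are assumptions, and a graph too large for memory would need the map-reduce style of processing listed in the same block.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [successor, ...]}."""
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every node gets the teleport share first.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                # Split this node's damped rank evenly over its out-links.
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Toy graph (assumed for the example): a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Each pass touches every edge once, which is why the iterative, edge-at-a-time structure is the part that has to be adapted for large graphs.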