Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012

January

Tues Jan 17. Overview of course, cost of various operations, asymptotic analysis.
Thurs Jan 19. Review of probabilities.
Tues Jan 24. Streaming algorithms and Naive Bayes.
- Assignment: streaming Naive Bayes 1 (with feature counts in memory)
Thurs Jan 26. Naive Bayes and logistic regression.
Tues Jan 31. Streaming stochastic gradient descent.
- Assignment: streaming logistic regression (with feature weights in memory)

Thurs Feb 2. The stream-and-sort design pattern; Naive Bayes.
Tues Feb 7. Messages and records; finding informative phrases.
- Assignment: streaming Naive Bayes 2 (with feature counts on disk) with stream-and-sort
Thurs Feb 9. Map-reduce and Hadoop 1 (Alona lecture).
Tues Feb 14. Map-reduce and Hadoop 2 (Alona lecture).
- Assignment: finding informative phrases
Thurs Feb 16.
Tues Feb 21.
Thurs Feb 23.
Tues Feb 28.

Overviews [1 week]
- Lecture: Overview of course, cost of various operations, asymptotic analysis
- Lecture: Review of probabilities
Streaming learning algorithms [2 weeks]
- Lecture: Naive Bayes, and a streaming implementation of it (features in memory).
  - Assignment: streaming Naive Bayes w/ features in memory
- Lecture: Naive Bayes and logistic regression.
- Lecture: SGD implementation of LogReg, with lazy regularization
  - Assignment: streaming LogReg w/ features in memory
- Lecture: other streaming methods - the perceptron algorithm and Rocchio's algorithm.
Stream-and-sort [1.5 weeks]
- Lecture: Naive Bayes when data's not in memory.
  - Assignment: stream-and-sort Naive Bayes (Twitter emoticon data?)
- Lecture: finding informative phrases (with vocab counts in memory).
- Lecture: messages and records; revisit finding informative phrases.
  - Assignment: finding informative phrases (Google books data)
Map-reduce and Hadoop [1 week]
- Lecture: Alona, Map-reduce
- Lecture: Alona, Hadoop and map-reduce
  - Assignment: finding informative phrases
Reducing memory usage with randomized methods [1.5 weeks]
- Lecture: Locality-sensitive hashing.
- Lecture: Bloom filters for counting events.
  - Assignment: LSH transformation of datasets
- Lecture: Vowpal Wabbit and the hashing trick.

Week 11. Stochastic gradient descent and other streaming learning algorithms.
- SGD for logistic regression.
- SGD for large feature sets: delayed regularization-based updates; projection onto L1; truncated gradients.
- Assignment: Proposal for a one-month project.