Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012

From Cohen Courses

Revision as of 16:47, 20 October 2011 by Wcohen (talk | contribs) (→‎Schedule)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Jump to navigation Jump to search

Schedule

Overviews [1 week]
- Lecture: Overview of course, cost of various operations
- Lecture: Review of probabilities
Streaming Learning algorithms [1.5 weeks]
- Lecture: Naive Bayes, and a streaming implementation of it (features in memory).
- Lecture: Naive Bayes and logistic regression.
- Lecture: SGD implementation of LogReg, with lazy regularization
Stream-and-sort [1 week]
- Lecture: Naive Bayes when data's not in memory.
- Lecture: finding informative phrases (with vocab counts in memory).

Planned Topics

Draft - subject to change!

Week 1. Overview of course, and overview lecture on probabilities.

Week 2. Streaming learning algorithms.
- Naive Bayes for discrete data.
- A streaming-data implementation of Naive Bayes.
- A streaming-data implementation of Naive Bayes assuming a larger-than-memory feature set, by using the 'stream and sort' pattern.
- Discussion of other streaming learning methods.
  - Rocchio
  - Perceptron-style algorithms
  - Streaming regression?
- Assignment: two implementations of Naive Bayes, one with feature-weights in memory, one purely streaming.

Week 3. Examples of more complex programs using stream-and-sort.
- Lecture topics:
  - Finding informative phrases in a corpus, and finding polar phrases in a corpus.
  - Using records and messages to manage a complex dataflow.
- Assignment: phrase-finding and sentiment classification

Week 4. The map-reduce paradigm and Hadoop.
- Assignment: Hadoop re-implementation of assignments 1/2.

Week 5. Reducing memory usage with randomized methods.
- Feature hashing and Vowpal Wabbit.
- Bloom filters for counting events.
- Locality-sensitive hashing.
- Assignment: memory-efficient Naive Bayes.

Week 6-7. Nearest-neighbor finding and bulk classification.
- Using a search engine to find approximate nearest neighbors.
- Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
- Implementing soft joins using map-reduce and nearest-neighbor methods.
- The local k-NN graph for a dataset.
- Assignment: Tool for approximate k-NN graph for a large dataset.

Week 8-10. Working with large graphs.
- PageRank and RWR/PPR.
- Special issues involved with iterative processing on graphs in Map-Reduce: the schimmy pattern.
  - Formalisms/environments for iterative processing on graphs: GraphLab, Sparks, Pregel.
- Extracting small graphs from a large one:
  - LocalSpectral - finding the meaningful neighborhood of a query node in a large graph.
  - Visualizing graphs.
- Semi-supervised classification on graphs.
- Clustering and community-finding in graphs.
- Assignments: Snowball sampling a graph with LocalSpectral and visualizing the results.

Week 11. Stochastic gradient descent and other streaming learning algorithms.
- SGD for logistic regression.
- Large feature sets SGD: delayed regularization-based updates; projection onto L1; truncated gradients.
- Assignment: Proposal for a one-month project.

Weeks 12-15. Additional topics.
- Scalable k-means clustering.
- Gibbs sampling and streaming LDA.
- Stacking and cascaded learning approaches.
- Decision tree learning for large datasets.
- Assignment: Writeup of project results.

Retrieved from "http://curtis.ml.cmu.edu/w/courses/index.php?title=Syllabus_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2012&oldid=9170"

Navigation menu