Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012

From Cohen Courses
Revision as of 16:44, 20 October 2011 by Wcohen

Schedule

  • Overviews [1 week]
    • Lecture: Overview of course, cost of various operations
    • Lecture: Review of probabilities
  • Streaming Learning algorithms [2 weeks]
    • Lecture: Naive Bayes, and a streaming implementation of it (features in memory).
    • Lecture: Naive Bayes and logistic regression.
    • Lecture: An SGD implementation of logistic regression, with lazy regularization.
  • Stream-and-sort.
    • Lecture: Naive Bayes when the data's not in memory; Rocchio when the data's not in memory?
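The "lazy regularization" trick mentioned for the SGD logistic-regression lecture can be sketched roughly as follows. This is a toy in-memory version under my own assumptions (L2 penalty, dict-based sparse feature vectors, fixed learning rate), not the course's reference implementation: each weight's accumulated regularization decay is applied only when its feature next occurs, so an update touches only the active features of an example.

```python
import math
from collections import defaultdict

def sgd_logreg_lazy(examples, n_epochs=5, lr=0.1, l2=1e-4):
    """SGD for logistic regression with lazy L2 regularization.

    examples: iterable of (x, y) with x a dict feature->value, y in {0,1}.
    Instead of decaying every weight on every step, we record when each
    weight was last touched and "catch up" its decay on demand.
    """
    w = defaultdict(float)   # feature -> weight
    last = defaultdict(int)  # feature -> step at which weight was last updated
    t = 0
    for _ in range(n_epochs):
        for x, y in examples:
            t += 1
            for f in x:
                # apply the decay this weight missed since its last update
                w[f] *= (1.0 - lr * l2) ** (t - last[f])
                last[f] = t
            p = 1.0 / (1.0 + math.exp(-sum(w[f] * v for f, v in x.items())))
            for f, v in x.items():
                w[f] += lr * (y - p) * v   # sparse gradient step
    return dict(w)
```

With one positive feature and one negative feature the learned signs come out as expected, while the cost per example stays proportional to the number of active features rather than the vocabulary size.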

Planned Topics

Draft - subject to change!

  • Week 1. Overview of course, and overview lecture on probabilities.
  • Week 2. Streaming learning algorithms.
    • Naive Bayes for discrete data.
    • A streaming-data implementation of Naive Bayes.
    • A streaming-data implementation of Naive Bayes assuming a larger-than-memory feature set, by using the 'stream and sort' pattern.
    • Discussion of other streaming learning methods.
      • Rocchio
      • Perceptron-style algorithms
      • Streaming regression?
    • Assignment: two implementations of Naive Bayes, one with feature-weights in memory, one purely streaming.
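The "stream and sort" pattern behind the purely streaming Naive Bayes implementation can be sketched in a few lines. This is my own toy version (the key format and helper names are assumptions, not the course's): stream over the examples emitting one small counter message per event, sort the messages so identical keys are adjacent, then sum each run — at no point is the whole model held in memory.

```python
from itertools import groupby

def nb_counts_stream_and_sort(examples):
    """Stream-and-sort counting for Naive Bayes.

    examples: iterable of (label, words).
    Phase 1 streams and emits (key, 1) messages; phase 2 sorts them
    (standing in for Unix sort on disk); phase 3 sums runs of equal keys.
    """
    def emit(examples):
        for label, words in examples:
            yield ("Y=" + label, 1)                    # class count event
            for w in words:
                yield ("Y=%s,W=%s" % (label, w), 1)    # class,word count event
    messages = sorted(emit(examples))                  # identical keys now adjacent
    return {key: sum(c for _, c in grp)
            for key, grp in groupby(messages, key=lambda kv: kv[0])}
```

Because the reduce step only ever looks at one run of identical keys at a time, the same logic works when the sorted message stream lives on disk and is far larger than memory.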
  • Week 3. Examples of more complex programs using stream-and-sort.
    • Lecture topics:
      • Finding informative phrases in a corpus, and finding polar phrases in a corpus.
      • Using records and messages to manage a complex dataflow.
    • Assignment: phrase-finding and sentiment classification
  • Week 4. The map-reduce paradigm and Hadoop.
    • Assignment: Hadoop re-implementation of assignments 1/2.
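A Hadoop re-implementation of the counting assignments reduces to a map function and a reduce function over tab-separated key/value lines. Here is a minimal word-count-shaped sketch in the style of Hadoop Streaming (plain Python; the function names are mine, and a real job would read stdin and write stdout):

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit one tab-separated (token, 1) record per token."""
    for line in lines:
        for w in line.split():
            yield "%s\t1" % w

def reducer(sorted_records):
    """Reduce step: sum the counts in each run of identical keys.
    Hadoop guarantees the reducer sees records sorted/grouped by key."""
    parsed = (r.split("\t") for r in sorted_records)
    for key, grp in groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(c) for _, c in grp))
```

The sort between map and reduce is exactly the "sort" of stream-and-sort, which is why assignments 1/2 port over so directly.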
  • Week 5. Reducing memory usage with randomized methods.
    • Feature hashing and Vowpal Wabbit.
    • Bloom filters for counting events.
    • Locality-sensitive hashing.
    • Assignment: memory-efficient Naive Bayes.
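The feature-hashing idea (as used in Vowpal Wabbit) can be sketched as follows. This is a simplified illustration under my own assumptions: tokens are hashed straight to indices in a fixed-size weight vector, so no feature dictionary is ever stored, and colliding features simply share a slot.

```python
import zlib

def hash_features(tokens, n_bits=20):
    """Hash tokens into a 2**n_bits-dimensional sparse vector.

    zlib.crc32 stands in for whatever hash a real system uses.
    Collisions are tolerated: colliding tokens add into one index,
    trading a little accuracy for a fixed memory footprint.
    """
    mask = (1 << n_bits) - 1
    x = {}
    for t in tokens:
        i = zlib.crc32(t.encode("utf8")) & mask
        x[i] = x.get(i, 0) + 1
    return x
```

A learner then indexes its weight array with these integers directly; memory is bounded by 2^n_bits regardless of vocabulary size.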
  • Week 6-7. Nearest-neighbor finding and bulk classification.
    • Using a search engine to find approximate nearest neighbors.
    • Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
    • Implementing soft joins using map-reduce and nearest-neighbor methods.
    • The local k-NN graph for a dataset.
    • Assignment: Tool for approximate k-NN graph for a large dataset.
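The inverted-index approach to approximate nearest neighbors can be sketched like this (a toy in-memory version with names of my own choosing; real corpora would keep the index on disk or in a search engine). Only documents sharing at least one term with the query are ever scored, which is what makes the search approximate but cheap.

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> list of ids of documents containing it."""
    idx = defaultdict(list)
    for doc_id, terms in docs.items():
        for t in set(terms):
            idx[t].append(doc_id)
    return idx

def approx_neighbors(query_terms, idx, k=3):
    """Score candidates by shared-term count and return the top k.
    Documents with no term in common with the query are never touched."""
    scores = defaultdict(int)
    for t in set(query_terms):
        for d in idx.get(t, ()):
            scores[d] += 1
    return sorted(scores, key=lambda d: -scores[d])[:k]
```

Running this once per document (with the document itself as the query) gives a crude local k-NN graph of the dataset.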
  • Week 8-10. Working with large graphs.
    • PageRank and random walk with restart / personalized PageRank (RWR/PPR).
    • Special issues involved with iterative processing on graphs in Map-Reduce: the schimmy pattern.
      • Formalisms/environments for iterative processing on graphs: GraphLab, Spark, Pregel.
    • Extracting small graphs from a large one:
      • LocalSpectral - finding the meaningful neighborhood of a query node in a large graph.
      • Visualizing graphs.
    • Semi-supervised classification on graphs.
    • Clustering and community-finding in graphs.
    • Assignment: snowball-sampling a graph with LocalSpectral and visualizing the results.
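PageRank itself is a short power iteration. Below is an in-memory sketch with parameter choices of my own (damping 0.85, uniform teleport); the course versions would run each iteration as a map-reduce pass over the edge list instead of looping over a dict:

```python
def pagerank(out_links, d=0.85, iters=50):
    """Power-iteration PageRank.

    out_links: node -> list of successor nodes.
    Each round, every node keeps (1-d)/n teleport mass and receives
    a d-weighted share of each predecessor's current score.
    """
    nodes = list(out_links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - d) / n for u in nodes}
        for u, succs in out_links.items():
            if succs:
                share = d * pr[u] / len(succs)
                for v in succs:
                    nxt[v] += share
            else:
                # dangling node: spread its mass uniformly
                for v in nodes:
                    nxt[v] += d * pr[u] / n
        pr = nxt
    return pr
```

RWR/PPR is the same iteration with the teleport mass (1-d) concentrated on a single query node rather than spread uniformly, which is also the starting point for LocalSpectral-style neighborhood finding.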
  • Week 11. Stochastic gradient descent and other streaming learning algorithms.
    • SGD for logistic regression.
    • SGD with large feature sets: delayed (lazy) regularization updates; projection onto the L1-ball; truncated gradients.
    • Assignment: Proposal for a one-month project.
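Of the sparsity tricks listed above, the truncated-gradient idea is easy to sketch: after each gradient step, shrink small weights toward zero and drop the ones that reach it, leaving weights above a threshold untouched. This is a simplified single-step illustration with parameter names of my own, not a full reference implementation.

```python
def truncate(w, lr=0.1, g_rate=1.0, theta=0.5):
    """One truncation pass over a sparse weight dict.

    Weights with |v| <= theta are shrunk toward zero by lr*g_rate
    (never crossing zero); weights that hit zero are dropped from
    the dict, which is where the memory savings come from.
    """
    out = {}
    for f, v in w.items():
        if 0 < v <= theta:
            v = max(0.0, v - lr * g_rate)
        elif -theta <= v < 0:
            v = min(0.0, v + lr * g_rate)
        if v != 0.0:
            out[f] = v
    return out
```

Interleaving this with SGD updates keeps the active feature set small even when the nominal feature space is enormous.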
  • Weeks 12-15. Additional topics.
    • Scalable k-means clustering.
    • Gibbs sampling and streaming LDA.
    • Stacking and cascaded learning approaches.
    • Decision tree learning for large datasets.
    • Assignment: Writeup of project results.