Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012

From Cohen Courses
Revision as of 11:47, 14 November 2011 by Wcohen

This is the syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012.

January

  • Tues Jan 17. Overview of course, cost of various operations, asymptotic analysis.
  • Thurs Jan 19. Review of probabilities.
  • Tues Jan 24. Streaming algorithms and Naive Bayes.
    • Assignment: streaming Naive Bayes 1 (with feature counts in memory)
  • Thurs Jan 26. Naive Bayes and logistic regression.
  • Tues Jan 31. Streaming stochastic gradient descent.
    • Assignment: streaming logistic regression (with feature weights in memory)
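
As a rough illustration of the "streaming Naive Bayes with feature counts in memory" idea listed for Jan 24, the sketch below trains a multinomial Naive Bayes classifier in one pass over (label, words) examples. The class name `StreamingNaiveBayes`, the add-alpha smoothing, and the input format are assumptions made for this example, not the assignment's actual specification.

```python
import math
from collections import defaultdict


class StreamingNaiveBayes:
    """Multinomial Naive Bayes trained in a single pass; all counts live in memory."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                     # add-alpha smoothing (an assumed choice)
        self.label_counts = defaultdict(int)   # label -> number of examples
        self.word_counts = defaultdict(int)    # (label, word) -> count
        self.total_words = defaultdict(int)    # label -> total word tokens
        self.vocab = set()                     # every word seen so far

    def update(self, label, words):
        """Consume one example; the example itself is not stored."""
        self.label_counts[label] += 1
        for w in words:
            self.word_counts[(label, w)] += 1
            self.total_words[label] += 1
            self.vocab.add(w)

    def predict(self, words):
        """Return the label with the highest smoothed log-probability."""
        n = sum(self.label_counts.values())
        v = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / n)
            denom = self.total_words[label] + self.alpha * v
            for w in words:
                seen = self.word_counts.get((label, w), 0)
                score += math.log((seen + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Memory use grows with the number of distinct (label, word) pairs rather than with the number of training examples, which is what makes the in-memory version workable until the vocabulary itself becomes too large.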

February

  • Thurs Feb 2. The stream-and-sort design pattern; Naive Bayes.
  • Tues Feb 7. Messages and records; finding informative phrases.
    • Assignment: streaming Naive Bayes 2 (with feature counts on disk)
  • Thurs Feb 9. Map-reduce and Hadoop 1 (Alona lecture).
  • Tues Feb 14. Map-reduce and Hadoop 2 (Alona lecture).
    • Assignment: finding informative phrases
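
The stream-and-sort design pattern named for Feb 2 can be sketched roughly as follows: emit one small message per event, sort the messages by key, then aggregate runs of equal keys in a single streaming pass. The function names are hypothetical, and the in-memory `sorted()` call is only a stand-in for the external, disk-based sort a real pipeline would use.

```python
from itertools import groupby


def emit_messages(lines):
    """Turn each input line into one (key, 1) message per token."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_sorted(messages):
    """Sum counts over runs of equal keys.

    Assumes the messages arrive sorted by key, so only one key's
    running total is held in memory at a time.
    """
    for key, group in groupby(messages, key=lambda kv: kv[0]):
        yield (key, sum(count for _, count in group))


def stream_and_sort_counts(lines):
    # sorted() stands in for an external sort (for example Unix `sort`);
    # that sort is what replaces the in-memory count table.
    return list(reduce_sorted(sorted(emit_messages(lines))))
```

For example, `stream_and_sort_counts(["a b a", "b c"])` yields one total per distinct token, in key order.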

Draft

  • Overviews [1 week]
    • Lecture: Overview of course, cost of various operations, asymptotic analysis
    • Lecture: Review of probabilities
  • Streaming Learning algorithms [2 weeks]
    • Lecture: Naive Bayes, and a streaming implementation of it (features in memory).
      • Assignment: streaming Naive Bayes w/ features in memory
    • Lecture: Naive Bayes and logistic regression.
    • Lecture: SGD implementation of LogReg, with lazy regularization
      • Assignment: streaming LogReg w/ features in memory
    • Lecture: other streaming methods - the perceptron algorithm and Rocchio's algorithm.
  • Stream-and-sort [1.5 weeks]
    • Lecture: Naive Bayes when data's not in memory.
      • Assignment: stream-and-sort Naive Bayes (Twitter emoticon data?)
    • Lecture: finding informative phrases (with vocab counts in memory).
    • Lecture: messages and records; revisit finding informative phrases.
      • Assignment: finding informative phrases (Google books data)
  • Map-reduce and Hadoop [1 week]
    • Lecture: Alona, Map-reduce
    • Lecture: Alona, Hadoop and map-reduce
      • Assignment: finding informative phrases
  • Reducing memory usage with randomized methods [1.5 weeks]
    • Lecture: Locality-sensitive hashing.
    • Lecture: Bloom filters for counting events.
      • Assignment: LSH transformation of datasets
    • Lecture: Vowpal Wabbit and the hashing trick.
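
To make the Bloom filter item in the randomized-methods unit above more concrete, here is a minimal sketch of a set-membership Bloom filter. The class name, the default sizes, and the use of salted SHA-256 as the hash family are all assumptions for illustration; the lecture may use a counting variant or different hash functions.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with k hash probes; may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with the probe index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Memory is fixed at `num_bits` bits no matter how many items are added; the price is a false-positive rate that rises as the array fills up.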

Planned Topics

  • Weeks 6-7. Nearest-neighbor finding and bulk classification.
    • Using a search engine to find approximate nearest neighbors.
    • Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
    • Implementing soft joins using map-reduce and nearest-neighbor methods.
    • The local k-NN graph for a dataset.
    • Assignment: Tool for approximate k-NN graph for a large dataset.
  • Weeks 8-10. Working with large graphs.
    • PageRank and RWR/PPR.
    • Special issues involved with iterative processing on graphs in Map-Reduce: the schimmy pattern.
      • Formalisms/environments for iterative processing on graphs: GraphLab, Spark, Pregel.
    • Extracting small graphs from a large one:
      • LocalSpectral - finding the meaningful neighborhood of a query node in a large graph.
      • Visualizing graphs.
    • Semi-supervised classification on graphs.
    • Clustering and community-finding in graphs.
    • Assignment: Snowball sampling a graph with LocalSpectral and visualizing the results.
  • Week 11. Stochastic gradient descent and other streaming learning algorithms.
    • SGD for logistic regression.
    • SGD with large feature sets: delayed regularization-based updates; projection onto L1; truncated gradients.
    • Assignment: Proposal for a one-month project.
  • Weeks 12-15. Additional topics.
    • Scalable k-means clustering.
    • Gibbs sampling and streaming LDA.
    • Stacking and cascaded learning approaches.
    • Decision tree learning for large datasets.
    • Assignment: Writeup of project results.
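
As one small example from the planned topics above, the PageRank item under "Working with large graphs" can be sketched as a power iteration over an adjacency list. This is an in-memory toy under assumed conventions (damping factor 0.85, dangling-node mass spread uniformly, a fixed iteration count); it says nothing about how the course implements it at scale in Map-Reduce.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration on a dict mapping node -> list of out-neighbors."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by dangling nodes (no out-links) is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank
```

Each iteration touches every edge once, which is why the iterative Map-Reduce formulations mentioned in the list matter once the graph no longer fits in memory.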