Difference between revisions of "Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012"

From Cohen Courses
Line 15:
  Thus Feb 2. The stream-and-sort design pattern; Naive Bayes.
  Tues Feb 7. Messages and records; finding informative phrases.
- *** '''Assignment: streaming Naive Bayes 2 (with feature counts on disk)'''
+ * '''Assignment: streaming Naive Bayes 2 (with feature counts on disk)'''
  Thus Feb 9. Map-reduce and Hadoop 1 (Alona lecture).
  Tues Feb 14. Map-reduce and Hadoop 2. (Alona lecture).
- *** '''Assignment: finding informative phrases'''
+ * '''Assignment: finding informative phrases'''

  == Draft ==

Revision as of 11:46, 14 November 2011

This is the syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012.

January

Tues Jan 17. Overview of course, cost of various operations, asymptotic analysis.
Thurs Jan 19. Review of probabilities.
Tues Jan 24. Streaming algorithms and Naive Bayes.

  • Assignment: streaming Naive Bayes 1 (with feature counts in memory)

Thurs Jan 26. Naive Bayes and logistic regression.
Tues Jan 31. Streaming stochastic gradient descent.

  • Assignment: streaming logistic regression (with feature weights in memory)

February

Thurs Feb 2. The stream-and-sort design pattern; Naive Bayes.
Tues Feb 7. Messages and records; finding informative phrases.

  • Assignment: streaming Naive Bayes 2 (with feature counts on disk)

Thurs Feb 9. Map-reduce and Hadoop 1 (Alona lecture).
Tues Feb 14. Map-reduce and Hadoop 2 (Alona lecture).

  • Assignment: finding informative phrases
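
The first streaming Naive Bayes assignment above keeps feature counts in memory. As a rough illustration only, and not the course's reference solution, the idea can be sketched in Python as a single pass that updates counts per example. The class name `StreamingNaiveBayes`, its methods, and the add-one smoothing are all assumptions made for this sketch.

```python
import math
from collections import defaultdict


class StreamingNaiveBayes:
    """Hypothetical sketch: multinomial Naive Bayes with in-memory counts."""

    def __init__(self):
        self.label_counts = defaultdict(int)   # label -> number of examples
        self.word_counts = defaultdict(lambda: defaultdict(int))  # label -> word -> count
        self.total_words = defaultdict(int)    # label -> total tokens seen
        self.vocab = set()                     # all distinct tokens seen

    def update(self, label, tokens):
        """Consume one training example; no example is stored after this call."""
        self.label_counts[label] += 1
        for tok in tokens:
            self.word_counts[label][tok] += 1
            self.total_words[label] += 1
            self.vocab.add(tok)

    def predict(self, tokens):
        """Return the label with the highest smoothed log-probability."""
        n = sum(self.label_counts.values())
        v = len(self.vocab)
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            score = math.log(count / n)
            for tok in tokens:
                c = self.word_counts[label].get(tok, 0)
                # Add-one (Laplace) smoothing, an assumption of this sketch.
                score += math.log((c + 1) / (self.total_words[label] + v))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = StreamingNaiveBayes()
for label, tokens in [("sports", ["ball", "game"]), ("politics", ["vote", "election"])]:
    model.update(label, tokens)
prediction = model.predict(["game"])
```

Only the count tables grow with the data, which is why this version stops working once the vocabulary no longer fits in memory; the second assignment moves those counts to disk.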

Draft

  • Overviews [1 week]
    • Lecture: Overview of course, cost of various operations, asymptotic analysis
    • Lecture: Review of probabilities
  • Streaming Learning algorithms [2 weeks]
    • Lecture: Naive Bayes, and a streaming implementation of it (features in memory).
      • Assignment: streaming Naive Bayes w/ features in memory
    • Lecture: Naive Bayes and logistic regression.
    • Lecture: SGD implementation of LogReg, with lazy regularization
      • Assignment: streaming LogReg w/ features in memory
    • Lecture: other streaming methods - the perceptron algorithm and Rocchio's algorithm.
  • Stream-and-sort [1.5 weeks]
    • Lecture: Naive Bayes when data's not in memory.
      • Assignment: stream-and-sort Naive Bayes (Twitter emoticon data?)
    • Lecture: finding informative phrases (with vocab counts in memory).
    • Lecture: messages and records; revisit finding informative phrases.
      • Assignment: finding informative phrases (Google books data)
  • Map-reduce and Hadoop [1 week]
    • Lecture: Alona, Map-reduce
    • Lecture: Alona, Hadoop and map-reduce
      • Assignment: finding informative phrases
  • Reducing memory usage with randomized methods [1.5 weeks]
    • Lecture: Locality-sensitive hashing.
    • Lecture: Bloom filters for counting events.
      • Assignment: LSH transformation of datasets
    • Lecture: Vowpal Wabbit and the hashing trick.
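
The hashing trick named in the last lecture of this unit can be illustrated with a short sketch. This is not Vowpal Wabbit's implementation; the function name `hashed_features`, the bucket count, and the choice of `zlib.crc32` as the hash are assumptions made for illustration.

```python
import zlib


def hashed_features(tokens, num_buckets=16):
    """Hypothetical sketch: map a bag of tokens to a fixed-length count vector."""
    vec = [0] * num_buckets
    for tok in tokens:
        # crc32 is stable across runs, unlike Python's built-in hash() on str.
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] += 1  # colliding tokens simply share a bucket
    return vec


doc = ["large", "datasets", "need", "small", "memory", "large"]
vec = hashed_features(doc)
```

Memory use depends only on `num_buckets`, not on vocabulary size, at the cost of occasional collisions between unrelated tokens.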

Planned Topics

  • Week 6-7. Nearest-neighbor finding and bulk classification.
    • Using a search engine to find approximate nearest neighbors.
    • Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
    • Implementing soft joins using map-reduce and nearest-neighbor methods.
    • The local k-NN graph for a dataset.
    • Assignment: Tool for approximate k-NN graph for a large dataset.
  • Week 8-10. Working with large graphs.
    • PageRank and RWR/PPR.
    • Special issues involved with iterative processing on graphs in Map-Reduce: the schimmy pattern.
      • Formalisms/environments for iterative processing on graphs: GraphLab, Spark, Pregel.
    • Extracting small graphs from a large one:
      • LocalSpectral - finding the meaningful neighborhood of a query node in a large graph.
      • Visualizing graphs.
    • Semi-supervised classification on graphs.
    • Clustering and community-finding in graphs.
    • Assignments: Snowball sampling a graph with LocalSpectral and visualizing the results.
  • Week 11. Stochastic gradient descent and other streaming learning algorithms.
    • SGD for logistic regression.
    • SGD with large feature sets: delayed regularization-based updates; projection onto L1; truncated gradients.
    • Assignment: Proposal for a one-month project.
  • Weeks 12-15. Additional topics.
    • Scalable k-means clustering.
    • Gibbs sampling and streaming LDA.
    • Stacking and cascaded learning approaches.
    • Decision tree learning for large datasets.
    • Assignment: Writeup of project results.
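
As one illustration of the graph topics planned for Weeks 8-10, PageRank can be sketched as a power iteration over a small in-memory graph. This is a minimal sketch under stated assumptions, not the course's implementation: the function name `pagerank`, the damping value 0.85, the fixed iteration count, and the uniform handling of dangling nodes are all choices made here, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Hypothetical sketch: graph maps each node to a list of nodes it links to."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the teleport share first.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out_links = graph[v]
            if out_links:
                share = damping * rank[v] / len(out_links)
                for w in out_links:
                    new_rank[w] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                share = damping * rank[v] / n
                for w in nodes:
                    new_rank[w] += share
        rank = new_rank
    return rank


graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ranks = pagerank(graph)
```

Each iteration reads the whole edge list once, which is the step that becomes one map-reduce pass per iteration when the graph is too large for memory.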