Difference between revisions of "Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012"

From Cohen Courses
Line 7: Line 7:
 
* Tues Jan 24. Streaming algorithms and Naive Bayes.
 
* Tues Jan 24. Streaming algorithms and Naive Bayes.
 
** '''Assignment: streaming Naive Bayes 1 (with feature counts in memory)'''
 
** '''Assignment: streaming Naive Bayes 1 (with feature counts in memory)'''
* Thurs Jan 26. Naive Bayes and logistic regression.
+
* Thurs Jan 26. The stream-and-sort design pattern; Naive Bayes revisited.
* Tues Jan 31. Streaming stochastic gradient descent.
+
* Tues Jan 31. Messages and records 1; Phrase finding.
** '''Assignment: streaming logistic regression (with feature weights in memory)'''
+
** '''Assignment: streaming Naive Bayes 2 (with feature counts on disk) with stream-and-sort'''
  
 
== February ==
 
== February ==
  
* Thurs Feb 2. The stream-and-sort design pattern; Naive Bayes.
+
* Thurs Feb 2. Messages and records 2; Phrase finding.
* Tues Feb 7. Messages and records; finding informative phrases.
+
* Tues Feb 7. Other streaming algorithms: voted perceptron, Rocchio; averaging.
** '''Assignment: streaming Naive Bayes 2 (with feature counts on disk) with stream-and-sort'''
+
** '''Assignment: phrase finding with stream-and-sort'''
 
* Thurs Feb 9. Map-reduce and Hadoop 1 (Alona lecture).
 
* Thurs Feb 9. Map-reduce and Hadoop 1 (Alona lecture).
 
* Tues Feb 14. Map-reduce and Hadoop 2. (Alona lecture).
 
* Tues Feb 14. Map-reduce and Hadoop 2. (Alona lecture).
** '''Assignment: finding informative phrases'''
+
** '''Assignment: Naive Bayes with Hadoop'''
* Thurs Feb 16.
+
* Thurs Feb 16. Naive Bayes and Logistic regression.
* Tues Feb 21.
+
* Tues Feb 21. Logistic regression with stochastic gradient descent.
* Thurs Feb 23.
+
** '''Assignment: Phrase-finding with Hadoop'''
* Tues Feb 28.
+
* Thurs Feb 23. Other SGD algorithms; parallelizing SGD.
 +
* Tues Feb 28. Bloom Filters and Locality sensitive hashing 1.
 +
** '''Assignment: memory-efficient SGD'''
  
 
== March ==
 
== March ==
  
* Thurs Mar 1.
+
* Thurs Mar 1. Bloom Filters and Locality sensitive hashing 2.
* Tues Mar 6.
+
* Tues Mar 6. Learning on graphs. PageRank, Harmonic field, RWR.
* Thurs Mar 8.
+
** '''Assignment: mini-project proposals 1.'''
 +
* Thurs Mar 8. Tools and design patterns for graphs (Pregel, GraphLab, Schimmy, ...)
 
* Tues Mar 13. ''no class - spring break.''
 
* Tues Mar 13. ''no class - spring break.''
 
* Thurs Mar 15. ''no class - spring break.''
 
* Thurs Mar 15. ''no class - spring break.''
* Tues Mar 20.
+
* Tues Mar 20. Spectral clustering and PIC.
* Thurs Mar 22.
+
** '''Assignment: Subsampling and visualizing a graph.'''
* Tues Mar 27.
+
* Thurs Mar 22. Gibbs sampling and LDA 1.
* Thurs Mar 29.
+
* Tues Mar 27. Gibbs sampling and LDA 2.
 +
** '''Assignment: mini-project proposals 2.'''
 +
* Thurs Mar 29. KNN classification and inverted indices.
  
 
== April ==
 
== April ==
  
* Tues Apr 3.
+
* Tues Apr 3. Decision trees and random forests 1.
* Thurs Apr 5.
+
* Thurs Apr 5. Decision trees and random forests 2.
* Tues Apr 10.
+
* Tues Apr 10. Soft joins with KNN and inverted indices 1.
* Thurs Apr 12.
+
* Thurs Apr 12. Soft joins with KNN and inverted indices 2.
* Tues Apr 17.
+
* Tues Apr 17. Structured prediction 1.
 
* Thurs Apr 19. ''no class - Carnival''
 
* Thurs Apr 19. ''no class - Carnival''
* Tues Apr 24.
+
* Tues Apr 24. Structured prediction 2.
* Thurs Apr 26.
+
* Thurs Apr 26. Additional topics.
  
 
== May ==
 
== May ==
  
* Tues May 1.
+
* Tues May 1. Project reports.
* Thurs May 3.
+
* Thurs May 3. Project reports.
 
 
== Draft ==
 
 
 
* Overviews [1 week]
 
** Lecture: Overview of course, cost of various operations, asymptotic analysis
 
** Lecture: Review of probabilities
 
* Streaming Learning algorithms [2 weeks]
 
** Lecture: Naive Bayes, and a streaming implementation of it (features in memory).
 
*** '''Assignment: streaming Naive Bayes w/ features in memory'''
 
** Lecture: Naive Bayes and logistic regression.
 
** Lecture: SGD implementation of LogReg, with lazy regularization
 
*** '''Assignment: streaming LogReg w/ features in memory'''
 
** Lecture: other streaming methods - the perceptron algorithm and Rocchio's algorithm.
 
* Stream-and-sort [1.5 week]
 
** Lecture: Naive Bayes when data's not in memory.
 
*** '''Assignment: stream-and-sort Naive Bayes''' (Twitter emoticon data?)
 
** Lecture: finding informative phrases (with vocab counts in memory).
 
** Lecture: messages and records; revisit finding informative phrases.
 
*** '''Assignment: finding informative phrases''' (Google books data)
 
* Map-reduce and Hadoop [1 week]
 
** Lecture: Alona, Map-reduce
 
** Lecture: Alona, Hadoop and map-reduce
 
*** '''Assignment: finding informative phrases'''
 
* Reducing memory usage with randomized methods [1.5 weeks]
 
** Lecture: Locality-sensitive hashing.
 
** Lecture: Bloom filters for counting events.
 
*** '''Assignment: LSH transformation of datasets'''
 
** Lecture: Vowpal Wabbit and the hashing trick.
 
 
 
== Planned Topics ==
 
 
 
* Week 6-7. Nearest-neighbor finding and bulk classification.
 
** Using a search engine to find approximate nearest neighbors.
 
** Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
 
** Implementing soft joins using map-reduce and nearest-neighbor methods.
 
** The local k-NN graph for a dataset.
 
** '''Assignment''': Tool for approximate k-NN graph for a large dataset.
 
 
 
* Week 8-10. Working with large graphs.
 
** PageRank and RWR/PPR.
 
** Special issues involved with iterative processing on graphs in Map-Reduce: the schimmy pattern.
 
*** Formalisms/environments for iterative processing on graphs: GraphLab, Spark, Pregel.
 
** Extracting small graphs from a large one:
 
*** LocalSpectral - finding the meaningful neighborhood of a query node in a large graph.
 
*** Visualizing graphs.
 
** Semi-supervised classification on graphs.
 
** Clustering and community-finding in graphs.
 
** '''Assignments''': Snowball sampling a graph with LocalSpectral and visualizing the results.
 
 
 
* Week 11. Stochastic gradient descent and other streaming learning algorithms.
 
** SGD for logistic regression.
 
** SGD for large feature sets: delayed regularization-based updates; projection onto L1; truncated gradients.
 
** '''Assignment''': Proposal for a one-month project.
 
 
 
* Weeks 12-15. Additional topics.
 
** Scalable k-means clustering.
 
** Gibbs sampling and streaming LDA.
 
** Stacking and cascaded learning approaches.
 
** Decision tree learning for large datasets.
 
** '''Assignment''': Writeup of project results.
 

Revision as of 12:46, 14 November 2011

This is the syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012.

January

  • Tues Jan 17. Overview of course, cost of various operations, asymptotic analysis.
  • Thurs Jan 19. Review of probabilities.
  • Tues Jan 24. Streaming algorithms and Naive Bayes.
    • Assignment: streaming Naive Bayes 1 (with feature counts in memory)
  • Thurs Jan 26. The stream-and-sort design pattern; Naive Bayes revisited.
  • Tues Jan 31. Messages and records 1; Phrase finding.
    • Assignment: streaming Naive Bayes 2 (with feature counts on disk) with stream-and-sort

February

  • Thurs Feb 2. Messages and records 2; Phrase finding.
  • Tues Feb 7. Other streaming algorithms: voted perceptron, Rocchio; averaging.
    • Assignment: phrase finding with stream-and-sort
  • Thurs Feb 9. Map-reduce and Hadoop 1 (Alona lecture).
  • Tues Feb 14. Map-reduce and Hadoop 2. (Alona lecture).
    • Assignment: Naive Bayes with Hadoop
  • Thurs Feb 16. Naive Bayes and Logistic regression.
  • Tues Feb 21. Logistic regression with stochastic gradient descent.
    • Assignment: Phrase-finding with Hadoop
  • Thurs Feb 23. Other SGD algorithms; parallelizing SGD.
  • Tues Feb 28. Bloom Filters and Locality sensitive hashing 1.
    • Assignment: memory-efficient SGD

March

  • Thurs Mar 1. Bloom Filters and Locality sensitive hashing 2.
  • Tues Mar 6. Learning on graphs. PageRank, Harmonic field, RWR.
    • Assignment: mini-project proposals 1.
  • Thurs Mar 8. Tools and design patterns for graphs (Pregel, GraphLab, Schimmy, ...)
  • Tues Mar 13. no class - spring break.
  • Thurs Mar 15. no class - spring break.
  • Tues Mar 20. Spectral clustering and PIC.
    • Assignment: Subsampling and visualizing a graph.
  • Thurs Mar 22. Gibbs sampling and LDA 1.
  • Tues Mar 27. Gibbs sampling and LDA 2.
    • Assignment: mini-project proposals 2.
  • Thurs Mar 29. KNN classification and inverted indices.

April

  • Tues Apr 3. Decision trees and random forests 1.
  • Thurs Apr 5. Decision trees and random forests 2.
  • Tues Apr 10. Soft joins with KNN and inverted indices 1.
  • Thurs Apr 12. Soft joins with KNN and inverted indices 2.
  • Tues Apr 17. Structured prediction 1.
  • Thurs Apr 19. no class - Carnival
  • Tues Apr 24. Structured prediction 2.
  • Thurs Apr 26. Additional topics.

May

  • Tues May 1. Project reports.
  • Thurs May 3. Project reports.