Difference between revisions of "Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012"

From Cohen Courses
 
(89 intermediate revisions by 2 users not shown)
== Schedule ==
+
This is the syllabus for [[Machine Learning with Large Datasets 10-605 in Spring 2012]].  '''If you're taking 10-605 now, you're probably looking for the syllabus for  [[Machine Learning with Large Datasets 10-605 in Spring 2013]].'''
  
* Overviews [1 week]
+
== January ==
** Lecture: Overview of course, cost of various operations, asymptotic analysis
 
** Lecture: Review of probabilities
 
* Streaming Learning algorithms [1.5 weeks]
 
** Lecture: Naive Bayes, and a streaming implementation of it (features in memory).
 
** Lecture: Naive Bayes and logistic regression.
 
** Lecture: SGD implementation of LogReg, with lazy regularization
 
* Stream-and-sort [1.5 weeks]
 
** Lecture: Naive Bayes when data's not in memory.
 
** Lecture: finding informative phrases (with vocab counts in memory).
 
** Lecture: messages and records; revisit finding informative phrases.
 
* Map-reduce and Hadoop [1 week]
 
** Lecture: Alona, using Hadoop
 
** Lecture: Alona, programming tips
 
  
== Planned Topics ==
+
* Tues Jan 17. [[Class meeting for 10-605 2012 01 17|Overview of course, cost of various operations, asymptotic analysis.]]
 +
* Thurs Jan 19. [[Class meeting for 10-605 2012 01 19|Review of probabilities.]]
 +
* Tues Jan 24. [[Class meeting for 10-605 2012 01 24|Streaming algorithms and Naive Bayes.]]
 +
** ''New Assignment: streaming Naive Bayes 1 (with feature counts in memory)''. [http://www.cs.cmu.edu/~wcohen/10-605/assignments/hashtable-nb.pdf PDF Handout]
 +
* Thurs Jan 26. [[Class meeting for 10-605 2012 01 26|The stream-and-sort design pattern; Naive Bayes revisited.]]
 +
* Tues Jan 31. [[Class meeting for 10-605 2012 01 31|Messages and records 1; Phrase finding.]]
 +
** '''Assignment due: streaming Naive Bayes 1 (with feature counts in memory)'''. 
 +
** ''New Assignment: streaming Naive Bayes 2 (with feature counts on disk) with stream-and-sort''. [http://www.cs.cmu.edu/~wcohen/10-605/assignments/stream-nb.pdf PDF Handout]
  
''Draft - subject to change!''
+
== February ==
  
* Week 1. Overview of course, and overview lecture on probabilities.
+
* Thurs Feb 2. [[Class meeting for 10-605 2012 02 02|More on streaming algorithms: Rocchio, and theory of on-line learning]]
 +
* Tues Feb 7. [[Class meeting for 10-605 2012 02 07|More on streaming algorithms: parallelized voted perceptrons.]]
 +
** '''Assignment due: streaming Naive Bayes 2 (with feature counts on disk) with stream-and-sort'''
 +
** ''New Assignment: phrase finding with stream-and-sort''. [http://www.cs.cmu.edu/~wcohen/10-605/assignments/phrases.pdf PDF Handout]
 +
* Thurs Feb 9. [[Class meeting for 10-605 2012 02 09|Map-reduce and Hadoop 1 (Alona lecture)]].
 +
* Tues Feb 14.  [[Class meeting for 10-605 2012 02 14|Map-reduce and Hadoop 2. (Alona lecture, William is closer)]].
 +
** '''Assignment due 2/15: phrase finding with stream-and-sort'''
 +
** ''New Assignment: Naive Bayes with Hadoop & Phrase-finding with Hadoop''  [http://www.cs.cmu.edu/~afyshe/Assignment4.pdf PDF Handout]
 +
* Thurs Feb 16. [[Class meeting for 10-605 2012 02 16|Hadoop helpers and Scalable SGD]]
 +
* Tues Feb 21. [[Class meeting for 10-605 2012 02 21|Scalable SGD and Hash Kernels]]
 +
* Thurs Feb 23. ''Guest lecture'': [http://www.cs.umass.edu/~ronb/ Ron Bekkerman], LinkedIn, Scaling up Machine Learning
 +
** [http://www.cs.cmu.edu/~wcohen/10-605/2012-02-23-bekkerman.pptx Ron's slides in Powerpoint]
 +
** [http://www.cs.cmu.edu/~wcohen/10-605/2012-02-23-bekkerman.pdf Ron's slides in PDF]
 +
* Tues Feb 28. [[Class meeting for 10-605 2012 02 28|Background on randomized algorithms; Graph computations 1.]]
  
* Week 2. Streaming learning algorithms.
+
== March ==
** Naive Bayes for discrete data. 
 
** A streaming-data implementation of Naive Bayes.
 
** A streaming-data implementation of Naive Bayes assuming a larger-than-memory feature set, by using the 'stream and sort' pattern.
 
** Discussion of other streaming learning methods.
 
*** Rocchio
 
*** Perceptron-style algorithms
 
*** Streaming regression?
 
** '''Assignment''': two implementations of Naive Bayes, one with feature-weights in memory, one purely streaming.
 
  
* Week 3. Examples of more complex programs using stream-and-sort.
+
* Thurs Mar 1. ''Guest Lecture'': Ben van Durme, JHU, Randomized Algorithms for Large-Scale Learning
** Lecture topics:
+
* Tues Mar 6. [[Class meeting for 10-605 2012 03 06|Learning on graphs 2]].
*** Finding informative phrases in a corpus, and finding polar phrases in a corpus.
+
** '''Hadoop assignments due'''
*** Using records and messages to manage a complex dataflow.
+
** ''New Assignment: memory-efficient SGD'' [http://www.cs.cmu.edu/~wcohen/10-605/assignments/sgd.pdf PDF writeup]
** '''Assignment''': phrase-finding and sentiment classification
+
** ''New assignment: initial project proposals.'' [http://www.cs.cmu.edu/~wcohen/10-605/assignments/initial-project-proposal.pdf PDF writeup]
 +
* Thurs Mar 8. ''Guest Lecture'': Joey Gonzalez, CMU, GraphLab and Dynamic Asynchronous Computation [http://www.cs.cmu.edu/~jegonzal/talks/biglearning_with_graphs.pptx PPT slides]
 +
* Tues Mar 13. ''no class - spring break.''
 +
* Thurs Mar 15. ''no class - spring break.''
 +
* Tues Mar 20. [[Class meeting for 10-605 2012 03 20|Subsampling a graph with RWR]]
 +
** '''Assignment due: initial mini-project proposals.'''
 +
** '''Assignment due: memory-efficient SGD'''
 +
** ''New Assignment: Subsampling and visualizing a graph.'' [http://www.cs.cmu.edu/~wcohen/10-605/assignments/snowball.pdf PDF writeup]
 +
* Thurs Mar 22. [[Class meeting for 10-605 2012 03 22|Semi-supervised learning via label propagation on graphs]]
 +
* Tues Mar 27. [[Class meeting for 10-605 2012 03 27|Label propagation 2: Unsupervised label propagation, label propagation as optimization, bipartite graphs]]
 +
** '''Assignment due: Subsampling and visualizing a graph.'''
 +
** ''New Assignment: mini-project proposals (final version)''
 +
* Thurs Mar 29. [[Class meeting for 10-605 2012 03 29|Understanding spectral clustering techniques.]]
 +
** '''Assignment due: mini-project proposals (final version).'''
  
* Week 4. The map-reduce paradigm and Hadoop.
+
== April ==
** '''Assignment''': Hadoop re-implementation of assignments 1/2.
 
  
* Week 5. Reducing memory usage with randomized methods.
+
* Tues Apr 3. [[Class meeting for 10-605 2012 04 03|LDA-like models for text and graphs]]; guest lecture from Partha Talukdar
** Feature hashing and Vowpal Wabbit.
+
* Thurs Apr 5. Tentative: Guest lecture by U Kang, CMU.
** Bloom filters for counting events.
+
* Tues Apr 10. [[Class meeting for 10-605 2012 04 10|Speeding up LDA-like models: sampling and parallelization]]
** Locality-sensitive hashing.
+
* Thurs Apr 12. [[Class meeting for 10-605 2012 04 12|Fast KNN and similarity joins 1.]]
** '''Assignment''': memory-efficient Naive Bayes.
+
* Tues Apr 17. [[Class meeting for 10-605 2012 04 17|Fast KNN and similarity joins 2.]]
 +
* Thurs Apr 19. ''no class - Carnival''
 +
* Tues Apr 24. [[Class meeting for 10-605 2012 04 14|SGD for matrix factorization and online LDA]]
 +
* Thurs Apr 26. [[Class meeting for 10-605 2012 04 16|Scaling up decision tree learning]]
  
* Weeks 6-7. Nearest-neighbor finding and bulk classification.
+
== May ==
** Using a search engine to find approximate nearest neighbors.
 
** Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
 
** Implementing soft joins using map-reduce and nearest-neighbor methods.
 
** The local k-NN graph for a dataset.
 
** '''Assignment''': Tool for approximate k-NN graph for a large dataset.
 
  
* Weeks 8-10. Working with large graphs.
+
* Tues May 1. Project reports.
** PageRank and RWR/PPR.
+
* Thurs May 3. Project reports.
** Special issues involved with iterative processing on graphs in Map-Reduce: the schimmy pattern.
+
* Fri May 4.  
*** Formalisms/environments for iterative processing on graphs: GraphLab, Spark, Pregel.
+
** '''Project writeups due at 5:00pm'''.  Submit a paper to Blackboard in PDF in the [http://icml.cc/2012/author-instructions/ ICML 2012 format] (up to 8pp double column), except, of course, do not submit it anonymously.
** Extracting small graphs from a large one:
 
*** LocalSpectral - finding the meaningful neighborhood of a query node in a large graph.
 
*** Visualizing graphs.
 
** Semi-supervised classification on graphs.
 
** Clustering and community-finding in graphs.
 
** '''Assignments''': Snowball sampling a graph with LocalSpectral and visualizing the results.
 
 
 
* Week 11. Stochastic gradient descent and other streaming learning algorithms.
 
** SGD for logistic regression.
 
** SGD for large feature sets: delayed regularization-based updates; projection onto L1; truncated gradients.
 
** '''Assignment''': Proposal for a one-month project.
 
 
 
* Weeks 12-15. Additional topics.
 
** Scalable k-means clustering.
 
** Gibbs sampling and streaming LDA.
 
** Stacking and cascaded learning approaches.
 
** Decision tree learning for large datasets.
 
** '''Assignment''': Writeup of project results.
 

Latest revision as of 09:48, 28 March 2013

This is the syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012. If you're taking 10-605 now, you're probably looking for the syllabus for Machine Learning with Large Datasets 10-605 in Spring 2013.

January

February

March

April

May

  • Tues May 1. Project reports.
  • Thurs May 3. Project reports.
  • Fri May 4.
    • Project writeups due at 5:00pm. Submit a paper to Blackboard in PDF in the ICML 2012 format (up to 8pp double column), except, of course, do not submit it anonymously.