Difference between revisions of "Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014"

From Cohen Courses
 
(86 intermediate revisions by 2 users not shown)
Line 1: Line 1:
This is the syllabus for [[Machine Learning with Large Datasets 10-605 in Spring 2014]].
+
This is the syllabus for [[Machine Learning with Large Datasets 10-605 in Spring 2014]]
 +
 
 +
Notes:
 +
* The assignments are from 2013, and will be modified over the course of the semester - some may be changed substantially.
 +
* Lecture notes will be posted around the time of the lectures.
  
 
== January ==
 
== January ==
  
 
* Mon Jan 13. [[Class meeting for 10-605 Overview|Overview of course, cost of various operations, asymptotic analysis.]]
 
* Mon Jan 13. [[Class meeting for 10-605 Overview|Overview of course, cost of various operations, asymptotic analysis.]]
* Wed Jan 15. [[Class meeting for 10-605 Probability Review|Review of probabilities.]]
+
* Wed Jan 15. [[Class meeting for 10-605 Probability Review|Review of probabilities, joint distributions and naive Bayes]]
* Mon Jan 20. '''No class for Martin Luther King Day.''
+
* Mon Jan 20. ''No class - Martin Luther King Day.''
 
* Wed Jan 22. [[Class meeting for 10-605 Streaming Naive Bayes|Streaming algorithms and Naive Bayes; The stream-and-sort design pattern; Naive Bayes for large feature sets.]]
 
* Wed Jan 22. [[Class meeting for 10-605 Streaming Naive Bayes|Streaming algorithms and Naive Bayes; The stream-and-sort design pattern; Naive Bayes for large feature sets.]]
** ''New Assignment: streaming Naive Bayes 1 (with feature counts in memory)''. [http://www.cs.cmu.edu/~wcohen/10-605/assignments/hashtable-nb.pdf PDF Handout]
+
** ''New Assignment: streaming Naive Bayes 1 (with feature counts in memory)''. [http://curtis.ml.cmu.edu/w/courses/images/6/6d/Hashtable-nb.pdf PDF Handout]
 
* Mon Jan 27. [[Class meeting for 10-605 Phase Finding|Messages and records 1; Phrase finding.]]
 
* Mon Jan 27. [[Class meeting for 10-605 Phase Finding|Messages and records 1; Phrase finding.]]
 
** '''Assignment due: streaming Naive Bayes 1 (with feature counts in memory)'''.   
 
** '''Assignment due: streaming Naive Bayes 1 (with feature counts in memory)'''.   
** ''New Assignment: streaming Naive Bayes 2 (with feature counts on disk) with stream-and-sort''. [http://www.cs.cmu.edu/~wcohen/10-605/assignments/stream-nb.pdf PDF Handout]
+
* Wed Jan 29. [[Class meeting for 10-605 Rocchio and On-line Learning|Phrase Finding and Rocchio]]
* Wed Jan 29. [[Class meeting for 10-605 Rocchio and On-line Learning|More on streaming algorithms: Rocchio, and theory of on-line learning]]
+
** ''New Assignment: streaming Naive Bayes 2 (with feature counts on disk) with stream-and-sort''. [http://curtis.ml.cmu.edu/w/courses/images/0/0d/Stream-nb.pdf PDF Handout]
 +
* Thursday Jan 30. Scheduled '''down-time for the wiki host'''.  (Obviously, it's up again now!)
  
 
== February ==
 
== February ==
  
* Mon Feb 3. [[Class meeting for 10-605 Parallel Perceptrons|More on streaming algorithms: parallelized voted perceptrons.]]
+
* Mon Feb 3. [[Class meeting for 10-605 Parallel Perceptrons|Rocchio and Parallel Perceptrons]]
 +
* Wed Feb 5. [[Class meeting for 10-605 Hadoop 1|Perceptrons/Map-reduce and Hadoop]].
 
** '''Assignment due: streaming Naive Bayes 2 (with feature counts on disk) with stream-and-sort'''
 
** '''Assignment due: streaming Naive Bayes 2 (with feature counts on disk) with stream-and-sort'''
** ''New Assignment: phrase finding with stream-and-sort''. [http://www.cs.cmu.edu/~wcohen/10-605/assignments/phrases.pdf PDF Handout]
+
** ''New Assignment: phrase finding with stream-and-sort''. [http://curtis.ml.cmu.edu/w/courses/images/5/5e/Phrases.pdf PDF Handout]
* Wed Feb 5. [[Class meeting for 10-605 Hadoop 1|Map-reduce and Hadoop 1]].
+
* Mon Feb 10. [[Class meeting for 10-605 Parallel Perceptrons 2|Parallel Perceptrons]].
* Mon Feb 10. [[Class meeting for 10-605 Hadoop 2|Map-reduce and Hadoop 2]].
+
* Wed Feb 12. ''Guest lecture: Matt Hurst, Microsoft/Bing: Local Search at Bing.'' One-on-one meetings with Matt can be scheduled for Thursday 2/13: morning meetings between 9 and 12 in Gates-Hillman 6501, afternoon meetings 12:30-1:30pm in '''Gates-Hillman 6002'''.
* Wed Feb 12. [[Class meeting for 10-605 Hadoop Helpers and SGD|Hadoop helpers and Scalable SGD]]
+
* Mon Feb 17. [[Class meeting for 10-605 SGD and Hash Kernels|Scalable SGD and Hash Kernels]]
 
** '''Assignment due: phrase finding with stream-and-sort'''
 
** '''Assignment due: phrase finding with stream-and-sort'''
** ''New Assignments: Naive Bayes with Hadoop & Phrase-finding with Hadoop''. [http://www.cs.cmu.edu/~wcohen/10-605/assignments/hadoop.pdf PDF Handout]  
+
** ''New Assignments: Naive Bayes with Streaming Hadoop,  Naive Bayes with Hadoop & Phrase-finding with Hadoop''. [http://curtis.ml.cmu.edu/w/courses/images/c/c0/Homework4a.pdf PDF Handout (4a)][http://curtis.ml.cmu.edu/w/courses/images/a/a2/Homework4b.pdf PDF Handout (4b)][http://curtis.ml.cmu.edu/w/courses/images/3/30/Homework4c.pdf PDF Handout (4c)]
* Mon Feb 17. [[Class meeting for 10-605 SGD and Hash Kernels|Scalable SGD and Hash Kernels]]
+
* Wed Feb 19. [[Class meeting for 10-605 SGD for MF|Matrix Factorization and SGD, plus another Hadoop demo]]
* Wed Feb 19. [[Class meeting for 10-605 SGD for MF|Matrix Factorization and SGD]]
+
* Fri Feb 21. ''Nothing due - the streaming run for Naive Bayes, 4(a), has been postponed till Monday.''
** '''Streaming run on Hadoop of Naive Bayes due''' - checkpoint
+
* Mon Feb 24. [[Class meeting for 10-605 SGD for MF 2 and Randomized Algorithms|SGD for Matrix Factorization, and Randomized Algorithms 1 (Bloom Filters)]]
* Mon Feb 24. [[Class meeting for 10-605 Randomized Algorithms and Graphs 1|Background on randomized algorithms; Graph computations 1.]]
+
** '''Streaming run on Hadoop of Naive Bayes due'''  
* Wed Feb 26. [[Class meeting for 10-605 Graphs 2|Graph computations 2]]
+
* Wed Feb 26. [[Class meeting for 10-605 Graphs 2|Randomized Algorithms]]
** '''Hadoop assignment (Naive Bayes) due'''
+
* Fri Feb 28.
 +
** '''Non-streaming run on Hadoop of Naive Bayes due.'''
  
 
== March  ==
 
== March  ==
  
* Mon Mar 3. "Guest Lecture: Garth Gibson, topic TBA"
+
* Mon Mar 3. ''Guest Lecture: Garth Gibson, Cloud Computing and Programming Paradigms''
* Wed Mar 4. ''Guest lecture: Alex Beutel, SGD on Hadoop"
+
** Slides: [http://www.cs.cmu.edu/~wcohen/10-605/garth-Intro.pptx Intro], [http://www.cs.cmu.edu/~wcohen/10-605/garth-MapReduce_majd.pdf Mapreduce], [http://www.cs.cmu.edu/~wcohen/10-605/garth-Programming.pptx Programming], [http://www.cs.cmu.edu/~wcohen/10-605/garth-UseCases.pptx Use Cases]
 +
* Wed Mar 5. ''Guest lecture: Alex Beutel, SGD on Hadoop''
 +
** [http://www.cs.cmu.edu/~wcohen/10-605/alex-beutel.pptx Slides]
 +
* Fri Mar 7.
 
** '''Hadoop assignment (phrase-finding) due'''
 
** '''Hadoop assignment (phrase-finding) due'''
** ''New Assignment: memory-efficient SGD'' [http://www.cs.cmu.edu/~wcohen/10-605/assignments/sgd.pdf PDF writeup]
 
** ''New assignment: initial project proposals.'' [http://www.cs.cmu.edu/~wcohen/10-605/assignments/initial-project-proposal.pdf PDF writeup]
 
 
* Mon Mar 10. ''no class - spring break.''
 
* Mon Mar 10. ''no class - spring break.''
 
* Wed Mar 12. ''no class - spring break.''
 
* Wed Mar 12. ''no class - spring break.''
* Mon Mar 17. [[Class meeting for 10-605 Subsample A Graph|Subsampling a graph with RWR]]
+
* Mon Mar 17. [[Class meeting for 10-605 Subsample A Graph|Scalable PageRank]]
* Wed Mar 20. [[Class meeting for 10-605 SSL via LP 1|Semi-supervised learning via label propagation on graphs]]
+
** ''New Assignment: memory-efficient SGD'' [http://curtis.ml.cmu.edu/w/courses/images/0/08/Sgd.pdf PDF handout]
 +
* Wed Mar 19. [[Class meeting for 10-605 Subsampling Graphs|Subsampling a graph with RWR]]
 +
* Mon Mar 24. [[Class meeting for 10-605 SSL on Graphs|Subsampling continued and SSL on Graphs]]
 +
* Wed Mar 26. [[Class meeting for 10-605 Spectral Clustering|Scalable spectral clustering techniques.]]
 +
** <strike>Assignment due: memory-efficient SGD</strike> delayed to Mon 3/31
 +
* Mon Mar 31. [[Class meeting for 10-605 LDA 1|Sparse sampling and parallelization for LDA]]
 
** '''Assignment due: memory-efficient SGD'''
 
** '''Assignment due: memory-efficient SGD'''
** ''New Assignment: Subsampling and visualizing a graph.'' [http://www.cs.cmu.edu/~wcohen/10-605/assignments/snowball.pdf PDF writeup]
+
** ''New Assignment: Subsampling and visualizing a graph.'' [http://curtis.ml.cmu.edu/w/courses/images/e/eb/ApproxPageRank.pdf PDF handout]
* Mon Mar 25. [[Class meeting for 10-605 SSL LP 2|Label propagation 2: Unsupervised label propagation, label propagation as optimization, bipartite graphs]]
 
* Wed Mar 27. [[Class meeting for 10-605 Spectral Clustering|Understanding spectral clustering techniques.]]
 
  
 
== April and May ==
 
== April and May ==
  
* Mon Apr 1. [[Class meeting for 10-605 2013 04 01|Speeding up LDA-like models: sparse sampling and parallelization]]
+
* Wed Apr 2. [[Class meeting for 10-605 2013 LDA 2|Speeding up LDA-like models: All-reduce and online LDA]]
* Wed Apr 3. [[Class meeting for 10-605 2013 04 03|Speeding up LDA-like models: All-reduce and online LDA]]
+
* Mon Apr 7. [[Class meeting for 10-605 PIG|Workflows in PIG]]
 +
* Wed Apr 9. [[Class meeting for 10-605 Similarity Joins|Fast KNN and similarity joins]]
 +
* Mon Apr 14.  [[Class meeting for 10-605 Parallel Similarity Joins|Parallel/Scalable Similarity Joins]]
 
** '''Assignment due: Subsampling and visualizing a graph.'''
 
** '''Assignment due: Subsampling and visualizing a graph.'''
** ''New Assignment: K-Means on MapReduce.'' [http://www.cs.cmu.edu/~wcohen/10-605/assignments/kmeans.pdf PDF writeup]
+
** ''New Assignment: Workflows with Pig'' [http://curtis.ml.cmu.edu/w/courses/images/4/46/Nb_pig.pdf PDF handout]
* Mon Apr 8. [[Class meeting for 10-605 2013 04 08|Fast KNN and similarity joins 1.]]
+
* Wed Apr 16. [[Class meeting for 10-605 First-Order Logics|First-order logics]]
* Wed Apr 10. [[Class meeting for 10-605 2013 04 10|Fast KNN and similarity joins 2.]]
+
* Mon Apr 21. [[Class meeting for 10-605 Scalable FOL|Scalable First-order logics]]
* Mon Apr 15. [[Class meeting for 10-605 2013 04 15|Scaling up decision tree learning]]
+
* Wed Apr 23.   [[Class meeting for 10-605 GraphLab|Graph models for large-scale ML]]
** '''Project progress report due'''
+
** '''Assignment due: Workflows with Pig'''
* Wed Apr 17. [[Class meeting for 10-605 2013 04 17|Gradient boosting with trees, and SGD for matrix factorization]]
+
* Mon Apr 28. Exam review session.  
** '''Assignment due: K-Means on MapReduce.'''
+
** [http://curtis.ml.cmu.edu/w/courses/images/0/0a/Practice_questions.pdf PDF practice questions]
** ''New Assignment: Multi-class image classification or scalable classification using a linear classifier.''  Both of these count as one assignment toward your six.
+
** [http://www.cs.cmu.edu/~wcohen/10-605/exam-review.pptx Review session slides]
*** [http://www.cs.cmu.edu/~wcohen/10-605/assignments/image.pdf PDF writeup of image-classification assignment]
+
* Wed Apr 30. In-class exam.
*** [http://www.cs.cmu.edu/~wcohen/10-605/assignments/big-classifier.pdf PDF writeup of scalable classification]
 
* Mon Apr 22. ''Guest lecture, Evangelos Papalexakis, on Scalable Tensor Methods.''
 
Project reports: '''Please upload your slides to Blackboard before the class, by *1:00pm*'''
 
* Wed Apr 24. Project reports.
 
** Team1: Namit Shetty, Namit Katariya
 
** Team2: Jieru Shi, Luzheng Sheng
 
** Team3: Edward Zhang, Weihua Cao, Yue Ma
 
** Team4: Yibin Lin, Yu Gong
 
** Team5: Sukhada Palkar
 
** Team6: Han Yang, Qiangjian Xi
 
** Team7: Russell Cullen, Jonathan Hsu
 
* Mon Apr 29. Project reports.
 
** Team8: Andrea Klein, Dipan Pal
 
** Team9: Zeyuan Li, Pengqi Liu, Fei Xie
 
** Team10: Yiwen Chen, Zhiqi Li, Yuliang Yin
 
** Team11: Ye Zhang, Hao Chen, Qi Wang
 
** Team12: Chunlei Liu, Zhen Tang
 
** Team13: Zaid Sheikh, Shourabh Rawat, Sushant Kumar
 
** Team14: Huanchen Zhang, Mengwei Ding
 
* Wed May 1. Project reports.
 
** Team15: Shu-Hao Yu, Guanyu Wang, Mayank Mohta
 
** Team16: Li Lu, Chun Chen, Yuchen Tian
 
** Team17: Shannon Quinn
 
** Team18: Avesh Singh, Adam Mihalcin
 
** Team19: Yubin Kim, Juan Manuel Caicedo Carvajal
 
** Team20: Yue Yu, Jie Dai, Mayank Ketkari
 
** Team21: Varuni Gang, Alkeshkumar Patel
 
** '''Assignment due: Multi-class image classification or scalable classification.'''
 
 
 
== May ==
 
 
 
* 9am, Tuesday, May 7.  '''Project writeups due'''.  Submit a paper to Blackboard in PDF in the [http://icml.cc/2013/wp-content/uploads/2012/12/icml2013stylefiles.tar.gz ICML 2013 format] (minimum 5pp, up to 8pp double column), except, of course, do not submit it anonymously.
 
** ''Note: this is extended from previous deadline of Fri May 3---but I can't give any further extensions!''  Your project report should discuss
 
*** The problem you're trying to solve, and why it's important and/or interesting.
 
*** Related work, especially any related work that you're building on.
 
*** The data that you're working with.
 
*** The methods that you're using (in some detail - even if these are off-the-shelf methods, I want to know that you understand them)
 
*** The experiments you did, the metrics you used to evaluate them, and the results.
 
*** What was learned from the experiments (the conclusions).
 
** You should think of this as an exercise in writing a conference-style paper: so try and write in that style.  (Of course, your work doesn't need to advance the state-of-the-art in machine learning, or be highly novel, but it should be well-described.)
 

Latest revision as of 17:09, 2 June 2014

This is the syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014.

Notes:

  • The assignments are from 2013, and will be modified over the course of the semester - some may be changed substantially.
  • Lecture notes will be posted around the time of the lectures.
