Difference between revisions of "Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2015"

From Cohen Courses
(Created page with "This is the syllabus for Machine Learning with Large Datasets 10-605 in Spring 2015. Notes: * The assignments are from 2014, and will be modified over the course of th...")
 
 
(134 intermediate revisions by 5 users not shown)
Line 2: Line 2:
  
 
Notes:
* The assignments are from 2014, and will be modified over the course of the semester - some may be changed substantially.
+
* The assignments posted are '''drafts''' based on the assignments from 2014, and will be modified over the course of the semester - some may be changed substantially.
* Lecture notes and/or slides will be posted around the time of the lectures.
+
* Lecture notes and/or slides will be (re)posted around the time of the lectures.
  
 
== January ==
  
* Mon Jan 13. [[Class meeting for 10-605 Overview|Overview of course, cost of various operations, asymptotic analysis.]]
+
* Tues Jan 13. [[Class meeting for 10-605 Overview|Overview of course, cost of various operations, asymptotic analysis.]]
* Wed Jan 15. [[Class meeting for 10-605 Probability Review|Review of probabilities, joint distributions and naive Bayes]]
+
* Thurs Jan 15. [[Class meeting for 10-605 Probability Review|Review of probabilities, joint distributions and naive Bayes]]
* Mon Jan 20. ''No class - Martin Luther King Day.''
+
** ''HW1A: streaming Naive Bayes 1 (with feature counts in memory)''. [http://www.cs.cmu.edu/~yipeiw/TA605/hw1A/hashtable-nb_s15.pdf PDF Handout]
* Wed Jan 22. [[Class meeting for 10-605 Streaming Naive Bayes|Streaming algorithms and Naive Bayes; The stream-and-sort design pattern; Naive Bayes for large feature sets.]]
+
* Tues Jan 20. [[Class meeting for 10-605 Streaming Naive Bayes|Streaming algorithms and Naive Bayes; The stream-and-sort design pattern; Naive Bayes for large feature sets.]]
** ''New Assignment: streaming Naive Bayes 1 (with feature counts in memory)''. [http://curtis.ml.cmu.edu/w/courses/images/6/6d/Hashtable-nb.pdf PDF Handout]
+
** ''HW1B: streaming Naive Bayes 2 (with feature counts on disk) with stream-and-sort''. [http://www.cs.cmu.edu/afs/cs/user/rgoutam/www/Assignment_1b.pdf PDF Handout]
* Mon Jan 27. [[Class meeting for 10-605 Phase Finding|Messages and records 1; Phrase finding.]]
+
** For 10/11-805 students: '''a one-paragraph summary of a recent research result you'd like to present is due.'''  If you're planning/hoping to transfer from 605, but haven't yet transferred, then also submit this assignment.  Email to wcohen+805 AT gmail.com with the subject "Presentation" and include, in addition to your summary:
** '''Assignment due: streaming Naive Bayes 1 (with feature counts in memory)'''. 
+
*** Your name and Andrew ID
* Wed Jan 29. [[Class meeting for 10-605 Rocchio and On-line Learning|Phrase Finding and Rocchio]]
+
*** A link to the paper
** ''New Assignment: streaming Naive Bayes 2 (with feature counts on disk) with stream-and-sort''. [http://curtis.ml.cmu.edu/w/courses/images/0/0d/Stream-nb.pdf PDF Handout]
+
*** Your best guess as to what lectures should precede the presentation
* Thursday Jan 30. Scheduled '''down-time for the wiki host'''. (Obviously, it's up again now!)
+
*** '''Due by 11:59:59pm EST Tuesday.'''
 +
* Thurs Jan 22. [[Class meeting for 10-605 Phase Finding|Messages, records and workflows; Phrase finding.]]
 +
* Tues Jan 27. [[Class meeting for 10-605 Hadoop 1|Hadoop and Map-Reduce]]
 +
* Thurs Jan 29. [[Class meeting for 10-605 PIG|PIG and Other Workflow Systems for Hadoop]]
 +
** '''HW1A and HW1B due.'''
 +
** ''HW2: phrase finding with stream-and-sort''. [http://www.cs.cmu.edu/~yipeiw/TA605/phrases.pdf PDF Handout] [http://www.cs.cmu.edu/~yipeiw/TA605/stopword.list Stopword List]
  
 
== February ==
  
* Mon Feb 3. [[Class meeting for 10-605 Parallel Perceptrons|Rocchio and Parallel Perceptrons]]
+
* Tues Feb 3. [[Class_meeting_for_10-605_Rocchio_and_On-line_Learning|Rocchio and TFIDF]]
* Wed Feb 5. [[Class meeting for 10-605 Hadoop 1|Perceptrons/Map-reduce and Hadoop]].
+
* Thurs Feb 5. [[Class meeting for 10-605 Similarity Joins|Fast KNN and similarity joins]]
** '''Assignment due: streaming Naive Bayes 2 (with feature counts on disk) with stream-and-sort'''
+
* Tues Feb 10. [[Class meeting for 10-605 Parallel Perceptrons 1|Parallel Perceptrons 1]].
** ''New Assignment: phrase finding with stream-and-sort''. [http://curtis.ml.cmu.edu/w/courses/images/5/5e/Phrases.pdf PDF Handout]
+
* Thurs Feb 12. [[Class meeting for 10-605 Parallel Perceptrons 2|Parallel Perceptrons 2]].
* Mon Feb 10. [[Class meeting for 10-605 Parallel Perceptrons 2|Parallel Perceptrons]].
+
** '''HW2 due: phrase finding with stream-and-sort'''
* Wed Feb 12. ''Guest lecture: Matt Hurst, Microsoft/Bing: Local Search at Bing''.  One-on-one meetings with Matt can be scheduled for Thursday 2/13: 9-12 in Gates-Hillman 6501, or 12:30-1:30pm in '''Gates-Hillman 6002'''.
+
* Tues Feb 17. [[Class meeting for 10-605 SGD and Hash Kernels|Scalable SGD and Hash Kernels]]
* Mon Feb 17. [[Class meeting for 10-605 SGD and Hash Kernels|Scalable SGD and Hash Kernels]]
+
** ''HW3: Naive Bayes with Hadoop MapReduce''. PDF Handouts: [http://www.andrew.cmu.edu/user/amaurya/docs/10605/homework3.pdf HW3].
** '''Assignment due: phrase finding with stream-and-sort'''
+
** ''For 10/11-805 students:'' '''initial draft of project proposal is due.'''  I will give you feedback on this, so please be clear about your proposal. I'm expecting approximately one page. You should discuss what dataset you plan to use, what results you hope to obtain, what baseline technique you will build on and/or compare to.  Also include a section saying if you have a partner; and if you are willing to work with/mentor one or more 605 students, and if so, how you anticipate them contributing to the project.
** ''New Assignments: Naive Bayes with Streaming Hadoop,  Naive Bayes with Hadoop & Phrase-finding with Hadoop''. [http://curtis.ml.cmu.edu/w/courses/images/c/c0/Homework4a.pdf PDF Handout (4a)][http://curtis.ml.cmu.edu/w/courses/images/a/a2/Homework4b.pdf PDF Handout (4b)][http://curtis.ml.cmu.edu/w/courses/images/3/30/Homework4c.pdf PDF Handout (4c)]
+
* Thurs Feb 19. [[Class meeting for 10-605 Randomized|Randomized Algorithms 1]]
* Wed Feb 19. [[Class meeting for 10-605 SGD for MF|Matrix Factorization and SGD, plus another Hadoop demo]]
+
* Tues Feb 24. [[Class meeting for 10-605 Randomized|Randomized Algorithms 2]]
* Fri Feb 21. ''Nothing due - the streaming run for Naive Bayes, 4(a), has been postponed till Monday.''
+
* Thurs Feb 26. [[Class meeting for 10-605 SGD for MF|Matrix Factorization and SGD]]
* Mon Feb 24. [[Class meeting for 10-605 SGD for MF 2 and Randomized Algorithms|SGD for Matrix Factorization, and Randomized Algorithms 1 (Bloom Filters)]]
 
** '''Streaming run on Hadoop of Naive Bayes due'''
 
* Wed Feb 26. [[Class meeting for 10-605 Graphs 2|Randomized Algorithms]]
 
* Fri Feb 28.
 
** '''Non-streaming run on Hadoop of Naive Bayes due.'''
 
  
 
== March ==
  
* Mon Mar 3. ''Guest Lecture: Garth Gibson, Cloud Computing and Programming Paradigms''  
+
* Sun Mar 1.
** Slides: [http://www.cs.cmu.edu/~wcohen/10-605/garth-Intro.pptx Intro], [http://www.cs.cmu.edu/~wcohen/10-605/garth-MapReduce_majd.pdf Mapreduce], [http://www.cs.cmu.edu/~wcohen/10-605/garth-Programming.pptx Programming], [http://www.cs.cmu.edu/~wcohen/10-605/garth-UseCases.pptx Use Cases]
+
** '''HW3 due: Naive Bayes with Hadoop MapReduce'''
* Wed Mar 5. ''Guest lecture: Alex Beutel, SGD on Hadoop''  
+
** HW4: [http://www.andrew.cmu.edu/user/amaurya/docs/10605/homework4.pdf PDF writeup]
** [http://www.cs.cmu.edu/~wcohen/10-605/alex-beutel.pptx Slides]
+
* Tues Mar 3. ''student presentations''
* Fri Mar 7.
+
** Adams Wei Yu (weiyu at andrew): fast PPR on Map-Reduce [http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/ppr_mapreduce.pdf]
** '''Hadoop assignment (phrase-finding) due'''
+
** Jakub Pachocki: factorization machines (and hash kernels?)  [http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/FM.pdf]
* Mon Mar 10. ''no class - spring break.''
+
** <strike>Wanli Ma (wanlim at andrew): coresets for k-segmentation of streams</strike>
* Wed Mar 12. ''no class - spring break.''
+
* Thurs Mar 5. ''student presentations''
* Mon Mar 17. [[Class meeting for 10-605 Subsample A Graph|Scalable PageRank]]
+
** Quiz: [https://qna-app.appspot.com/view.html?aglzfnFuYS1hcHByGQsSDFF1ZXN0aW9uTGlzdBiAgIDAvI2ZCAw]
** ''New Assignment: memory-efficient SGD'' [http://curtis.ml.cmu.edu/w/courses/images/0/08/Sgd.pdf PDF handout]
+
** Matt Gardner (mg1 at cs): Large-scale extensions of the path ranking algorithm [http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/matt-805-presentation.pdf]
* Wed Mar 19. [[Class meeting for 10-605 Subsampling Graphs|Subsampling a graph with RWR]]
+
** Jesse Dodge (jessed at andrew): large-scale lasso regularization [http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/jesse.pdf]
* Mon Mar 24. [[Class meeting for 10-605 SSL on Graphs|Subsampling continued and SSL on Graphs]]
+
** Ishan Misra (imisra at andrew): LSH for object detection [http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/ishan.pdf]
* Wed Mar 26. [[Class meeting for 10-605 Spectral Clustering|Scalable spectral clustering techniques.]]
+
** ''HW5: memory-efficient SGD'' [http://www.andrew.cmu.edu/user/amaurya/docs/10605/homework5.pdf PDF handout]
** <strike>Assignment due: memory-efficient SGD</strike> delayed to Mon 3/31
+
** ''For 10/11-805 students:'' '''project proposal is due.'''  This must contain a complete description of the data you will use.
* Mon Mar 31. [[Class meeting for 10-605 LDA 1|Sparse sampling and parallelization for LDA]]
+
* Sat Mar 7 ('''extended from Friday'''):
** '''Assignment due: memory-efficient SGD'''
+
** '''HW4 due: Phrase-finding with Hadoop'''
** ''New Assignment: Subsampling and visualizing a graph.'' [http://curtis.ml.cmu.edu/w/courses/images/e/eb/ApproxPageRank.pdf PDF handout]
+
* Tues Mar 10. ''no class - spring break.''
 +
* Thurs Mar 12. ''no class - spring break.''
 +
* Tues Mar 17. [[Class meeting for 10-605 Subsample A Graph|Scalable PageRank]] [http://curtis.ml.cmu.edu/w/courses/images/e/eb/ApproxPageRank.pdf PDF handout]
 +
* Thurs Mar 19. [[Class meeting for 10-605 Subsampling Graphs|Subsampling a graph with RWR]]
 +
** '''HW5 due: memory-efficient SGD'''  
 +
** ''HW6: Subsampling and visualizing a graph.'' [http://bit.ly/605_hw6 PDF handout]
 +
* Tues Mar 24.  
 +
** Student presentation: Rohan Ramanath, Bayesian Optimization
 +
** Guest lecture: Dai Wei, CMU, Parameter servers. ('''Note''': This will be very relevant for one of the later HWs) [https://dl.dropboxusercontent.com/u/65353654/daiwei01_release.pdf PDF] and [https://dl.dropboxusercontent.com/u/65353654/daiwei01_release.pptx ppt].
 +
* Thurs Mar 26. Guest lecture: D. Sculley, Google, TBA
 +
* Tues Mar 31. [[Class meeting for 10-605 LDA 1|Sparse sampling and parallelization for LDA]]
  
== April and May ==
+
== April and May ==
  
* Wed Apr 2. [[Class meeting for 10-605 2013 LDA 2|Speeding up LDA-like models: All-reduce and online LDA]]
+
* Wed Apr 1.
* Mon Apr 7. [[Class meeting for 10-605 PIG|Workflows in PIG]]
+
** '''HW6 due: Subsampling and visualizing a graph.'''
* Wed Apr 9. [[Class meeting for 10-605 Similarity Joins|Fast KNN and similarity joins]]
+
** ''HW7: Matrix Factorization in Spark'' [http://www.andrew.cmu.edu/user/amaurya/docs/10605/homework7.pdf HW7 PDF Handout] [http://www.cs.cmu.edu/~yipeiw/TA605/hw7/eval2.pyc Evaluation Script][http://www.cs.cmu.edu/~yipeiw/TA605/hw7/eval_acc.py Validation Script]
* Mon Apr 14. [[Class meeting for 10-605 Parallel Similarity Joins|Parallel/Scalable Similarity Joins]]
+
* Thurs Apr 2. [[Class meeting for 10-605 2013 LDA 2|Speeding up LDA-like models: All-reduce and other tricks]]
** '''Assignment due: Subsampling and visualizing a graph.'''
+
* Tues Apr 7. Guest lecture - Alex Beutel, SGD for Tensors
** ''New Assignment: Workflows with Pig'' [http://curtis.ml.cmu.edu/w/courses/images/4/46/Nb_pig.pdf PDF handout]
+
** [http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/beutel.pptx Alex's slides]
* Wed Apr 16.  [[Class meeting for 10-605 First-Order Logics|First-order logics]]
+
** William's [http://www.cs.cmu.edu/~wcohen/10-605/HintsForMF.pptx hints for HW7 in PPT],[http://www.cs.cmu.edu/~wcohen/10-605/HintsForMF.pdf Hints for HW7 in PDF]
* Mon Apr 21. [[Class meeting for 10-605 Scalable FOL|Scalable First-order logics]]
+
* Thurs Apr 9. Guest lecture - Alex Smola, [http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/smola-param-serve.pdf Scalable parameter servers]
* Wed Apr 23.  [[Class meeting for 10-605 GraphLab|Graph models for large-scale ML]]
+
** If you don't like the MediaTech one, a [http://youtu.be/bFnUeYDBtbk YouTube video] of Alex's talk is also available.
** '''Assignment due: Workflows with Pig'''
+
* Mon Apr 13. '''Informal update due for students working on project teams.'''
* Mon Apr 28. Exam review session.  
+
** Each '''student working on a project''' should send to wcohen+805@gmail.com an update, between 1/2 page and 1 page long, saying what concrete tasks you've accomplished to date, how these tasks are part of the overall project (if you're not the only member), and what you plan to do between 4/13 and the presentation on 4/23.
** [http://curtis.ml.cmu.edu/w/courses/images/0/0a/Practice_questions.pdf PDF practice questions]
+
** Additionally, each '''project lead''' (i.e., each 805 student that has any 10-605 student working with them) should add a list of who's working on their project, and one line indicating if they're making good progress so far.
** [http://www.cs.cmu.edu/~wcohen/10-605/exam-review.pptx Review session slides]
+
* Tues Apr 14.  [[Class_meeting_for_10-605_SSL_on_Graphs|SSL on Graphs]]
* Wed Apr 30. In-class exam.
+
* Thurs Apr 16. ''no class - carnival''
 +
** '''HW7 due'''
 +
** ''HW8:'' [http://bit.ly/605_hw8_ps Matrix factorization on parameter server]
 +
* Tues Apr 21.  [[Class meeting for 10-605 GraphLab|Graph models for large-scale ML]]
 +
* Thurs Apr 23. ''Presentations for 10/11-805 projects''
 +
* Tues Apr 28. Exam review session.  
 +
** '''HW8 due'''
 +
** [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/s2014-final.pdf PDF practice questions from 2014]
 +
** [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/s2015-final.pdf PDF practice questions for 2015]
 +
** [http://www.cs.cmu.edu/~wcohen/10-605/exam-review.pptx Review session slides],  [http://www.cs.cmu.edu/~wcohen/10-605/exam-review.pdf PDF]
 +
* Thurs Apr 30. In-class exam.
 +
 
 +
* Tues May 5.
 +
** ''For 10/11-805 students:'' '''project reports''' are due
 +
 
 +
== Topics covered in previous years but not in 2015 ==
 +
 
 +
* [[Class meeting for 10-605 PIG|Workflows in PIG]]
 +
* [[Class meeting for 10-605 First-Order Logics|First-order logics]]
 +
* [[Class meeting for 10-605 Scalable FOL|Scalable First-order logics]]
 +
* [[Class meeting for 10-605 Parallel Similarity Joins|Scalable Similarity Joins]]
 +
* [[Class meeting for 10-605 Rocchio and On-line Learning|Messages, records and workflows; Rocchio]]
 +
* [[Class meeting for 10-605 Spectral Clustering|Scalable spectral clustering techniques.]]
 +
* [http://www.cs.cmu.edu/~wcohen/10-605/schimmy.pptx Scalable pagerank - The Schimmy Pattern]

Latest revision as of 14:50, 14 October 2015

This is the syllabus for Machine Learning with Large Datasets 10-605 in Spring 2015.

Notes:

  • The assignments posted are drafts based on the assignments from 2014, and will be modified over the course of the semester - some may be changed substantially.
  • Lecture notes and/or slides will be (re)posted around the time of the lectures.

January

February

March

  • Sun Mar 1.
    • HW3 due: Naive Bayes with Hadoop MapReduce
    • HW4: PDF writeup
  • Tues Mar 3. student presentations
    • Adams Wei Yu (weiyu at andrew): fast PPR on Map-Reduce [1]
    • Jakub Pachocki: factorization machines (and hash kernels?) [2]
    • Wanli Ma (wanlim at andrew): coresets for k-segmentation of streams
  • Thurs Mar 5. student presentations
    • Quiz: [3]
    • Matt Gardner (mg1 at cs): Large-scale extensions of the path ranking algorithm [4]
    • Jesse Dodge (jessed at andrew): large-scale lasso regularization [5]
    • Ishan Misra (imisra at andrew): LSH for object detection [6]
    • HW5: memory-efficient SGD PDF handout
    • For 10/11-805 students: project proposal is due. This must contain a complete description of the data you will use.
  • Sat Mar 7 (extended from Friday):
    • HW4 due: Phrase-finding with Hadoop
  • Tues Mar 10. no class - spring break.
  • Thurs Mar 12. no class - spring break.
  • Tues Mar 17. Scalable PageRank PDF handout
  • Thurs Mar 19. Subsampling a graph with RWR
    • HW5 due: memory-efficient SGD
    • HW6: Subsampling and visualizing a graph. PDF handout
  • Tues Mar 24.
    • Student presentation: Rohan Ramanath, Bayesian Optimization
    • Guest lecture: Dai Wei, CMU, Parameter servers. (Note: This will be very relevant for one of the later HWs) PDF and ppt.
  • Thurs Mar 26. Guest lecture: D. Sculley, Google, TBA
  • Tues Mar 31. Sparse sampling and parallelization for LDA

April and May

  • Tues May 5.
    • For 10/11-805 students: project reports are due

Topics covered in previous years but not in 2015