Difference between revisions of "Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015"

From Cohen Courses
(Created page with "This is the syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015. Notes: * The assignments posted are '''drafts''' based on the assignments from sprin...")
 
 
(78 intermediate revisions by 6 users not shown)
Line 2: Line 2:
  
 
Notes:  
 
* The assignments posted are '''drafts''' based on the assignments from spring 2015, and will be modified over the course of the semester - some may be changed substantially.
+
* Homeworks, unless otherwise posted, will be due when the next HW comes out.
 
* Lecture notes and/or slides will be (re)posted around the time of the lectures.
 
  
== September ==
+
Schedule:
 
 
 
* Tues Sep 1. [[Class meeting for 10-605 Overview|Overview of course, cost of various operations, asymptotic analysis.]]
 
 
* Thus Sep 3. [[Class meeting for 10-605 Probability Review|Review of probabilities, joint distributions and naive Bayes]]
 
 
* Tues Sep 8.  [[Class meeting for 10-605 Streaming Naive Bayes|Streaming algorithms and Naive Bayes; The stream-and-sort design pattern; Naive Bayes for large feature sets.]]
 
* Thus Sep 10. [[Class meeting for 10-605 Phase Finding|Messages, records and workflows; Phrase finding.]]
+
** HW1 out: streaming naive Bayes in Java. [https://s3.amazonaws.com/vincy/10605-15Fall/HW1_StreamingNB.pdf PDF Handout]
* Tues Sep 15. [[Class meeting for 10-605 Hadoop 1|Hadoop and Map-Reduce]]
+
* Thus Sep 10. [[Class meeting for 10-605 Phrase Finding|Phrase Finding]]
* Thus Sep 17. [[Class meeting for 10-605 PIG|PIG and Other Workflow Systems for Hadoop]]
+
* Tues Sep 15. [[Class meeting for 10-605 Phrases_with_Stream_and_Sort|Implementing Phrase Finding and Large-Data Testing for Naive Bayes with Stream-and-Sort]].
* Tues Sep 22. [[Class_meeting_for_10-605_Rocchio_and_On-line_Learning|Rocchio and TFIDF]]
+
** Lecture also discusses: map-reduce abstractions/dataflow
* Thus Sep 24. [[Class meeting for 10-605 Similarity Joins|Fast KNN and similarity joins]]
+
** Also: Guest lecture from Manik Varma, MSR.
* Tues Sep 29. [[Class meeting for 10-605 Parallel Perceptrons 1|Parallel Perceptrons 1]].
+
* Thus Sep 17. [[Class_meeting_for_10-605_Hadoop_Overview|Hadoop Overview]]
* Thus Sep 30. [[Class meeting for 10-605 Parallel Perceptrons 2|Parallel Perceptrons 2]].
+
** HW2 out: naive Bayes training on Hadoop in Java. [https://drive.google.com/file/d/0BzQQ-spWKjhUd0NXSTB6TW82LWM/view PDF Handout]
 
+
* Tues Sep 22 - Thus Sep 24. [[Class_meeting_for_10-605_Rocchio_and_Hadoop_Workflows|Hadoop Workflow Languages and Rocchio and TFIDF]]
== October ==
+
** Lecture also discusses: hadoop streaming, mrjob, cascading, pipes, scaling, hive, pig, spark, flink
  
* Tues Feb 17. [[Class meeting for 10-605 SGD and Hash Kernels|Scalable SGD and Hash Kernels]]
+
----
** ''HW3: Naive Bayes with Hadoop MapReduce''.  PDF Handouts: [http://www.andrew.cmu.edu/user/amaurya/docs/10605/homework3.pdf  HW3].
 
** ''For 10/11-805 students:'' '''initial draft of project proposal is due.'''  I will give you feedback on this, so please be clear about your proposal.  I'm expecting approximately one page.  You should discuss what dataset you plan to use, what results you hope to obtain, what baseline technique you will build on and/or compare to.  Also include a section saying if you have a partner; and if you are willing to work with/mentor one or more 605 students, and if so, how you anticipate them contributing to the project.
 
* Thus Feb 19. [[Class meeting for 10-605 Randomized|Randomized Algorithms 1]]
 
* Tues Feb 24. [[Class meeting for 10-605 Randomized|Randomized Algorithms 2]]
 
* Thus Feb 26. [[Class meeting for 10-605 SGD for MF|Matrix Factorization and SGD]]
 
  
== March ==
+
* Tues Sep 29. [[Class meeting for 10-605 Similarity Joins|Fast KNN and similarity joins]]
 +
** HW3 out: Naive Bayes in GuineaPig. [https://drive.google.com/file/d/0B-p8_eIVeEHFM1JOSGFWNFFJcU0/view PDF Handout]
 +
* Thus Oct 1. [[Class meeting for 10-605 SGD and Hash Kernels|Scalable SGD and Hash Kernels]]
 +
** For 805 students: an initial project proposal is due '''via email to wcohen+805@gmail.com'''. You will get feedback on it from the instructors, and it will also be posted to the class - mainly for 605 students that are interested in collaborating, but also for general interest.  Please be clear about your proposal. I'm expecting approximately one page. You should discuss what dataset you plan to use, what results you hope to obtain, what baseline technique you will build on and/or compare to. Also include a section saying if you have a partner; and if you are willing to work with/mentor one or more 605 students, and if so, how you anticipate them contributing to the project.
 +
* Tues Oct 6. [[Class meeting for 10-605 Parallel Perceptrons 1|Parallel Perceptrons 1]].
 +
* Thus Oct 8. [[Class meeting for 10-605 Parallel Perceptrons 2|Parallel Perceptrons 2]].
 +
* Tues Oct 13. [[Class meeting for 10-605 Advanced topics for SGD|More on parallel and streaming ML]]: Adaptive gradients, AllReduce, and Parameter Servers
 +
** HW4 out: streaming logistic regression classifier [http://curtis.ml.cmu.edu/w/courses/images/8/86/Sgd_fall15.pdf PDF Handout]
 +
* Thus Oct 15. [[Class meeting for 10-605 SGD for MF|Matrix Factorization and SGD]]
 +
** For 805 students: the final project proposal is due.
 +
* Tues Oct 20. Exam review tips ([http://www.cs.cmu.edu/~wcohen/10-605/midterm-review.pptx ppt], [http://www.cs.cmu.edu/~wcohen/10-605/midterm-review.pdf pdf]) and guest lecture from '''Mark Torrance of RocketFuel'''
 +
* Thus Oct 22. ''midterm exam''
 +
** [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2015-midterm.pdf practice questions for midterm - from 2015].  This document also identifies relevant questions from two previous review sheets:
 +
*** [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/s2014-final.pdf practice questions from final, 2014]
 +
*** [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/s2015-final.pdf practice questions for final, 2015]
 +
*** [http://www.cs.cmu.edu/~wcohen/10-605/midterm-review.pdf Some review tips - modified from last year's exam review session]
  
* Sun Mar 1.
+
----
** '''HW3 due: Naive Bayes with Hadoop MapReduce'''
 
** HW4: [http://www.andrew.cmu.edu/user/amaurya/docs/10605/homework4.pdf PDF writeup]
 
* Tues Mar 3. ''student presentations''
 
** Adams Wei Yu (weiyu at andrew): fast PPR on Map-Reduce [http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/ppr_mapreduce.pdf]
 
** Jakub Pachocki: factorization machines (and hash kernels?)  [http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/FM.pdf]
 
** <strike>Wanli Ma (wanlim at andrew): coresets for k-segmentation of streams</strike>
 
* Thus Mar 5. ''student presentations''
 
** Quiz: [https://qna-app.appspot.com/view.html?aglzfnFuYS1hcHByGQsSDFF1ZXN0aW9uTGlzdBiAgIDAvI2ZCAw]
 
** Matt Gardner (mg1 at cs): Large-scale extensions of the path ranking algorithm [http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/matt-805-presentation.pdf]
 
** Jesse Dodge (jessed at andrew): large-scale lasso regularization [http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/jesse.pdf]
 
** Ishan Misra (imisra at andrew): LSH for object detection [http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/ishan.pdf]
 
** ''HW5: memory-efficient SGD'' [http://www.andrew.cmu.edu/user/amaurya/docs/10605/homework5.pdf PDF handout]
 
** ''For 10/11-805 students:'' '''project proposal is due.'''  This must contain a complete description of the data you will use.
 
* Sat Mar 7 ('''extended from Friday'''):
 
** '''HW4 due: Phrase-finding with Hadoop'''
 
* Tues Mar 10. ''no class - spring break.''
 
* Thus Mar 12. ''no class - spring break.''
 
* Tues Mar 17. [[Class meeting for 10-605 Subsample A Graph|Scalable PageRank]] [http://curtis.ml.cmu.edu/w/courses/images/e/eb/ApproxPageRank.pdf PDF handout]
 
* Thus Mar 19. [[Class meeting for 10-605 Subsampling Graphs|Subsampling a graph with RWR]]
 
** '''HW5 due: memory-efficient SGD'''
 
** ''HW6: Subsampling and visualizing a graph.'' [http://bit.ly/605_hw6 PDF handout]
 
* Tues Mar 24.
 
** Student presentation: Rohan Ramanath, Bayesian Optimization
 
** Guest lecture: Dai Wei, CMU, Parameter servers.  ('''Note''': This will be very relevant for one of the later HWs) [https://dl.dropboxusercontent.com/u/65353654/daiwei01_release.pdf PDF] and [https://dl.dropboxusercontent.com/u/65353654/daiwei01_release.pptx ppt].
 
* Thus Mar 26. Guest lecture: D. Sculley, Google, TBA
 
* Tues Mar 31. [[Class meeting for 10-605 LDA 1|Sparse sampling and parallelization for LDA]]
 
  
== April and May ==
+
* Tues Oct 27. [[Class meeting for 10-605 Randomized|Randomized Algorithms 1]]
 +
* Thus Oct 29. [[Class meeting for 10-605 Randomized|Randomized Algorithms 2]]
 +
** HW5 out: dSGD for modeling text ([https://drive.google.com/file/d/0BzQQ-spWKjhUYUM1LUVZakx0ZlE/view])
 +
* Tues Nov 3. Finish up with randomized algorithms.
 +
* Thus Nov 5. [[Class meeting for 10-605 Subsample A Graph|Scalable PageRank]]
 +
* Tues Nov 10. [[Class_meeting_for_10-605_SSL_on_Graphs|SSL on Graphs]]
 +
* Thus Nov 12. [[Class meeting for 10-605 LDA 1|Sparse sampling and parallelization for LDA]]
 +
** HW6 out: approximate pagerank for sampling a graph ([https://goo.gl/ThtRc6])
 +
* Tues Nov 17.  ''Guest lecture, Chris Dyer.'' [http://demo.clab.cs.cmu.edu/cdyer/bigdata-cuda.pdf Learning with GPUs].
 +
* Thus Nov 19. ''Guest lecture: Aurick Qiao'', parameter servers [http://curtis.ml.cmu.edu/w/courses/images/8/85/Aurick_release.pptx ppt slides].
 +
* Tues Nov 24. [[Class meeting for 10-605 2013 LDA 2|Speeding up LDA-like models: All-reduce and other tricks]]
 +
** HW7 out: LDA with a param server ([http://curtis.ml.cmu.edu/w/courses/images/1/16/Hw7-lda-ps.pdf PDF handout])
 +
* Thus Nov 26. ''Happy Thanksgiving!''
  
* Wed April 1
+
----
** '''HW6 due: Subsampling and visualizing a graph.'''
 
** ''HW7: Matrix Factorization in Spark'' [http://www.andrew.cmu.edu/user/amaurya/docs/10605/homework7.pdf HW7 PDF Handout] [http://www.cs.cmu.edu/~yipeiw/TA605/hw7/eval2.pyc Evaluation Script] [http://www.cs.cmu.edu/~yipeiw/TA605/hw7/eval_acc.py Validation Script]
 
* Thus Apr 2. [[Class meeting for 10-605 2013 LDA 2|Speeding up LDA-like models: All-reduce and other tricks]]
 
* Tues Apr 7. Guest lecture - Alex Beutel, SGD for Tensors
 
** [http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/beutel.pptx Alex's slides]
 
** William's [http://www.cs.cmu.edu/~wcohen/10-605/HintsForMF.pptx hints for HW7 in PPT], [http://www.cs.cmu.edu/~wcohen/10-605/HintsForMF.pdf hints for HW7 in PDF]
 
* Thus Apr 9. Guest lecture - Alex Smola, [http://www.cs.cmu.edu/~wcohen/10-605/2015-guest-lecture/smola-param-serve.pdf Scalable parameter servers]
 
** If you don't like the MediaTech one, a [http://youtu.be/bFnUeYDBtbk YouTube video] of Alex's talk is also available.
 
* Mon Apr 13. '''Informal update due for students working on project teams.'''
 
** Each '''student working on a project''' should send to wcohen+805@gmail.com an update, between 1/2 page and 1 page long, saying what concrete tasks you've accomplished to date, how these tasks are part of the overall project (if you're not the only member), and what you plan to do between 4/13 and the presentation on 4/23. 
 
** Additionally, each '''project lead''' (i.e., each 805 student that has any 10-605 student working with them) should add a list of who's working on their project, and one line indicating if they're making good progress so far.
 
* Tues Apr 14.  [[Class_meeting_for_10-605_SSL_on_Graphs|SSL on Graphs]]
 
* Thus Apr 16. ''no class : carnival''
 
** '''HW7 due'''
 
** ''HW8: [http://bit.ly/605_hw8_ps Matrix factorization on parameter server]''
 
* Tues Apr 21.  [[Class meeting for 10-605 GraphLab|Graph models for large-scale ML]]
 
* Thus Apr 23.  ''Presentation for 10/11-805 projects''
 
* Tues Apr 28. Exam review session.
 
** '''HW8: due'''
 
** [http://curtis.ml.cmu.edu/w/courses/images/0/0a/Practice_questions.pdf PDF practice questions from 2014]
 
** [http://www.cs.cmu.edu/~wcohen/10-605/605_sample_questions.pdf PDF practice questions for 2015]
 
** [http://www.cs.cmu.edu/~wcohen/10-605/exam-review.pptx Review session slides],  [http://www.cs.cmu.edu/~wcohen/10-605/exam-review.pdf PDF]
 
* Thus Apr 30. In-class exam.
 
  
* Tues May 5.
+
* Tues Dec 1, Thus Dec 3. [[Class meeting for 10-605 GraphLab|Graph models for large-scale ML]]
** ''For 10/11-805 students:'' '''project reports''' are due
+
* Tues Dec 8.  Review and project presentations (15 min each):
 +
** Schedule:
 +
*** Bhuwan Dingra/Yun Fu
 +
*** Rohit Girdhar
 +
*** Siddha Ganju/Sravya Popuri/Srikant Avasarala
 +
*** Jingkun Gao/Yiming Gu
 +
** HW7 due
 +
* Thus Dec 10.  In-class final exam.
 +
* Tues Dec 15.  Writeups for 10-805 projects are due (at 11:59pm).
  
 
== Topics covered in previous years but not in 2015 ==
 
  
 +
*  [[Class meeting for 10-605 Scalable FOL|Scalable First-order logics]]
 
* [[Class meeting for 10-605 PIG|Workflows in PIG]]
 
* [[Class meeting for 10-605 First-Order Logics|First-order logics]]
+
* [[Class meeting for 10-605 Phase Finding|Phrase Finding]]
* [[Class meeting for 10-605 Scalable FOL|Scalable First-order logics]]
 
 
* [[Class meeting for 10-605 Parallel Similarity Joins|Scalable Similarity Joins]]
 
 +
* [[Class meeting for 10-605 Similarity Joins|Fast KNN and similarity joins]]
 
* [[Class meeting for 10-605 Rocchio and On-line Learning|Messages, records and workflows; Rocchio]]
 
 +
* [http://www.cs.cmu.edu/~wcohen/10-605/schimmy.pptx Scalable pagerank - The Schimmy Pattern]
 
* [[Class meeting for 10-605 Spectral Clustering|Scalable spectral clustering techniques.]]
 
* [http://www.cs.cmu.edu/~wcohen/10-605/schimmy.pptx Scalable pagerank - The Schimmy Pattern]
 

Latest revision as of 10:07, 11 October 2016

This is the syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015.

Notes:

  • Homeworks, unless otherwise posted, will be due when the next HW comes out.
  • Lecture notes and/or slides will be (re)posted around the time of the lectures.

Schedule:




  • Tues Dec 1, Thus Dec 3. Graph models for large-scale ML
  • Tues Dec 8. Review and project presentations (15 min each):
    • Schedule:
      • Bhuwan Dingra/Yun Fu
      • Rohit Girdhar
      • Siddha Ganju/Sravya Popuri/Srikant Avasarala
      • Jingkun Gao/Yiming Gu
    • HW7 due
  • Thus Dec 10. In-class final exam.
  • Tues Dec 15. Writeups for 10-805 projects are due (at 11:59pm).

Topics covered in previous years but not in 2015