Difference between revisions of "Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2017"

From Cohen Courses
Revision as of 14:04, 20 October 2017

This is the syllabus for Machine Learning with Large Datasets 10-605 in Fall 2017.

Notes

  • Homeworks, unless otherwise posted, will be due when the next HW comes out.
  • Lecture notes and/or slides will be (re)posted around the time of the lectures.
  • Classes are cancelled for Sept 21 (Rosh Hashana)
  • No classes will be held on Nov 23 (Thanksgiving)

605/805 Project Schedule

Schedule for 805 projects:

  • 1:30-2:50pm Tues 12/5: Project presentations (in class).
  • 11:59pm Sun 12/10: Final 805 project writeup due.

605 ICLR Reproducibility Projects

605 students can also, by permission, enter the ICLR reproducibility challenge (http://www.cs.mcgill.ca/~jpineau/ICLR2018-ReproducibilityChallenge.html). The purpose of this project is to reproduce results from a paper submitted to ICLR-2018 and to perform new experiments based on additional baselines, additional research questions (e.g., parameter sensitivity), or reimplementations of methods based on their published description. An acceptable result might confirm the submitted results or refute them, e.g., by showing that a well-tuned simpler baseline outperforms a newly proposed method. Deadlines for this type of project are:

  • 11:59pm Mon 10/31: Declare the team and the paper being reproduced, and obtain instructor permission to do a project. (For permission, team members should send their CVs.)
  • 11:59pm Sun 11/12: First project report due, including the motivation for studying this problem; the computational tools that will be needed; an assessment of how familiar the team members are with the tools; the plans for experimentation; and a precise estimate of the resources that will be needed for the experiments. Students might want to consider using CodaLab for experiments.
  • 11:59pm Tues 11/28: Second project report due, including results from initial experiments.
  • 11:59pm Sun 12/10: Final project writeup due, along with a pointer to a code repository (GitHub or similar) for the experiments.

Students doing this type of project will be graded the same as 605 students participating in a project, not leading a project (i.e., they complete 5 of the 7 homeworks, and the project is 30% of their grade). The first and second project reports are each worth 5%, and the final report 20%.

Schedule for lectures and 605 assignments

  • Tues Aug 29, 2017 Overview. Grading policies etc., History of Big Data, Complexity theory and cost of important operations
  • Thurs Aug 31, 2017 Probability Review. Counting for big data and density estimation, streaming Naive Bayes, Rocchio and TFIDF
    • Start work on Assignment 1a: Streaming NB; writeup here
  • Tues Sep 5, 2017 Streaming Naive Bayes. Notes on scalable Naive Bayes, Alternatives to stream and sort, Local counting in stream and sort, Stream and sort examples
  • Thurs Sep 7, 2017 Hadoop Overview. Intro to Hadoop, Hadoop Streaming, Debugging Hadoop, Combiners
    • Start work on Assignment 1b: Streaming NB on Hadoop; writeup here
  • Tues Sep 12, 2017 Workflows For Hadoop 1. Scalable classification, Abstracts for map-reduce algorithms, Joins in Hadoop
  • Thurs Sep 14, 2017 Workflows For Hadoop 2. Guinea Pig intro, Similarity joins, Similarity joins with TFIDF
    • Start work on Assignment 2: Naive Bayes testing in Guinea Pig; writeup here (Log in to Autolab before following the link.)
  • Tues Sep 19, 2017 Workflows For Hadoop 3. PageRank, Spark, Phrase finding
  • Tues Sep 26, 2017 SGD and Hash Kernels. Learning as optimization, Logistic regression with SGD, Regularized SGD, Efficient regularized SGD, Hash kernels for logistic regression
  • Thurs Sep 28, 2017 Parallel Perceptrons 1. The "delta trick", Averaged perceptrons, Debugging ML algorithms
    • Start work on Assignment 3: scalable SGD; writeup here
  • Tues Oct 3, 2017 Parallel Perceptrons 2. Hash kernels, Ranking perceptrons
  • Thurs Oct 5, 2017 Parallel Perceptrons 3. Structured perceptrons, Iterative parameter mixing paper
  • Tues Oct 10, 2017 SGD for MF. Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD
  • Thurs Oct 12, 2017 Midterm review and catchup. Midterm review
    • Last assignment due
  • Tues Oct 17, 2017 Midterm.
  • Thurs Oct 19, 2017 Computing with GPUs.
  • Tues Oct 24, 2017 Deep Learning 1. Deep learning intro, BackProp following Nielsen, Expressiveness of MLPs, Deep learning and GPUs, Exploding and vanishing gradients, Modern deep learning models
  • Thurs Oct 26, 2017 Deep Learning 2. Reverse-mode differentiation, Some systems using autodiff, Details on Wengert lists, Breakdown of xman.py
  • Tues Oct 31, 2017 Deep Learning 3. Recursive ANNs, Convolutional ANNs
  • Thurs Nov 2, 2017 Randomized Algorithms 1. Bloom filters, The countmin sketch
  • Tues Nov 7, 2017 Randomized Algorithms 2. Review of Bloom filters, Locality sensitive hashing, Online LSH
    • Start work on Assignment 5: Autodiff with IPM part 2/2
  • Thurs Nov 9, 2017 Graph Architectures for ML. Graph-based ML architectures, Pregel, Signal-collect, GraphLab, PowerGraph, GraphChi, GraphX
  • Tues Nov 14, 2017 SSL on Graphs. Semi-supervised learning intro, Multirank-walk SSL method, Harmonic fields, Modified Adsorption SSL method, MAD with countmin sketches
    • Start work on Assignment 6: SSL on a graph in Spark (possibly using NELL data)
  • Thurs Nov 16, 2017 Parameter Servers. Parameter servers, PS vs Hadoop, Stale Synchronous Parallel (SSP) model, Managed Communication in PS, LDA Sampler with PS
  • Tues Nov 21, 2017 LDA 1. DGMs for naive Bayes, Gibbs sampling for LDA
  • Tues Nov 28, 2017 LDA 2. Parallelizing LDA, Fast sampling for LDA, DGMs for graphs
  • Thurs Nov 30, 2017 Unsupervised Learning On Graphs. Spectral clustering, Power iteration clustering, Label propagation for clustering non-graph data, Label propagation for SSL on non-graph data
  • Tues Dec 5, 2017 Review session for final.
    • Last assignment due
  • Thurs Dec 7, 2017 Final Exam.