# Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2017

This is the syllabus for Machine Learning with Large Datasets 10-605 in Fall 2017.

## Contents

### Notes

- Homeworks, unless otherwise posted, will be due when the next HW comes out.
- Lecture notes and/or slides will be (re)posted around the time of the lectures.
- Classes are cancelled on Sept 21 (Rosh Hashana).
- **No classes will be held on Nov 23 (Thanksgiving).**

### 605/805 Project Schedule

Schedule for 805 projects:

- 11:59pm Sun 10/1: Initial 805 project proposal due.
  - 605 students: Initial proposals from 805 students who are looking for 605 partners are here. Unfortunately there are only 4 such proposals this fall.
- 11:59pm Sun 10/15: Final 805 project proposal due.
  - This is a revised writeup that addresses any comments William raises on the initial proposal.
- 11:59pm Sun 11/12: Midterm 805 project report due.
- **1:30-2:50pm Tues 12/5: Project presentations** (in class).
- 11:59pm Sun 12/10: Final 805 project writeup due.

### 605 ICLR Reproducibility Projects

605 students can also, by permission, enter the ICLR reproducibility challenge. The purpose of this project is to reproduce results from a paper submitted to ICLR-2018, and to perform new experiments based on additional baselines, additional research questions (e.g., parameter sensitivity), or reimplementations of methods based on their published description. An acceptable result might confirm the submitted results or refute them, e.g., by showing that a well-tuned simpler baseline outperforms a newly proposed method. Deadlines for this type of project are:

- 11:59pm Thurs 10/26: Get permission to form a team. To request permission, team members should send their CVs in a single email to William with the subject "Reproducibility project team". Teams should be two people, but I may allow teams of one or three if there's a good reason. Send your team proposal earlier if you can.
- 11:59pm Tues 10/31: Declare the team and the paper being reproduced, and obtain instructor permission to do the project.
- 11:59pm Sun 11/12: First project report due, including the motivation for studying this problem; the computational tools that will be needed; an assessment of how familiar the team members are with those tools; the plans for experimentation; and a precise estimate of the resources that will be needed for the experiments. Students might want to consider using CodaLab for experiments.
- 11:59pm Tues 11/28: Second project report due, including results from initial experiments.
- 11:59pm Sun 12/10: Final project writeup due, along with a pointer to a code repository (GitHub or similar) for the experiments.

Students doing this type of project will be graded **the same as 605 students participating in a project**, not leading a project (i.e., they complete 5/7 homeworks, and the project is 30% of their grade). **The first and second project reports are each worth 5%, and the final report 20%.**

### Schedule for lectures and 605 assignments

- Tues Aug 29, 2017 Overview. Grading policies etc., History of Big Data, Complexity theory and cost of important operations
- Thurs Aug 31, 2017 Probability Review. Counting for big data and density estimation, streaming Naive Bayes, Rocchio and TFIDF
  - **Start work on** Assignment 1a: Streaming NB; writeup here

- Tues Sep 5, 2017 Streaming Naive Bayes. Notes on scalable Naive Bayes, Alternatives to stream and sort, Local counting in stream and sort, Stream and sort examples
- Thurs Sep 7, 2017 Hadoop Overview. Intro to Hadoop, Hadoop Streaming, Debugging Hadoop, Combiners
  - **Start work on** Assignment 1b: Streaming NB on Hadoop; writeup here

- Tues Sep 12, 2017 Workflows For Hadoop 1. Scalable classification, Abstractions for map-reduce algorithms, Joins in Hadoop
- Thurs Sep 14, 2017 Workflows For Hadoop 2. Guinea Pig intro, Similarity joins, Similarity joins with TFIDF
  - **Start work on** Assignment 2: Naive Bayes testing in Guinea Pig; writeup here (Log in to Autolab before following the link.)

- Tues Sep 19, 2017 Workflows For Hadoop 3. PageRank, Spark, Phrase finding
- Tues Sep 26, 2017 SGD and Hash Kernels. Learning as optimization, Logistic regression with SGD, Regularized SGD, Efficient regularized SGD, Hash kernels for logistic regression
- Thurs Sep 28, 2017 Parallel Perceptrons 1. The "delta trick", Averaged perceptrons, Debugging ML algorithms
  - **Start work on** Assignment 3: scalable SGD; writeup here

- Tues Oct 3, 2017 Parallel Perceptrons 2. Hash kernels, Ranking perceptrons
- Thurs Oct 5, 2017 Parallel Perceptrons 3. Structured perceptrons, Iterative parameter mixing paper
- Tues Oct 10, 2017 SGD for MF. Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD
- Thurs Oct 12, 2017 Midterm review and catchup. Midterm review
  - **Last assignment due**

- Tues Oct 17, 2017 Midterm.
- Thurs Oct 19, 2017 Computing with GPUs.
- Tues Oct 24, 2017 Deep Learning 1. Deep learning intro, BackProp following Nielsen, Expressiveness of MLPs, Deep learning and GPUs, Exploding and vanishing gradients, Modern deep learning models
- Thurs Oct 26, 2017 Deep Learning 2. Reverse-mode differentiation, Some systems using autodiff, Details on Wengert lists, Breakdown of xman.py
  - **Start work on** Assignment 4: Autodiff with IPM part 1/2; writeup here

- Tues Oct 31, 2017 Deep Learning 3. Recursive ANNs, Convolutional ANNs
- Thurs Nov 2, 2017 Randomized Algorithms 1. Bloom filters, The countmin sketch, CM Sketches in Deep Learning
- Tues Nov 7, 2017 Randomized Algorithms 2. Review of Bloom filters, Locality sensitive hashing, Online LSH
  - **Start work on** Assignment 5: Autodiff with IPM part 2/2; writeup here

- Thurs Nov 9, 2017 Graph Architectures for ML. Graph-based ML architectures, Pregel, Signal-collect, GraphLab, PowerGraph, GraphChi, GraphX
- Tues Nov 14, 2017 SSL on Graphs. Semi-supervised learning intro, Multirank-walk SSL method, Harmonic fields, Modified Adsorption SSL method, MAD with countmin sketches
  - **Start work on** Assignment 6: Label Propagation with Spark; writeup here

- Thurs Nov 16, 2017 LDA 1. DGMs for naive Bayes, Gibbs sampling for LDA
- Tues Nov 21, 2017 LDA 2. Parallelizing LDA, Fast sampling for LDA, DGMs for graphs
- Tues Nov 28, 2017 Parameter Servers. Parameter servers, PS vs Hadoop, Stale Synchronous Parallel (SSP) model, Managed Communication in PS, LDA Sampler with PS
  - **Start work on** Assignment 7: LDA with a Parameter Server; writeup here

- Thurs Nov 30, 2017 Unsupervised Learning On Graphs. Spectral clustering, Power iteration clustering, Label propagation for clustering non-graph data, Label propagation for SSL on non-graph data
- Tues Dec 5, 2017 Project presentations and review for final.
  - **Last assignment due**

- Thurs Dec 7, 2017 Final Exam.