# Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015

This is the syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015.

Notes:

- Homeworks, unless otherwise posted, will be due when the next HW comes out.
- Lecture notes and/or slides will be (re)posted around the time of the lectures.

Schedule:

- Tues Sep 1. Overview of course, cost of various operations, asymptotic analysis.
- Thurs Sep 3. Review of probabilities, joint distributions, and naive Bayes
- Tues Sep 8. Streaming algorithms and Naive Bayes; The stream-and-sort design pattern; Naive Bayes for large feature sets.
  - HW1 out: streaming naive Bayes in Java. PDF Handout

- Thurs Sep 10. Phrase Finding
- Tues Sep 15. Implementing Phrase Finding and Large-Data Testing for Naive Bayes with Stream-and-Sort.
  - Lecture also discusses: map-reduce abstractions/dataflow
  - Also: Guest lecture from Manik Varma, MSR.

- Thurs Sep 17. Hadoop Overview
  - HW2 out: naive Bayes training on Hadoop in Java. PDF Handout

- Tues Sep 22 - Thurs Sep 24. Hadoop Workflow Languages and Rocchio and TFIDF
  - Lecture also discusses: hadoop streaming, mrjob, cascading, pipes, scaling, hive, pig, spark, flink

- Tues Sep 29. Fast KNN and similarity joins
  - HW3 out: Naive Bayes in GuineaPig. PDF Handout

- Thurs Oct 1. Scalable SGD and Hash Kernels
  - For 805 students: an initial project proposal is due **via email to wcohen+805@gmail.com**. You will get feedback on it from the instructors, and it will also be posted to the class, mainly for 605 students who are interested in collaborating, but also for general interest. Please be clear about your proposal. I'm expecting approximately one page. You should discuss what dataset you plan to use, what results you hope to obtain, and what baseline technique you will build on and/or compare to. Also include a section saying whether you have a partner, whether you are willing to work with or mentor one or more 605 students, and, if so, how you anticipate them contributing to the project.

- Tues Oct 6. Parallel Perceptrons 1.
- Thurs Oct 8. Parallel Perceptrons 2.
- Tues Oct 13. More on parallel and streaming ML: Adaptive gradients, AllReduce, and Parameter Servers
  - HW4 out: streaming logistic regression classifier. PDF Handout

- Thurs Oct 15. Matrix Factorization and SGD
  - For 805 students: the final project proposal is due.

- Tues Oct 20. Exam review tips (ppt, pdf) and guest lecture from **Mark Torrance of RocketFuel**
- Thurs Oct 22. *Midterm exam*
  - Practice questions for midterm - from 2015. This document also identifies relevant questions from two previous review sheets.

- Tues Oct 27. Randomized Algorithms 1
- Thurs Oct 29. Randomized Algorithms 2
  - HW5 out: dSGD for modeling text

- Tues Nov 3. Finish up with randomized algorithms.
- Thurs Nov 5. Scalable PageRank
- Tues Nov 10. SSL on Graphs
- Thurs Nov 12. Sparse sampling and parallelization for LDA
  - HW6 out: approximate PageRank for sampling a graph

- Tues Nov 17. *Guest lecture, Chris Dyer.* Learning with GPUs.
- Thurs Nov 19. *Guest lecture: Aurick Qiao*, parameter servers ppt slides.
- Tues Nov 24. Speeding up LDA-like models: All-reduce and other tricks
  - HW7 out: LDA with a param server (PDF handout)

- Thurs Nov 26. *Happy Thanksgiving!*

- Tues Dec 1, Thurs Dec 3. Graph models for large-scale ML
- Tues Dec 8. Review and project presentations (15 min each):
  - Schedule:
    - Bhuwan Dingra/Yun Fu
    - Rohit Girdhar
    - Siddha Ganju/Sravya Popuri/Srikant Avasarala
    - Jingkun Gao/Yiming Gu

  - HW7 due

- Thurs Dec 10. In-class final exam.
- Tues Dec 15. Writeups for 10-805 projects are due (at 11:59pm).