Difference between revisions of "Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018"

Revision as of 10:47, 21 February 2018

This is the syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018.

Ideas for extensions to the HW assignments

This is not a complete list! you can use any of these as a starting point, but feel free to think up your own extensions.

HW2 (NB in GuineaPig):

The assignment proposes one particular scheme for parallelizing the training/testing algorithm. Consider another parallelization algorithm.
Implement a similarly scalable Rocchio algorithm and compare it with NB.
Reimplement the same algorithm in Spark (or some other dataflow language) and compare.

HW3 (Logistic regression and SGD)

Evaluate the hash trick for Naive Bayes systematically on a series of datasets.
Implement a parameter-mixing version of logistic regression and evaluate it.
A recent paper proposes (roughly) using SVM with NB-transformed features. Implement this and compare.
The personalization method described in class is based on a transfer learning method which works similarly. Many wikipedia pages are available in multiple languages, and works in related languages tend to be lexically similar (eg, "astrónomo" is Spanish for "astronomer"). Suppose features were character n-grams (eg "astr", "stro", "tron", ...) - does domain transfer work for the task of classifying wikipedia pages? Construct a dataset and experiment to test this hypothesis.

Notes

Homeworks, unless otherwise posted, will be due when the next HW comes out.
Lecture notes and/or slides will be (re)posted around the time of the lectures.

Schedule

Wed Jan 17, 2018 Overview. Grading policies and etc, History of Big Data, Complexity theory and cost of important operations
Mon Jan 22, 2018 Probability Review. Counting for big data and density estimation, streaming Naive Bayes, Rocchio and TFIDF
- Start work on Assignment 1a: Streaming NB; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1a.pdf
Wed Jan 24, 2018 Streaming Naive Bayes. Notes on scalable naive bayes, Alternatives to stream and sort, Local counting in stream and sort, Stream and sort examples
Mon Jan 29, 2018 Hadoop Overview. Intro to Hadoop, Hadoop Streaming, Debugging Hadoop, Combiners
- Start work on Assignment 1b: Streaming NB on Hadoop; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1b.pdf
Wed Jan 31, 2018 Workflows For Hadoop 1. Scalable classification, Abstracts for map-reduce algorithms, Joins in Hadoop
Mon Feb 5, 2018 Workflows For Hadoop 2. Guinea Pig intro, Similarity joins, Similarity joins with TFIDF, Parallel simjoins
- Start work on Assignment 2a: Naive bayes training in Guinea Pig; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2a.pdf
Wed Feb 7, 2018 Workflows For Hadoop 3. PageRank, PageRank in Pig and Guinea Pig, K-means in Pig, Spark, Systems built on top of Hadoop
Mon Feb 12, 2018 SGD and Hash Kernels. Learning as optimization, Logistic regression with SGD, Regularized SGD, Efficient regularized SGD, Hash kernels for logistic regression
- Start work on Assignment 2b: Naive bayes testing in Guinea Pig. http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2b.pdf
Wed Feb 14, 2018 Parallel Perceptrons 1. The "delta trick", Averaged perceptrons, Debugging ML algorithms
Mon Feb 19, 2018 Parallel Perceptrons 2. Hash kernels, Ranking perceptrons, Structured perceptrons
- Start work on Assignment 3: scalable SGD; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw3.pdf
Wed Feb 21, 2018 Parallel Perceptrons 3. Iterative parameter mixing paper, Parallel SGD via Param Mixing
Mon Feb 26, 2018 SGD for MF. Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD
Wed Feb 28, 2018 Guest lecture - Kijung Shin.
Mon Mar 5, 2018 Midterm review and catchup. Midterm review
- Previous assignment due
Wed Mar 7, 2018 Midterm.
Mon Mar 19, 2018 Computing with GPUs. Introduction to GPUs, CUDA, Vectorization
- Start work on Assignment 4: Autodiff with IPM part 1/2; Draft at http://www.cs.cmu.edu/~wcohen/10-405/assignments/2016-fall/hw-5-autodiff/main.pdf
Wed Mar 21, 2018 Deep Learning 1. Deep learning intro, BackProp following Nielson, Expressiveness of MLPs, Deep learning and GPUs, Exploding and vanishing gradients, Modern deep learning models
Mon Mar 26, 2018 Deep Learning 2. Reverse-mode differentiation (autodiff), Some systems using autodiff, Details on Wengert lists, Breakdown of xman.py
Wed Mar 28, 2018 Deep Learning 3. Inputs, parameters, updates, Word2vec and GloVE, Recursive ANNs, Convolutional ANNs, Achitectures using RNNs
Mon Apr 2, 2018 Randomized Algorithms 1. Bloom filters, The countmin sketch, CM Sketches in Deep Learning
- Start work on Assignment 5: Autodiff with IPM part 2/2
Wed Apr 4, 2018 Randomized Algorithms 2. Review of Bloom filters, Locality sensitive hashing, Online LSH
Mon Apr 9, 2018 Graph Architectures for ML. Graph-based ML architectures, Pregel, Signal-collect, GraphLab, PowerGraph, GraphChi, GraphX
- Start work on Assignment 6: SSL in Spark
Wed Apr 11, 2018 SSL on Graphs. Semi-supervised learning intro, Multirank-walk SSL method, Harmonic fields, Modified Adsorption SSL method, MAD with countmin sketches
Mon Apr 16, 2018 LDA 1. DGMs for naive Bayes, Gibbs sampling for LDA
Wed Apr 18, 2018 LDA 2. Parallelizing LDA, Fast sampling for LDA, DGMs for graphs
Mon Apr 23, 2018 Parameter Servers. Parameter servers, PS vs Hadoop, State Synchronous Parallel (SSP) model, Managed Communication in PS, LDA Sampler with PS
- Previous assignment due
Wed Apr 25, 2018 Unsupervised Learning On Graphs. Spectral clustering, Power iteration clustering, Label propagation for clustering non-graph data, Label propagation for SSL on non-graph data
Mon Apr 30, 2018 Review session for final.
Wed May 2, 2018 Final Exam.

@@ Line 21: / Line 21: @@
 * Homeworks, unless otherwise posted, will be due when the next HW comes out.
 * Lecture notes and/or slides will be (re)posted around the time of the lectures.
+=== Schedule ===
 * Wed Jan 17, 2018 [[Class meeting for 10-405 Overview|Overview]].  Grading policies and etc, History of Big Data, Complexity theory and cost of important operations
@@ Line 36: / Line 38: @@
 * Wed Feb 14, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 1]].  The "delta trick", Averaged perceptrons, Debugging ML algorithms
 * Mon Feb 19, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 2]].  Hash kernels, Ranking perceptrons, Structured perceptrons
-** '''Start work on''' Assignment 3: scalable SGD; http://www.cs.cmu.edu/~wcohen/10-405/hw-3.pdf
+** '''Start work on''' Assignment 3: scalable SGD; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw3.pdf
 * Wed Feb 21, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 3]].  Iterative parameter mixing paper, Parallel SGD via Param Mixing
 * Mon Feb 26, 2018 [[Class meeting for 10-405 SGD for MF|SGD for MF]].  Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD

Difference between revisions of "Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018"

Revision as of 10:47, 21 February 2018

Ideas for extensions to the HW assignments

Notes

Schedule

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools