Difference between revisions of "Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018"

From Cohen Courses
 
(18 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
This is the syllabus for [[Machine Learning with Large Datasets 10-405 in Spring 2018]].   
 
This is the syllabus for [[Machine Learning with Large Datasets 10-405 in Spring 2018]].   
 +
 +
== Ideas for open-ended extensions to the HW assignments ==
 +
 +
This is not a complete list! You can use any of these as a starting point, but feel free to think up your own extensions.
 +
Any open-ended extensions must be submitted no later than '''midnight May 6''' to be considered for grading.
 +
 +
HW2 (NB in GuineaPig):
 +
 +
* The assignment proposes one particular scheme for parallelizing the training/testing algorithm.  Consider another parallelization algorithm.
 +
* Implement a similarly scalable Rocchio algorithm and compare it with NB.
 +
* Reimplement the same algorithm in Spark (or some other dataflow language) and compare.
 +
* One of the extensions to GuineaPig not discussed in class is an in-memory map-reduce system.  Design an experiment that makes constructive use of it.
 +
 +
HW3 (Logistic regression and SGD)
 +
* Evaluate the hash trick for Naive Bayes systematically on a series of datasets.
 +
* Implement a parameter-mixing version of logistic regression and evaluate it.
 +
* A [https://www.aclweb.org/anthology/P12-2018 recent paper] proposes (roughly) using SVM with NB-transformed features.  Implement this and compare.
 +
* The personalization method described in class is based on [https://www.umiacs.umd.edu/~hal/docs/daume07easyadapt.pdf a transfer learning method] which works similarly. Many Wikipedia pages are available in multiple languages, and words in related languages tend to be lexically similar (eg, "astrónomo" is Spanish for "astronomer").  Suppose features were character n-grams (eg "astr", "stro", "tron", ...) - does domain transfer work for the task of classifying Wikipedia pages?  Construct a dataset and experiment to test this hypothesis.
 +
 +
HW4/5 (Autodiff)
 +
* Find a recent paper which includes data and an implementation for a non-trivial neural model in a standard framework (eg PyTorch or TensorFlow), and port the model to your autodiff framework.
 +
* On a machine with multiple CPUs, use the <code>multiprocessing</code> and <code>multiprocessing.pool</code> framework to parallelize gradient computation on CPUs.  The architecture you build should stream through the data and construct multiple tasks, each of which requires a worker to perform a minibatch update and then broadcast the parameter updates to another worker that will accumulate them.  (So this system would be doing delayed SGD on minibatches.)  Perform some experiments to see how performance (time and accuracy!) changes as you increase the number of processors.  What would be the advantages of this sort of architecture over a GPU-based one?
 +
 +
HW6 (SSL):
 +
* Implement the optimization for modified adsorption (MAD) and compare.
 +
* Implement the sketch-based approach for SSL described in the paper below and compare: Partha Pratim Talukdar and William W. Cohen (2014): Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch. In AISTATS 2014.
  
 
=== Notes ===
 
=== Notes ===
Line 5: Line 31:
 
* Homeworks, unless otherwise posted, will be due when the next HW comes out.
 
* Homeworks, unless otherwise posted, will be due when the next HW comes out.
 
* Lecture notes and/or slides will be (re)posted around the time of the lectures.
 
* Lecture notes and/or slides will be (re)posted around the time of the lectures.
 +
 +
=== Schedule ===
  
 
* Wed Jan 17, 2018 [[Class meeting for 10-405 Overview|Overview]].  Grading policies etc., History of Big Data, Complexity theory and cost of important operations
 
* Wed Jan 17, 2018 [[Class meeting for 10-405 Overview|Overview]].  Grading policies etc., History of Big Data, Complexity theory and cost of important operations
 
* Mon Jan 22, 2018 [[Class meeting for 10-405 Probability Review|Probability Review]].  Counting for big data and density estimation, streaming Naive Bayes, Rocchio and TFIDF
 
* Mon Jan 22, 2018 [[Class meeting for 10-405 Probability Review|Probability Review]].  Counting for big data and density estimation, streaming Naive Bayes, Rocchio and TFIDF
** '''Start work on''' Assignment 1a: Streaming NB; Draft at https://autolab.andrew.cmu.edu/courses/10405-s18/assessments/hw1astreamingnaivebayes/writeup
+
** '''Start work on''' Assignment 1a: Streaming NB; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1a.pdf
 
* Wed Jan 24, 2018 [[Class meeting for 10-405 Streaming Naive Bayes|Streaming Naive Bayes]].  Notes on scalable naive bayes, Alternatives to stream and sort, Local counting in stream and sort, Stream and sort examples
 
* Wed Jan 24, 2018 [[Class meeting for 10-405 Streaming Naive Bayes|Streaming Naive Bayes]].  Notes on scalable naive bayes, Alternatives to stream and sort, Local counting in stream and sort, Stream and sort examples
 
* Mon Jan 29, 2018 [[Class meeting for 10-405 Hadoop Overview|Hadoop Overview]].  Intro to Hadoop, Hadoop Streaming, Debugging Hadoop, Combiners
 
* Mon Jan 29, 2018 [[Class meeting for 10-405 Hadoop Overview|Hadoop Overview]].  Intro to Hadoop, Hadoop Streaming, Debugging Hadoop, Combiners
** '''Start work on''' Assignment 1b: Streaming NB on Hadoop; Draft at https://autolab.andrew.cmu.edu/courses/10405-s18/assessments/hw1bhadoopnaivebayes/writeup
+
** '''Start work on''' Assignment 1b: Streaming NB on Hadoop; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1b.pdf
 
* Wed Jan 31, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 1]].  Scalable classification, Abstracts for map-reduce algorithms, Joins in Hadoop
 
* Wed Jan 31, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 1]].  Scalable classification, Abstracts for map-reduce algorithms, Joins in Hadoop
 
* Mon Feb 5, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 2]].  Guinea Pig intro, Similarity joins, Similarity joins with TFIDF, Parallel simjoins
 
* Mon Feb 5, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 2]].  Guinea Pig intro, Similarity joins, Similarity joins with TFIDF, Parallel simjoins
** '''Start work on''' Assignment 2: Naive bayes testing in Guinea Pig; Draft at https://autolab.andrew.cmu.edu/courses/10405-s18/assessments/hw2anbwithguineapig/writeup
+
** '''Start work on''' Assignment 2a: Naive bayes training in Guinea Pig; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2a.pdf
 
* Wed Feb 7, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 3]].  PageRank, PageRank in Pig and Guinea Pig, K-means in Pig, Spark, Systems built on top of Hadoop
 
* Wed Feb 7, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 3]].  PageRank, PageRank in Pig and Guinea Pig, K-means in Pig, Spark, Systems built on top of Hadoop
 
* Mon Feb 12, 2018 [[Class meeting for 10-405 SGD and Hash Kernels|SGD and Hash Kernels]].  Learning as optimization, Logistic regression with SGD, Regularized SGD, Efficient regularized SGD, Hash kernels for logistic regression
 
* Mon Feb 12, 2018 [[Class meeting for 10-405 SGD and Hash Kernels|SGD and Hash Kernels]].  Learning as optimization, Logistic regression with SGD, Regularized SGD, Efficient regularized SGD, Hash kernels for logistic regression
 +
** '''Start work on''' Assignment 2b:  Naive bayes testing in Guinea Pig. http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2b.pdf
 
* Wed Feb 14, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 1]].  The "delta trick", Averaged perceptrons, Debugging ML algorithms
 
* Wed Feb 14, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 1]].  The "delta trick", Averaged perceptrons, Debugging ML algorithms
** '''Start work on''' Assignment 3: scalable SGD; Draft at http://www.cs.cmu.edu/~wcohen/10-405/assignments/2016-fall/hw-3-sga-logreg/main.pdf
 
 
* Mon Feb 19, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 2]].  Hash kernels, Ranking perceptrons, Structured perceptrons
 
* Mon Feb 19, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 2]].  Hash kernels, Ranking perceptrons, Structured perceptrons
 +
** '''Start work on''' Assignment 3: scalable SGD; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw3.pdf
 
* Wed Feb 21, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 3]].  Iterative parameter mixing paper, Parallel SGD via Param Mixing
 
* Wed Feb 21, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 3]].  Iterative parameter mixing paper, Parallel SGD via Param Mixing
 
* Mon Feb 26, 2018 [[Class meeting for 10-405 SGD for MF|SGD for MF]].  Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD
 
* Mon Feb 26, 2018 [[Class meeting for 10-405 SGD for MF|SGD for MF]].  Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD
* Wed Feb 28, 2018 [[Class meeting for 10-405 Guest lecture - tentative|Guest lecture - tentative]].
+
* Wed Feb 28, 2018 [[Class meeting for 10-405 Guest lecture - tentative|Guest lecture]] -  [http://www.cs.cmu.edu/~kijungs/ Kijung Shin]
** '''Last assignment due'''
 
 
* Mon Mar 5, 2018 [[Class meeting for 10-405 Midterm review and catchup|Midterm review and catchup]].  Midterm review
 
* Mon Mar 5, 2018 [[Class meeting for 10-405 Midterm review and catchup|Midterm review and catchup]].  Midterm review
 +
** '''Previous assignment due'''
 
* Wed Mar 7, 2018 [[Class meeting for 10-405 Midterm|Midterm]].   
 
* Wed Mar 7, 2018 [[Class meeting for 10-405 Midterm|Midterm]].   
* Mon Mar 19, 2018 [[Class meeting for 10-405 Computing with GPUs|Computing with GPUs]].  Introduction to GPUs, CUDA, Vectorization
+
* Mon Mar 19, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 1]].  BackProp following Nielsen, Deep learning and GPUs, Reverse-mode differentiation (autodiff)
* Wed Mar 21, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 1]].  Deep learning intro, BackProp following Nielsen, Expressiveness of MLPs, Deep learning and GPUs, Exploding and vanishing gradients, Modern deep learning models
+
* Wed Mar 21, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 2]].  Reverse-mode differentiation (autodiff), Some systems using autodiff, Details on Wengert lists, Breakdown of xman.py, Inputs, parameters, updates
* Mon Mar 26, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 2]].  Reverse-mode differentiation (autodiff), Some systems using autodiff, Details on Wengert lists, Breakdown of xman.py
+
** '''Start work on''' Assignment 4: Autodiff with IPM part 1/2; Draft at http://www.cs.cmu.edu/~wcohen/10-605/assignments/2017-fall/hw-4+5-autodiff/hw4/main.pdf
** '''Start work on''' Assignment 4: Autodiff with IPM part 1/2; Draft at http://www.cs.cmu.edu/~wcohen/10-405/assignments/2016-fall/hw-5-autodiff/main.pdf
+
* Mon Mar 26, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 3]].  Expressiveness of MLPs, Exploding and vanishing gradients, Modern deep learning models, Word2vec and GloVe, Recursive ANNs, Convolutional ANNs, Architectures using RNNs
* Wed Mar 28, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 3]].  Inputs, parameters, updates, Word2vec and GloVe, Recursive ANNs, Convolutional ANNs, Architectures using RNNs
+
* Wed Mar 28, 2018 [[Class meeting for 10-405 Computing with GPUs|Computing with GPUs]].  Introduction to GPUs, CUDA
 +
* Fri Mar 30, 2018
 +
** '''Start work on''' Assignment 5: Autodiff with IPM part 2/2
 
* Mon Apr 2, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 1]].  Bloom filters, The countmin sketch, CM Sketches in Deep Learning
 
* Mon Apr 2, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 1]].  Bloom filters, The countmin sketch, CM Sketches in Deep Learning
 +
** '''HW 4 is due'''
 
* Wed Apr 4, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 2]].  Review of Bloom filters, Locality sensitive hashing, Online LSH
 
* Wed Apr 4, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 2]].  Review of Bloom filters, Locality sensitive hashing, Online LSH
** '''Start work on''' Assignment 5: Autodiff with IPM part 2/2
 
 
* Mon Apr 9, 2018 [[Class meeting for 10-405 Graph Architectures for ML|Graph Architectures for ML]].  Graph-based ML architectures, Pregel, Signal-collect, GraphLab, PowerGraph, GraphChi, GraphX
 
* Mon Apr 9, 2018 [[Class meeting for 10-405 Graph Architectures for ML|Graph Architectures for ML]].  Graph-based ML architectures, Pregel, Signal-collect, GraphLab, PowerGraph, GraphChi, GraphX
 
* Wed Apr 11, 2018 [[Class meeting for 10-405 SSL on Graphs|SSL on Graphs]].  Semi-supervised learning intro, Multirank-walk SSL method, Harmonic fields, Modified Adsorption SSL method, MAD with countmin sketches
 
* Wed Apr 11, 2018 [[Class meeting for 10-405 SSL on Graphs|SSL on Graphs]].  Semi-supervised learning intro, Multirank-walk SSL method, Harmonic fields, Modified Adsorption SSL method, MAD with countmin sketches
 +
** '''Start work on''' Assignment 6: SSL on a graph in Spark (possibly using NELL data)
 
* Mon Apr 16, 2018 [[Class meeting for 10-405 LDA|LDA 1]].  DGMs for naive Bayes, Gibbs sampling for LDA
 
* Mon Apr 16, 2018 [[Class meeting for 10-405 LDA|LDA 1]].  DGMs for naive Bayes, Gibbs sampling for LDA
 
* Wed Apr 18, 2018 [[Class meeting for 10-405 LDA|LDA 2]].  Parallelizing LDA, Fast sampling for LDA, DGMs for graphs
 
* Wed Apr 18, 2018 [[Class meeting for 10-405 LDA|LDA 2]].  Parallelizing LDA, Fast sampling for LDA, DGMs for graphs
* Mon Apr 23, 2018 [[Class meeting for 10-405 Parameter Servers|Parameter Servers]].  Parameter servers, PS vs Hadoop, Stale Synchronous Parallel (SSP) model, Managed Communication in PS, LDA Sampler with PS
+
* Mon Apr 23, 2018 [[Class meeting for 10-405 Unsupervised Learning On Graphs|Unsupervised Learning On Graphs]].  Spectral clustering, Power iteration clustering, Label propagation for clustering non-graph data, Label propagation for SSL on non-graph data
* Wed Apr 25, 2018 [[Class meeting for 10-405 Unsupervised Learning On Graphs|Unsupervised Learning On Graphs]].  Spectral clustering, Power iteration clustering, Label propagation for clustering non-graph data, Label propagation for SSL on non-graph data
+
* Wed Apr 25, 2018 [[Class meeting for 10-405 Parameter Servers|Parameter Servers]].  Parameter servers, PS vs Hadoop, Stale Synchronous Parallel (SSP) model, Managed Communication in PS, LDA Sampler with PS
 +
** '''Last assignment due'''
 
* Mon Apr 30, 2018 [[Class meeting for 10-405 Review session for final|Review session for final]].   
 
* Mon Apr 30, 2018 [[Class meeting for 10-405 Review session for final|Review session for final]].   
 
* Wed May 2, 2018 [[Class meeting for 10-405 Final Exam|Final Exam]].
 
* Wed May 2, 2018 [[Class meeting for 10-405 Final Exam|Final Exam]].

Latest revision as of 13:46, 30 April 2018

This is the syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018.

Ideas for open-ended extensions to the HW assignments

This is not a complete list! You can use any of these as a starting point, but feel free to think up your own extensions. Any open-ended extensions must be submitted no later than midnight May 6 to be considered for grading.

HW2 (NB in GuineaPig):

  • The assignment proposes one particular scheme for parallelizing the training/testing algorithm. Consider another parallelization algorithm.
  • Implement a similarly scalable Rocchio algorithm and compare it with NB.
  • Reimplement the same algorithm in Spark (or some other dataflow language) and compare.
  • One of the extensions to GuineaPig not discussed in class is an in-memory map-reduce system. Design an experiment that makes constructive use of it.
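
As a starting point for the Rocchio comparison, here is a minimal single-machine sketch of a Rocchio-style (nearest-centroid) classifier over TF-IDF vectors. It is not the course's reference implementation: the function names, the log-scaled TF weighting, and the smoothed IDF formula are all assumptions chosen for illustration, and a scalable version would compute the document frequencies and centroid sums with map-reduce style aggregation instead of in-memory dictionaries.

```python
import math
from collections import Counter, defaultdict


def unit(vec):
    """Scale a sparse vector (dict) to unit length; all-zero vectors pass through."""
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def tfidf(tokens, idf):
    """Log-scaled TF times IDF (an assumed weighting); unseen tokens get weight 0."""
    return unit({t: math.log(1 + c) * idf.get(t, 0.0)
                 for t, c in Counter(tokens).items()})


def train_rocchio(docs, labels):
    """docs: list of token lists. Returns (centroid per class, idf table)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    sums = defaultdict(lambda: defaultdict(float))
    for doc, y in zip(docs, labels):
        for t, v in tfidf(doc, idf).items():
            sums[y][t] += v  # per-class sum; normalising below gives the centroid direction
    return {y: unit(s) for y, s in sums.items()}, idf


def predict(tokens, centroids, idf):
    """Pick the class whose centroid has the largest cosine similarity."""
    v = tfidf(tokens, idf)
    return max(centroids,
               key=lambda y: sum(w * centroids[y].get(t, 0.0) for t, w in v.items()))
```

The two aggregation steps (document frequencies, then per-class sums) are both count-like reductions keyed by token or by (class, token), which is what makes the method a natural fit for the same dataflow style used for NB.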

HW3 (Logistic regression and SGD)

  • Evaluate the hash trick for Naive Bayes systematically on a series of datasets.
  • Implement a parameter-mixing version of logistic regression and evaluate it.
  • A recent paper proposes (roughly) using SVM with NB-transformed features. Implement this and compare.
  • The personalization method described in class is based on a transfer learning method which works similarly. Many Wikipedia pages are available in multiple languages, and words in related languages tend to be lexically similar (eg, "astrónomo" is Spanish for "astronomer"). Suppose features were character n-grams (eg "astr", "stro", "tron", ...) - does domain transfer work for the task of classifying Wikipedia pages? Construct a dataset and experiment to test this hypothesis.
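
For the character n-gram idea, the feature extraction might be sketched as follows. This is only an illustration under stated assumptions: the function names, the bucket count, and the use of zlib.crc32 as the hash are hypothetical choices, and the "shared copy plus domain-specific copy" of each feature follows the feature-augmentation idea in the transfer learning paper linked above, with the page's language playing the role of the domain.

```python
import zlib


def char_ngrams(word, n=4):
    """All contiguous character n-grams of a word, e.g. 'astr', 'stro', 'tron', ..."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def hashed_features(text, domain, num_buckets=2 ** 20, n=4):
    """Sparse count vector (dict: bucket index -> count) using the hash trick.

    Each n-gram is emitted twice: once as a shared feature and once prefixed
    with the domain (here, a language code), so a linear model can learn both
    cross-language and language-specific weights.
    """
    counts = {}
    for word in text.lower().split():
        for gram in char_ngrams(word, n):
            for key in (gram, domain + ":" + gram):
                # crc32 is stable across runs, unlike Python's built-in hash()
                idx = zlib.crc32(key.encode("utf-8")) % num_buckets
                counts[idx] = counts.get(idx, 0) + 1
    return counts
```

For example, `char_ngrams("astronomer")` and `char_ngrams("astrónomo")` share the n-gram "astr", so the shared copy of that feature is active for both the English and the Spanish page.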

HW4/5 (Autodiff)

  • Find a recent paper which includes data and an implementation for a non-trivial neural model in a standard framework (eg PyTorch or TensorFlow), and port the model to your autodiff framework.
  • On a machine with multiple CPUs, use the multiprocessing and multiprocessing.pool framework to parallelize gradient computation on CPUs. The architecture you build should stream through the data and construct multiple tasks, each of which requires a worker to perform a minibatch update and then broadcast the parameter updates to another worker that will accumulate them. (So this system would be doing delayed SGD on minibatches.) Perform some experiments to see how performance (time and accuracy!) changes as you increase the number of processors. What would be the advantages of this sort of architecture over a GPU-based one?
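
One possible shape for the delayed-SGD architecture is sketched below. It is a simplification, not the intended solution: the model is plain logistic regression rather than an autodiff network, the calling process acts as the accumulator instead of a dedicated worker, and the names delayed_sgd, minibatch_gradient, run_demo and the in_flight parameter are hypothetical. The demo uses ThreadPool from multiprocessing.pool because it has the same interface as a process pool and is easier to debug.

```python
import math
from multiprocessing.pool import ThreadPool


def minibatch_gradient(task):
    """Worker: mean logistic-loss gradient on one minibatch, evaluated at the
    (possibly stale) weight snapshot shipped with the task."""
    weights, batch = task
    grad = [0.0] * len(weights)
    for x, y in batch:
        z = sum(w * xi for w, xi in zip(weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        p = 1.0 / (1.0 + math.exp(-z))
        for j, xi in enumerate(x):
            grad[j] += (p - y) * xi / len(batch)
    return grad


def delayed_sgd(data, dim, pool, batch_size=8, epochs=20, lr=0.5, in_flight=4):
    """Stream over minibatches; up to `in_flight` tasks share one weight
    snapshot, so later gradients in a group are applied with a delay."""
    weights = [0.0] * dim
    for _ in range(epochs):
        batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
        for start in range(0, len(batches), in_flight):
            tasks = [(list(weights), b) for b in batches[start:start + in_flight]]
            # the accumulator (here, the caller) applies gradients as they arrive
            for grad in pool.imap_unordered(minibatch_gradient, tasks):
                weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


def run_demo(workers=2):
    """Tiny separable 1-D problem with a bias feature; returns [bias, slope]."""
    xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
    data = [([1.0, x], 1.0 if x > 0 else 0.0) for x in xs]
    with ThreadPool(workers) as pool:
        return delayed_sgd(data, dim=2, pool=pool, batch_size=2, epochs=50)
```

To get real multi-CPU parallelism, the same delayed_sgd function can be given a multiprocessing.Pool created under an `if __name__ == "__main__":` guard. The cost of pickling the weight snapshot into every task is then part of what the timing experiments would measure.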

HW6 (SSL):

  • Implement the optimization for modified adsorption (MAD) and compare.
  • Implement the sketch-based approach for SSL described in the paper below and compare: Partha Pratim Talukdar and William W. Cohen (2014): Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch. In AISTATS 2014.
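
The count-min sketch itself is small enough to show in full. The version below is a generic sketch of the data structure, not code from the paper: the class name, the default sizes, and the salted zlib.crc32 hashing are assumptions made for illustration (a careful implementation would use a pairwise-independent hash family). One plausible way to use it for SSL with many labels is to keep one such sketch per node in place of a dense vector of label scores.

```python
import zlib


class CountMinSketch:
    """Approximate non-negative counters in depth x width cells."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0.0] * width for _ in range(depth)]

    def _index(self, row, key):
        # Salting the key with the row number stands in for `depth`
        # different hash functions (an assumption made for simplicity).
        return zlib.crc32(f"{row}:{key}".encode("utf-8")) % self.width

    def add(self, key, amount=1.0):
        """Add a non-negative amount to the key's cell in every row."""
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += amount

    def estimate(self, key):
        """Minimum over rows: never below the true total, and above it only
        when the key collides with other keys in every row."""
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.depth))
```

Memory is width times depth per sketch regardless of how many distinct labels are inserted, which is the property that matters when the label set is large.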

Notes

  • Homeworks, unless otherwise posted, will be due when the next HW comes out.
  • Lecture notes and/or slides will be (re)posted around the time of the lectures.

Schedule