Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018

This is the syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018.

Ideas for open-ended extensions to the HW assignments

This is not a complete list! You can use any of these as a starting point, but feel free to think up your own extensions. Any open-ended extensions must be submitted no later than midnight on May 6 to be considered for grading.

HW2 (NB in GuineaPig):

  • The assignment proposes one particular scheme for parallelizing the training/testing algorithm. Consider another parallelization algorithm.
  • Implement a similarly scalable Rocchio algorithm and compare it with NB.
  • Reimplement the same algorithm in Spark (or some other dataflow language) and compare.
  • One of the extensions to GuineaPig not discussed in class is an in-memory map-reduce system. Design an experiment that makes constructive use of it.
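
As a possible starting point for the Rocchio extension above, here is a minimal single-machine sketch of a centroid-based (Rocchio-style) classifier. It is a simplification under stated assumptions: it uses length-normalized raw term counts rather than TF-IDF, it omits the negative-prototype term of the full Rocchio formula, and the function names (`normalize`, `train_centroids`, `classify`) are hypothetical. The per-(label, term) sums are additive, so they could in principle be accumulated with the same kind of map-reduce pass used for NB event counts.

```python
import math
from collections import Counter, defaultdict


def normalize(vec):
    """Scale a sparse vector (dict of term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else dict(vec)


def train_centroids(labeled_docs):
    """Sum the normalized document vectors per label, then normalize each sum."""
    sums = defaultdict(Counter)
    for label, tokens in labeled_docs:
        for term, weight in normalize(Counter(tokens)).items():
            sums[label][term] += weight  # additive, so it can be reduced by key
    return {label: normalize(total) for label, total in sums.items()}


def classify(centroids, tokens):
    """Return the label whose centroid has the highest cosine similarity."""
    doc = normalize(Counter(tokens))

    def score(label):
        centroid = centroids[label]
        return sum(w * centroid.get(t, 0.0) for t, w in doc.items())

    return max(centroids, key=score)
```

Comparing this with NB would mean running both on the same train/test split and reporting accuracy and wall-clock time.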

HW3 (Logistic regression and SGD):

  • Evaluate the hash trick for Naive Bayes systematically on a series of datasets.
  • Implement a parameter-mixing version of logistic regression and evaluate it.
  • A recent paper proposes (roughly) using SVM with NB-transformed features. Implement this and compare.
  • The personalization method described in class is based on a transfer learning method which works similarly. Many Wikipedia pages are available in multiple languages, and words in related languages tend to be lexically similar (e.g., "astrónomo" is Spanish for "astronomer"). Suppose features were character n-grams (e.g., "astr", "stro", "tron", ...). Does domain transfer work for the task of classifying Wikipedia pages? Construct a dataset and an experiment to test this hypothesis.
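
To make the hash trick and the character n-gram idea above concrete, here is a small sketch that extracts character 4-grams and hashes them into a fixed number of buckets. The names `char_ngrams` and `hashed_features`, the default bucket count, and the choice of `zlib.crc32` as the hash function are illustrative assumptions, not part of the assignment.

```python
import zlib
from collections import Counter


def char_ngrams(text, n=4):
    """All overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hashed_features(text, num_buckets=2 ** 18, n=4):
    """Map each n-gram to a bucket index and count the hits per bucket."""
    counts = Counter()
    for gram in char_ngrams(text.lower(), n):
        bucket = zlib.crc32(gram.encode("utf-8")) % num_buckets
        counts[bucket] += 1
    return counts


# Words that share an n-gram (here "astr") share at least one bucket.
english = hashed_features("astronomer")
spanish = hashed_features("astrónomo")
shared_buckets = set(english) & set(spanish)
```

Shrinking `num_buckets` increases collisions, which is one knob a systematic evaluation of the hash trick could vary.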

HW4/5 (Autodiff):

  • Find a recent paper which includes data and an implementation for a non-trivial neural model in a standard framework (e.g., PyTorch or TensorFlow), and port the model to your autodiff framework.
  • On a machine with multiple CPUs, use the multiprocessing and multiprocessing.pool modules to parallelize gradient computation. The architecture should stream through the data and construct multiple tasks, each of which requires a worker to perform a minibatch update and then send the parameter updates to another worker that accumulates them. (So this system would be doing delayed SGD on minibatches.) Perform some experiments to see how performance (both time and accuracy) changes as you increase the number of processors. What would be the advantages of this sort of architecture over a GPU-based one?
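
The pool-based architecture described above might be sketched roughly as follows. This is a simplified, synchronous approximation rather than the full design: gradients are accumulated in the parent process instead of a dedicated worker, the model is a plain logistic regression rather than your autodiff framework, and the names `minibatch_gradient` and `delayed_sgd` are hypothetical. The demo uses `ThreadPool` so it runs anywhere; `multiprocessing.Pool` offers the same `map` method and would be the choice for actually using multiple CPUs.

```python
import math
from multiprocessing.pool import ThreadPool


def minibatch_gradient(args):
    """Mean logistic-loss gradient of one minibatch at the given weights."""
    weights, batch = args
    grad = [0.0] * len(weights)
    for features, label in batch:
        score = sum(w * x for w, x in zip(weights, features))
        prob = 1.0 / (1.0 + math.exp(-score))
        for j, x in enumerate(features):
            grad[j] += (prob - label) * x
    return [g / len(batch) for g in grad]


def delayed_sgd(pool, batches, dim, workers=2, lr=0.5, epochs=20):
    """Dispatch `workers` minibatches at a time and apply their gradients."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for start in range(0, len(batches), workers):
            group = batches[start:start + workers]
            # Every task in the group sees the same weight snapshot, so all
            # but the first update in the group are applied with a delay.
            grads = pool.map(minibatch_gradient, [(weights, b) for b in group])
            for grad in grads:
                weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


if __name__ == "__main__":
    # Toy data: label is 1 when the first feature is positive.
    toy = [[([1.0, 1.0], 1), ([-1.0, 1.0], 0)],
           [([2.0, 1.0], 1), ([-2.0, 1.0], 0)]]
    with ThreadPool(2) as pool:
        print(delayed_sgd(pool, toy, dim=2))
```

Timing this loop for different values of `workers` is one way to start the time-versus-accuracy experiments.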

HW6 (SSL):

  • Implement the optimization for modified adsorption (MAD) and compare.
  • Implement the sketch-based approach for SSL described in the paper below and compare: Partha Pratim Talukdar and William W. Cohen (2014), "Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch," AISTATS 2014.
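
For orientation, a count-min sketch itself can be written in a few lines. The class below is a generic sketch of the data structure, not the algorithm from the paper: the class name, the default width and depth, and the use of MD5 to derive per-row hash functions are all assumptions made for illustration. In an SSL setting, the keys would be labels and the amounts would be non-negative label scores at a node.

```python
import hashlib


class CountMinSketch:
    """Approximate non-negative counts using depth x width counters."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0.0] * width for _ in range(depth)]

    def _bucket(self, row, key):
        # Salt the key with the row index to get a different hash per row.
        digest = hashlib.md5(f"{row}:{key}".encode("utf-8")).hexdigest()
        return int(digest, 16) % self.width

    def add(self, key, amount=1.0):
        for row in range(self.depth):
            self.table[row][self._bucket(row, key)] += amount

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum over rows
        # is an upper bound on the true total that is as tight as possible.
        return min(self.table[row][self._bucket(row, key)]
                   for row in range(self.depth))
```

Memory is fixed at `width * depth` counters regardless of how many distinct labels are added, which is the property that matters when the label set is large.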


  • Homeworks, unless otherwise posted, will be due when the next HW comes out.
  • Lecture notes and/or slides will be (re)posted around the time of the lectures.