Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012

From Cohen Courses

Revision as of 17:37, 20 October 2011

Schedule

  • Overviews [1 week]
    • Lecture: Overview of course, cost of various operations, asymptotic analysis
    • Lecture: Review of probabilities
  • Streaming learning algorithms [2 weeks]
    • Lecture: Naive Bayes, and a streaming implementation of it (features in memory).
      • Assignment: streaming Naive Bayes w/ features in memory
    • Lecture: Naive Bayes and logistic regression.
    • Lecture: SGD implementation of LogReg, with lazy regularization
      • Assignment: streaming LogReg w/ features in memory
    • Lecture: other streaming methods - the perceptron algorithm and Rocchio's algorithm.
  • Stream-and-sort [1.5 weeks]
    • Lecture: Naive Bayes when data's not in memory.
      • Assignment: stream-and-sort Naive Bayes (Twitter emoticon data?)
    • Lecture: finding informative phrases (with vocab counts in memory).
    • Lecture: messages and records; revisit finding informative phrases.
      • Assignment: finding informative phrases (Google books data)
  • Map-reduce and Hadoop [1 week]
    • Lecture: Alona, Map-reduce
    • Lecture: Alona, Hadoop and map-reduce
      • Assignment: finding informative phrases
  • Reducing memory usage with randomized methods [1.5 weeks]
    • Lecture: Locality-sensitive hashing.
    • Lecture: Bloom filters for counting events.
      • Assignment: LSH transformation of datasets
    • Lecture: Vowpal Wabbit and the hashing trick.
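
The stream-and-sort pattern that runs through the schedule above can be sketched in a few lines. This is a hypothetical minimal example (the function names and toy data are illustrative, not from the course materials): a mapper streams out one (key, value) message per observation, an external sort groups identical keys together, and a reducer sums each run, so no counter table ever has to fit in memory.

```python
from itertools import groupby
from operator import itemgetter

def mapper(docs):
    """Stream out one (label/word, 1) message per token; nothing is stored."""
    for label, text in docs:
        for word in text.split():
            yield (f"{label}/{word}", 1)

def reducer(sorted_messages):
    """Sum each run of identical keys; only one counter is live at a time."""
    for key, group in groupby(sorted_messages, key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

docs = [("pos", "good good movie"), ("neg", "bad movie")]
# sorted() stands in for an external sort (e.g. Unix `sort`) on real data.
counts = dict(reducer(sorted(mapper(docs))))
# counts["pos/good"] == 2, counts["neg/bad"] == 1
```

On disk-sized data the same mapper and reducer would be connected by a pipe through an external sort utility rather than Python's in-memory `sorted()`.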

Planned Topics

  • Week 6-7. Nearest-neighbor finding and bulk classification.
    • Using a search engine to find approximate nearest neighbors.
    • Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
    • Implementing soft joins using map-reduce and nearest-neighbor methods.
    • The local k-NN graph for a dataset.
    • Assignment: Tool for building an approximate k-NN graph for a large dataset.
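
As a toy illustration of the inverted-index idea above (a sketch under simplifying assumptions: token overlap as the similarity measure, tiny in-memory data, hypothetical function names): instead of scanning all pairs, score only the documents that share at least one token with the query.

```python
from collections import defaultdict, Counter

def build_index(docs):
    """Inverted index: token -> ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in set(text.split()):
            index[token].add(doc_id)
    return index

def approx_neighbors(query, index, k=2):
    """Score only documents sharing a token with the query, never all pairs."""
    scores = Counter()
    for token in set(query.split()):
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1          # shared-token count as crude similarity
    return [doc_id for doc_id, _ in scores.most_common(k)]

docs = {1: "big data course", 2: "machine learning course", 3: "cooking recipes"}
index = build_index(docs)
nbrs = approx_neighbors("large data learning course", index)
# nbrs contains docs 1 and 2 (the course documents), not doc 3
```

Replacing the overlap count with a TF-IDF-weighted score gives the search-engine variant; running the same candidate generation for every document at once is the soft-join / bulk-classification version.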
  • Week 8-10. Working with large graphs.
    • PageRank and RWR/PPR.
    • Special issues involved with iterative processing on graphs in Map-Reduce: the schimmy pattern.
      • Formalisms/environments for iterative processing on graphs: GraphLab, Spark, Pregel.
    • Extracting small graphs from a large one:
      • LocalSpectral - finding the meaningful neighborhood of a query node in a large graph.
      • Visualizing graphs.
    • Semi-supervised classification on graphs.
    • Clustering and community-finding in graphs.
    • Assignment: Snowball sampling a graph with LocalSpectral and visualizing the results.
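
The core of the PageRank material above is power iteration, which a small in-memory sketch can show (toy graph and function name are illustrative; on a large graph each iteration becomes one map-reduce pass over the edge list, which is where the schimmy pattern applies):

```python
def pagerank(edges, n, d=0.85, iters=50):
    """Power iteration for PageRank on an adjacency list {node: [out-neighbors]}.
    Assumes every node has at least one out-link (no dangling-node handling)."""
    rank = {u: 1.0 / n for u in range(n)}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in range(n)}
        for u, outs in edges.items():
            share = d * rank[u] / len(outs)
            for v in outs:                 # "map": each node sends rank to neighbors
                new[v] += share            # "reduce": shares are summed per target
        rank = new
    return rank

# Tiny 3-node cycle: by symmetry every rank converges to 1/3.
r = pagerank({0: [1], 1: [2], 2: [0]}, n=3)
```

RWR/PPR is the same iteration with the `(1 - d) / n` teleport mass concentrated on the query node instead of spread uniformly.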
  • Week 11. Stochastic gradient descent and other streaming learning algorithms.
    • SGD for logistic regression.
    • SGD with large feature sets: delayed regularization updates; projection onto the L1 ball; truncated gradients.
    • Assignment: Proposal for a one-month project.
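
The delayed-regularization idea above can be sketched as follows (a minimal illustration with hypothetical names, using L2 regularization and tiny toy data): instead of decaying every weight at every step, decay a weight only when its feature next appears, applying all the skipped decay steps at once, so each sparse example costs time proportional to its active features.

```python
import math
from collections import defaultdict

def sgd_logreg(examples, epochs=20, lr=0.1, lam=0.01):
    """SGD for L2-regularized logistic regression with lazy regularization."""
    w = defaultdict(float)
    last = defaultdict(int)       # step at which each weight was last touched
    t = 0
    for _ in range(epochs):
        for features, y in examples:     # y in {0, 1}; features: active names
            t += 1
            for f in features:           # catch up on the skipped decay steps
                w[f] *= (1 - lr * lam) ** (t - last[f] - 1)
                last[f] = t
            p = 1.0 / (1.0 + math.exp(-sum(w[f] for f in features)))
            for f in features:           # this step's decay plus the gradient
                w[f] = w[f] * (1 - lr * lam) + lr * (y - p)
    return w

data = [(["good"], 1), (["bad"], 0), (["good", "fun"], 1)]
w = sgd_logreg(data)
# w["good"] ends up positive and w["bad"] negative
```

A complete implementation would also flush the pending decay on all weights once at the end of training; the sketch omits that final pass.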
  • Weeks 12-15. Additional topics.
    • Scalable k-means clustering.
    • Gibbs sampling and streaming LDA.
    • Stacking and cascaded learning approaches.
    • Decision tree learning for large datasets.
    • Assignment: Writeup of project results.
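
The scalable k-means topic above fits the same map-reduce shape used throughout the course; here is a toy sketch of one Lloyd iteration in that shape (1-D points and all names are illustrative assumptions): map each point to its nearest centroid's id, then reduce each group to its mean.

```python
from itertools import groupby
from operator import itemgetter

def kmeans_step(points, centroids):
    """One Lloyd iteration in map-reduce shape."""
    # map: emit (nearest_centroid_id, point) pairs
    pairs = [(min(range(len(centroids)),
                  key=lambda i: abs(p - centroids[i])), p) for p in points]
    # shuffle/sort + reduce: average the points assigned to each centroid
    pairs.sort(key=itemgetter(0))
    new = list(centroids)
    for cid, group in groupby(pairs, key=itemgetter(0)):
        pts = [p for _, p in group]
        new[cid] = sum(pts) / len(pts)
    return new

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
cents = [0.0, 10.0]
for _ in range(5):
    cents = kmeans_step(points, cents)
# converges to roughly the two cluster means, 1.0 and 9.0
```

On a large dataset the map and reduce phases become Hadoop jobs, with the (small) centroid list broadcast to every mapper and the driver looping until the centroids stop moving.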