Machine Learning with Large Datasets 10-605

From Cohen Courses
Revision as of 12:30, 3 August 2011 by Wcohen

Description

Planned Topics

  • Week 1. Overview of course, and overview lecture on probabilities.
  • Week 2. Streaming learning algorithms.
    • Naive Bayes for discrete data.
    • A streaming-data implementation of Naive Bayes.
    • A streaming-data implementation of Naive Bayes for a larger-than-memory feature set, using the 'stream and sort' pattern. (Generate a stream of updates to event counters, then sort the stream so that all updates to a given counter are adjacent and can be summed locally.)
    • Other stream-able learning methods.
  • Week 3. Examples of more complex programs using stream-and-sort.
    • Finding informative phrases in a corpus and finding polar phrases in a corpus.
    • Using records and messages to manage a complex dataflow.
    • Time and disk-access analysis of programs.
  • Week 4. The map-reduce paradigm and Hadoop.
  • Week 5. Reducing memory usage with randomized methods.
    • Feature hashing and Vowpal Wabbit.
    • Bloom filters for counting events.
    • Locality-sensitive hashing.
  • Weeks 6-7. Nearest-neighbor finding and bulk classification.
    • Using a search engine to find approximate nearest neighbors.
    • Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
    • Implementing soft joins using map-reduce and nearest-neighbor methods.
    • The local k-NN graph for a dataset.
  • Weeks 8-10. Working with large graphs.
    • PageRank and random walk with restart (RWR) / personalized PageRank (PPR).
    • Special issues involved with iterative processing on graphs in map-reduce: the schimmy pattern.
      • Formalisms/environments for iterative processing on graphs: GraphLab, Spark, Pregel.
    • Extracting small graphs from a large one:
      • LocalSpectral - finding the meaningful neighborhood of a query node in a large graph.
      • Visualizing graphs.
    • Semi-supervised classification on graphs.
    • Clustering and community-finding in graphs.
  • Week 11. Stochastic gradient descent and other streaming learning algorithms.
  • Weeks 13-15. Additional topics.
    • Scalable k-means clustering.
    • Gibbs sampling and streaming LDA.
    • Stacking and cascaded learning approaches.
    • Decision tree learning for large datasets.
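
As a rough illustration of the 'stream and sort' pattern listed under Week 2, the sketch below counts Naive Bayes events without keeping a counter table in memory during the counting pass. It is a minimal sketch under stated assumptions, not the course's implementation: the function names (`emit_updates`, `sum_sorted`, `stream_and_sort_counts`), the counter-key format, and the toy data are all hypothetical, and Python's in-memory `sorted()` stands in for the external sort that a larger-than-memory feature set would actually need.

```python
from itertools import groupby
from operator import itemgetter


def emit_updates(examples):
    """Yield one (counter_key, 1) update per event; no counts are stored here."""
    for label, tokens in examples:
        yield (f"Y={label}", 1)                # class-count event
        for tok in tokens:
            yield (f"Y={label},W={tok}", 1)    # class/word co-occurrence event


def sum_sorted(updates):
    """Sum runs of adjacent updates that share a key; input must be sorted by key."""
    for key, group in groupby(updates, key=itemgetter(0)):
        yield key, sum(delta for _, delta in group)


def stream_and_sort_counts(examples):
    # Assumption: sorted() is a stand-in for an external (on-disk) sort.
    # The final dict is only for easy inspection of this toy example; a real
    # pipeline would stream the (key, total) pairs onward instead.
    return dict(sum_sorted(sorted(emit_updates(examples))))


# Hypothetical toy corpus: (label, tokens) pairs.
examples = [
    ("sports", ["ball", "game", "ball"]),
    ("politics", ["vote", "game"]),
    ("sports", ["game"]),
]
counts = stream_and_sort_counts(examples)
```

After sorting, every update to a given counter is adjacent, so the summing pass needs memory for only one counter at a time.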