Machine Learning with Large Datasets 10-605

From Cohen Courses

Revision as of 12:30, 3 August 2011 by Wcohen


Description

Planned Topics

Week 1. Overview of course, and overview lecture on probabilities.

Week 2. Streaming learning algorithms.
- Naive Bayes for discrete data.
- A streaming-data implementation of Naive Bayes.
- A streaming-data implementation of Naive Bayes assuming a larger-than-memory feature set, by using the 'stream and sort' pattern. (Generate a stream of updates to event counters, then sort the stream so that all updates to the same counter are adjacent and can be summed locally.)
- Other stream-able learning methods.
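
The 'stream and sort' pattern above can be sketched in a few lines of Python. This is only an illustration, not the course's implementation: the function names and the event-key format (`Y=label`, `X=token,Y=label`) are made up for the example, and the in-memory `sorted` call stands in for an external, disk-based sort such as the Unix `sort` command, which is what would make the pattern work for a larger-than-memory feature set.

```python
from itertools import groupby


def emit_updates(examples):
    """Yield one (event, 1) update for every counter an example touches."""
    for label, tokens in examples:
        yield (f"Y={label}", 1)
        for tok in tokens:
            yield (f"X={tok},Y={label}", 1)


def stream_and_sort_counts(examples):
    """Sort the update stream by event key, then sum each run of equal keys."""
    # Assumption: sorted() is an in-memory stand-in for an external sort.
    updates = sorted(emit_updates(examples), key=lambda kv: kv[0])
    for event, group in groupby(updates, key=lambda kv: kv[0]):
        # All updates for one counter are now adjacent, so a single
        # running sum is the only state needed at any time.
        yield event, sum(delta for _, delta in group)


# Hypothetical toy data: (label, tokens) pairs.
examples = [
    ("spam", ["buy", "now"]),
    ("ham", ["meeting", "now"]),
    ("spam", ["buy", "cheap"]),
]
counts = dict(stream_and_sort_counts(examples))
```

After sorting, the summing step needs memory for only one counter at a time, regardless of how many distinct events the stream contains.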

Week 3. Examples of more complex programs using stream-and-sort.
- Finding informative phrases in a corpus and finding polar phrases in a corpus.
- Using records and messages to manage a complex dataflow.
- Time and disk-access analysis of programs.

Week 4. The map-reduce paradigm and Hadoop.

Week 5. Reducing memory usage with randomized methods.
- Feature hashing and Vowpal Wabbit.
- Bloom filters for counting events.
- Locality-sensitive hashing.
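
As a rough illustration of how a randomized method can bound memory, the sketch below shows feature hashing: every token is mapped to one of a fixed number of buckets, so the feature vector has a fixed size no matter how large the vocabulary grows. The function name, the bucket count, and the choice of `zlib.crc32` as the hash are assumptions made for this example; this is not how Vowpal Wabbit itself is implemented.

```python
import zlib


def hash_features(tokens, num_buckets=16):
    """Map a bag of tokens to a fixed-length count vector."""
    vec = [0.0] * num_buckets
    for tok in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(tok.encode("utf-8")) % num_buckets
        # Distinct tokens may collide in one bucket; that is the
        # accuracy cost paid for the fixed memory footprint.
        vec[index] += 1.0
    return vec


# Hypothetical usage on a toy document.
doc = ["large", "datasets", "need", "large", "memory"]
vec = hash_features(doc)
```

No dictionary from tokens to indices is ever stored, which is where the memory saving comes from.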

Weeks 6-7. Nearest-neighbor finding and bulk classification.
- Using a search engine to find approximate nearest neighbors.
- Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
- Implementing soft joins using map-reduce and nearest-neighbor methods.
- The local k-NN graph for a dataset.

Weeks 8-10. Working with large graphs.
- PageRank and random walk with restart (RWR) / personalized PageRank (PPR).
- Special issues involved with iterative processing on graphs in map-reduce: the schimmy pattern.
  - Formalisms/environments for iterative processing on graphs: GraphLab, Spark, Pregel.
- Extracting small graphs from a large one:
  - LocalSpectral - finding the meaningful neighborhood of a query node in a large graph.
  - Visualizing graphs.
- Semi-supervised classification on graphs.
- Clustering and community-finding in graphs.
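
To make the iterative nature of graph algorithms such as PageRank concrete, here is a minimal single-machine sketch of power iteration. It is an illustration under stated assumptions, not a map-reduce or schimmy implementation: the function name, the damping value of 0.85, the fixed iteration count, and the toy graph are all chosen for the example, and every link target is assumed to appear as a key of the adjacency dictionary.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Approximate PageRank scores by repeated propagation over the edges."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node gets the uniform "teleport" share first.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # Assumption: a node with no out-links spreads its rank evenly.
            targets = out_links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-node graph: node -> list of nodes it links to.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

Each pass reads the whole edge list once, which is why iterating this computation over a very large graph raises the special issues listed above. A random-walk-with-restart variant would send the teleport share back to a chosen query node instead of spreading it uniformly.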

Week 11. Stochastic gradient descent and other streaming learning algorithms.

Weeks 13-15. Additional topics.
- Scalable k-means clustering.
- Gibbs sampling and streaming LDA.
- Stacking and cascaded learning approaches.
- Decision tree learning for large datasets.

Retrieved from "http://curtis.ml.cmu.edu/w/courses/index.php?title=Machine_Learning_with_Large_Datasets_10-605&oldid=5635"
