Difference between revisions of "Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012"

Revision as of 11:46, 14 November 2011

This is the syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012.

January

Tues Jan 17. Overview of course, cost of various operations, asymptotic analysis.
Thurs Jan 19. Review of probabilities.
Tues Jan 24. Streaming algorithms and Naive Bayes.

Assignment: streaming Naive Bayes 1 (with feature counts in memory)

Thurs Jan 26. Naive Bayes and logistic regression.
Tues Jan 31. Streaming stochastic gradient descent.

Assignment: streaming logistic regression (with feature weights in memory)
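
As a rough illustration of what "streaming logistic regression with feature weights in memory" could look like, here is a minimal sketch. It assumes sparse examples given as feature-to-value dicts with labels in {0, 1}; the function names, the learning rate, and the absence of regularization are assumptions made for this example, not details of the actual assignment.

```python
import math
from collections import defaultdict


def sigmoid(z):
    # Clamp the score so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logreg(stream, lr=0.1):
    """One pass of SGD over an iterable of (features, label) pairs.

    features: dict mapping feature name -> value (hypothetical format)
    label: 0 or 1
    Only the weight table is kept in memory; examples are never stored.
    """
    w = defaultdict(float)
    for features, y in stream:
        p = sigmoid(sum(w[f] * v for f, v in features.items()))
        # Gradient step on the log-likelihood of this single example.
        for f, v in features.items():
            w[f] += lr * (y - p) * v
    return w


def predict(w, features):
    """Estimated probability that the label is 1."""
    return sigmoid(sum(w.get(f, 0.0) * v for f, v in features.items()))
```

Because the loop touches only the features present in the current example, the cost per example is proportional to the number of non-zero features rather than to the size of the vocabulary.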

February

Thurs Feb 2. The stream-and-sort design pattern; Naive Bayes.
Tues Feb 7. Messages and records; finding informative phrases.

Assignment: streaming Naive Bayes 2 (with feature counts on disk)

Thurs Feb 9. Map-reduce and Hadoop 1 (Alona lecture).
Tues Feb 14. Map-reduce and Hadoop 2 (Alona lecture).

Assignment: finding informative phrases
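
The stream-and-sort pattern named in the February lectures can be sketched roughly as follows: emit one small message per event, sort the messages so that equal keys become adjacent, then aggregate them in a single streaming pass. In this sketch the in-memory `sorted` call stands in for an external on-disk sort, and the (label, word) key format and function names are assumptions made for illustration only.

```python
from itertools import groupby


def emit_messages(docs):
    """Turn (label, text) pairs into a stream of (label, word) keys."""
    for label, text in docs:
        for word in text.split():
            yield (label, word)


def reduce_sorted(sorted_keys):
    """Count runs of identical adjacent keys.

    Requires sorted input; only one running count is held at a time.
    """
    for key, group in groupby(sorted_keys):
        yield key, sum(1 for _ in group)


def stream_and_sort_counts(docs):
    # sorted() is a stand-in for an external sort over a file of messages.
    messages = sorted(emit_messages(docs))
    return list(reduce_sorted(messages))
```

The point of the pattern is that neither the emitting step nor the reducing step needs a table of all feature counts in memory; the sort does the work of bringing matching keys together.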

Draft

Overviews [1 week]
- Lecture: Overview of course, cost of various operations, asymptotic analysis
- Lecture: Review of probabilities
Streaming Learning algorithms [2 weeks]
- Lecture: Naive Bayes, and a streaming implementation of it (features in memory).
  - Assignment: streaming Naive Bayes w/ features in memory
- Lecture: Naive Bayes and logistic regression.
- Lecture: SGD implementation of LogReg, with lazy regularization
  - Assignment: streaming LogReg w/ features in memory
- Lecture: other streaming methods - the perceptron algorithm and Rocchio's algorithm.
Stream-and-sort [1.5 weeks]
- Lecture: Naive Bayes when data's not in memory.
  - Assignment: stream-and-sort Naive Bayes (Twitter emoticon data?)
- Lecture: finding informative phrases (with vocab counts in memory).
- Lecture: messages and records; revisit finding informative phrases.
  - Assignment: finding informative phrases (Google books data)
Map-reduce and Hadoop [1 week]
- Lecture: Alona, Map-reduce
- Lecture: Alona, Hadoop and map-reduce
  - Assignment: finding informative phrases
Reducing memory usage with randomized methods [1.5 weeks]
- Lecture: Locality-sensitive hashing.
- Lecture: Bloom filters for counting events.
  - Assignment: LSH transformation of datasets
- Lecture: Vowpal Wabbit and the hashing trick.
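
To give a flavor of the randomized-methods unit above, here is a minimal Bloom filter sketch. It covers only set membership (has this event been seen?), which is simpler than the counting use mentioned in the lecture title. The bit-array size, the number of hash functions, and the use of seeded SHA-256 digests as hash functions are all assumptions chosen for this example.

```python
import hashlib


class BloomFilter:
    """Approximate set: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            data = f"{seed}:{item}".encode("utf-8")
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every bit for this item is set.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

Memory use is fixed by `num_bits` regardless of how many items are added; the price is that a membership test can occasionally return True for an item that was never added.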

Planned Topics

Weeks 6-7. Nearest-neighbor finding and bulk classification.
- Using a search engine to find approximate nearest neighbors.
- Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
- Implementing soft joins using map-reduce and nearest-neighbor methods.
- The local k-NN graph for a dataset.
- Assignment: Tool for approximate k-NN graph for a large dataset.

Weeks 8-10. Working with large graphs.
- PageRank and RWR/PPR.
- Special issues involved with iterative processing on graphs in Map-Reduce: the schimmy pattern.
  - Formalisms/environments for iterative processing on graphs: GraphLab, Spark, Pregel.
- Extracting small graphs from a large one:
  - LocalSpectral - finding the meaningful neighborhood of a query node in a large graph.
  - Visualizing graphs.
- Semi-supervised classification on graphs.
- Clustering and community-finding in graphs.
- Assignment: Snowball sampling a graph with LocalSpectral and visualizing the results.
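
For the PageRank topic listed above, a small in-memory power-iteration sketch may help fix ideas. It assumes the graph fits in a dict mapping each node to its list of out-neighbors, with every node present as a key; the damping factor of 0.85, the fixed iteration count, and the uniform treatment of dangling nodes are conventional choices assumed here. It does not address the Map-Reduce issues the unit is actually about.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration for PageRank on a small in-memory graph.

    graph: dict node -> list of out-neighbors (every node must be a key).
    Returns a dict node -> rank; ranks sum to 1.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node starts each round with the teleport mass.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = graph[v]
            if out:
                share = damping * rank[v] / len(out)
                for u in out:
                    new_rank[u] += share
            else:
                # Dangling node: spread its mass uniformly over all nodes.
                share = damping * rank[v] / n
                for u in nodes:
                    new_rank[u] += share
        rank = new_rank
    return rank
```

Each iteration is one pass over the edges, which is why the same computation maps naturally onto repeated Map-Reduce jobs when the graph is too large for memory.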

Week 11. Stochastic gradient descent and other streaming learning algorithms.
- SGD for logistic regression.
- SGD with large feature sets: delayed regularization-based updates; projection onto L1; truncated gradients.
- Assignment: Proposal for a one-month project.

Weeks 12-15. Additional topics.
- Scalable k-means clustering.
- Gibbs sampling and streaming LDA.
- Stacking and cascaded learning approaches.
- Decision tree learning for large datasets.
- Assignment: Writeup of project results.

@@ Line 15: / Line 15: @@
 Thus Feb 2. The stream-and-sort design pattern; Naive Bayes.
 Tues Feb 7. Messages and records; finding informative phrases.
-*** '''Assignment: streaming Naive Bayes 2 (with feature counts on disk)'''
+* '''Assignment: streaming Naive Bayes 2 (with feature counts on disk)'''
 Thus Feb 9. Map-reduce and Hadoop 1 (Alona lecture).
 Tues Feb 14. Map-reduce and Hadoop 2. (Alona lecture).
-*** '''Assignment: finding informative phrases'''
+* '''Assignment: finding informative phrases'''
 == Draft ==
