Difference between revisions of "Machine Learning with Large Datasets 10-605 in Spring 2012"

From Cohen Courses
 
== Instructor and Venue ==

* When/where: Tues/Thurs 1:30-2:50pm, NSH 1305

* Course Number: ML 10-605

* Prerequisites:
** a machine learning course (e.g., 10-701 or 10-601).  You may take this concurrently with the instructor's permission.
** Java programming skills, e.g., 15-210 or 15-214.

* TAs: [http://www.cs.cmu.edu/~afyshe/ Alona Fyshe] and [http://www.cs.cmu.edu/~nlao/ Ni Lao]

* Syllabus: [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012]]

* Office hours:
** Alona: 10am Wed
** Ni: 3-4:30pm Thurs, Hillman 5507
** William: 11am Fri

* You can post or answer questions at our [http://groups.google.com/group/machine-learning-with-large-datasets-10-605-in-spring-2012 course discussion group]
 
== Description ==

Large datasets are difficult to work with for several reasons.  They are difficult to visualize, and it is difficult to understand what sorts of errors and biases are present in them.  They are computationally expensive to process, and the cost of learning is often hard to predict - for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory.  Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.

This course is intended to provide students with practical knowledge of, and experience with, the issues involved in working with large datasets.  Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements for learning methods, such as feature hashing and Bloom filters; and techniques for analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity.
  
The class will include frequent programming assignments, and a one-month short project chosen by the student.  The project should be relevant to the course - e.g., comparing the scalability of variant learning algorithms on several datasets.

== Syllabus ==

* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012]]
== Prerequisites ==

An introductory course in machine learning, like 10-601 or 10-701, is a prerequisite or a co-requisite.  If you plan to take this course and 10-601 concurrently, please tell the instructor.

The course will include several substantial programming assignments, so an additional prerequisite is 15-210, or 15-214, or comparable familiarity with Java and good programming skills.

Undergraduates need permission of the instructor to enroll.
 
== Planned Topics ==
 
 
''Draft - subject to change!''
 
 
* Week 1. Overview of course, and overview lecture on probabilities.
 
 
* Week 2. Streaming learning algorithms.
 
** Naive Bayes for discrete data. 
 
** A streaming-data implementation of Naive Bayes.
 
** A streaming-data implementation of Naive Bayes assuming a larger-than-memory feature set, by using the 'stream and sort' pattern.
 
** Discussion of other streaming learning methods.
 
*** Rocchio
 
*** Perceptron-style algorithms
 
*** Streaming regression?
 
** '''Assignment''': two implementations of Naive Bayes, one with feature-weights in memory, one purely streaming.
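
The first of these two variants can be sketched in a few lines.  This is an illustrative single-machine sketch (the function names are ours, not from the assignment): a single pass over the training stream keeps all event counters in in-memory dictionaries, and classification applies Laplace smoothing.

```python
from collections import defaultdict
from math import log

def train_nb(examples):
    """One streaming pass over (label, words) pairs, counters kept in memory."""
    label_counts = defaultdict(int)
    word_counts = defaultdict(int)   # (label, word) -> count
    vocab = set()
    for label, words in examples:
        label_counts[label] += 1
        for w in words:
            word_counts[(label, w)] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab

def classify_nb(model, words, alpha=1.0):
    """Pick argmax_y of log P(y) + sum_w log P(w|y), with Laplace smoothing."""
    label_counts, word_counts, vocab = model
    n = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for y, cy in label_counts.items():
        total_y = sum(c for (lab, _), c in word_counts.items() if lab == y)
        score = log(cy / n)
        for w in words:
            score += log((word_counts[(y, w)] + alpha) /
                         (total_y + alpha * len(vocab)))
        if score > best_score:
            best, best_score = y, score
    return best
```

The purely streaming variant replaces the in-memory dictionaries with emitted (event, count) messages that are sorted and merged externally, in the stream-and-sort style discussed below.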
 
 
* Week 3. Examples of more complex programs using stream-and-sort.
 
** Lecture topics:
 
*** Finding informative phrases in a corpus, and finding polar phrases in a corpus.
 
*** Using records and messages to manage a complex dataflow.
 
** '''Assignment''': phrase-finding and sentiment classification
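
The stream-and-sort pattern behind these assignments can be illustrated with word counting.  In this hedged sketch, Python's built-in sorted() stands in for an external Unix sort over message files; after sorting, all messages for a key are adjacent, so one sequential scan with constant working memory can aggregate them.

```python
def map_phase(docs):
    """Emit one (word, 1) message per token - small constant memory."""
    for doc in docs:
        for token in doc.split():
            yield (token, 1)

def sort_and_reduce(messages):
    """Sorting brings all messages for a key together, so a single
    sequential scan can aggregate them with O(1) working memory."""
    counts = []
    current, total = None, 0
    for key, val in sorted(messages):   # stands in for an external Unix sort
        if key != current:
            if current is not None:
                counts.append((current, total))
            current, total = key, 0
        total += val
    if current is not None:
        counts.append((current, total))
    return counts
```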
 
 
* Week 4. The map-reduce paradigm and Hadoop.
 
** '''Assignment''': Hadoop re-implementation of assignments 1/2.
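
The map-reduce dataflow itself can be modeled in a few lines.  This toy single-process sketch (our own naming, not the Hadoop API) shows the three phases - map, group-by-key shuffle, and reduce - with word count as the canonical example.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Toy single-process model of the Hadoop dataflow:
    map -> shuffle (group by key) -> reduce."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):          # map phase
            groups[key].append(value)            # shuffle/group phase
    return {key: reducer(key, values) for key, values in groups.items()}

# word count, the canonical example
wordcount = map_reduce(
    ["to be or not to be"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda key, values: sum(values),
)
```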
 
 
* Week 5. Reducing memory usage with randomized methods.
 
** Feature hashing and Vowpal Wabbit.
 
** Bloom filters for counting events.
 
** Locality-sensitive hashing.
 
** '''Assignment''': memory-efficient Naive Bayes.
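
As one example of these randomized methods, a Bloom filter answers set-membership queries in fixed memory, allowing occasional false positives but never false negatives.  A minimal sketch (the parameter choices are arbitrary, and zlib.crc32 with varying seeds stands in for properly independent hash functions):

```python
import zlib

class BloomFilter:
    """Set membership in fixed memory; may give false positives,
    never false negatives."""
    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        # seeded crc32 stands in for k independent hash functions
        for seed in range(self.n_hashes):
            yield zlib.crc32(("%d:%s" % (seed, item)).encode()) % self.n_bits

    def add(self, item):
        for h in self._positions(item):
            self.bits[h // 8] |= 1 << (h % 8)

    def __contains__(self, item):
        return all(self.bits[h // 8] & (1 << (h % 8))
                   for h in self._positions(item))
```

Feature hashing uses the same idea in reverse: hash each feature name directly to an index in a fixed-size weight vector, so no feature-to-index dictionary needs to be stored.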
 
 
* Week 6-7. Nearest-neighbor finding and bulk classification.
 
** Using a search engine to find approximate nearest neighbors.
 
** Using inverted indices to find approximate nearest neighbors or to perform bulk linear classification.
 
** Implementing soft joins using map-reduce and nearest-neighbor methods.
 
** The local k-NN graph for a dataset.
 
** '''Assignment''': Tool for approximate k-NN graph for a large dataset.
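
The inverted-index idea can be sketched briefly: index each document's tokens, then score only the documents that share at least one token with the query, rather than scanning the whole collection.  In this illustrative sketch raw token overlap stands in for the similarity measure; a real implementation would use TF-IDF-weighted scores.

```python
from collections import defaultdict

def build_index(docs):
    """token -> list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for t in set(tokens):
            index[t].append(doc_id)
    return index

def approx_neighbors(index, query_tokens, k=3):
    """Score only docs sharing at least one token with the query,
    instead of scanning the whole collection."""
    scores = defaultdict(int)
    for t in set(query_tokens):
        for doc_id in index.get(t, ()):
            scores[doc_id] += 1        # overlap count as a crude similarity
    return sorted(scores, key=scores.get, reverse=True)[:k]
```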
 
 
* Week 8-10. Working with large graphs.
 
** PageRank and RWR/PPR.
 
** Special issues involved with iterative processing on graphs in Map-Reduce: the schimmy pattern.
 
*** Formalisms/environments for iterative processing on graphs: GraphLab, Spark, Pregel.
 
** Extracting small graphs from a large one:
 
*** LocalSpectral - finding the meaningful neighborhood of a query node in a large graph.
 
*** Visualizing graphs.
 
** Semi-supervised classification on graphs.
 
** Clustering and community-finding in graphs.
 
** '''Assignments''': Snowball sampling a graph with LocalSpectral and visualizing the results.
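
PageRank itself is a short computation once the graph fits in memory; the map-reduce and schimmy discussions above are about doing the same update when it does not.  A hedged power-iteration sketch (uniform teleportation, dangling mass spread uniformly):

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration; links maps each node to its list of successors."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - d) / n for u in nodes}       # teleportation mass
        for u, outs in links.items():
            if outs:
                share = d * pr[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:                                   # dangling node:
                for v in nodes:                     # spread mass uniformly
                    nxt[v] += d * pr[u] / n
        pr = nxt
    return pr
```

RWR/PPR differs only in the teleportation step: instead of (1-d)/n everywhere, the teleport mass is concentrated on the query node(s).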
 
 
* Week 11. Stochastic gradient descent and other streaming learning algorithms.
 
** SGD for logistic regression.
 
** SGD with large feature sets: delayed regularization-based updates; projection onto the L1 ball; truncated gradients.
 
** '''Assignment''': Proposal for a one-month project.
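
A minimal sketch of SGD for logistic regression on a stream of sparse examples (the names and the optional L2 term are illustrative; the delayed-regularization and truncated-gradient refinements above modify this update so that untouched features need no work per example):

```python
from math import exp

def sgd_logreg(stream, lr=0.1, l2=0.0):
    """One pass of SGD over (features, label) pairs; features are sparse
    dicts, labels are 0/1. Only weights for seen features are touched."""
    w = {}
    for x, y in stream:
        z = sum(w.get(f, 0.0) * v for f, v in x.items())
        p = 1.0 / (1.0 + exp(-z))                  # predicted P(y=1|x)
        for f, v in x.items():
            # gradient step on log-loss, plus (optional) L2 shrinkage
            w[f] = w.get(f, 0.0) + lr * ((y - p) * v - l2 * w.get(f, 0.0))
    return w

def predict(w, x):
    z = sum(w.get(f, 0.0) * v for f, v in x.items())
    return 1.0 / (1.0 + exp(-z))
```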
 
 
* Weeks 12-15. Additional topics.
 
** Scalable k-means clustering.
 
** Gibbs sampling and streaming LDA.
 
** Stacking and cascaded learning approaches.
 
** Decision tree learning for large datasets.
 
** '''Assignment''': Writeup of project results.
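
For reference, the in-memory baseline that scalable k-means variants start from is Lloyd's algorithm; a distributed version parallelizes the assignment step and aggregates per-cluster sums and counts as sufficient statistics.  An illustrative sketch:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm over points given as tuples of floats."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
    return centers
```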
 
  
 
== Datasets ==

Some datasets will be provided by the instructors to use in the course.

* RCV2 - text classification dataset.
* Wikipedia links - page-page links for Wikipedia.
* Geographical names and places - data on places from GeoNames, Wikipedia, and geo-tagged Flickr images.
* NELL all-pairs data - NPs and the contexts they appear in on the web.
* Google n-grams.
* ?Million Song Database - audio signatures of songs with tags and meta-data.
* ?KDD search-engine queries.
== Projects ==

* [[Projects for Machine Learning with Large Datasets 10-605 in Spring 2012]]

''' Schedule of Presentations '''

* We first put all projects into two groups based on their topics, and then assign each group to a presentation day.

''' May 1 --- NELL, Tencent, and Twitter '''

* '''1 NELL KB classification with SV- -VO features''' by Guo Chen
* '''2 SVO relation classification with LogReg/SVM''' by Malcolm Greaves
* '''3 PIC co-clustering for verb-noun pairs''' by Philip Gianfortoni, Mahesh Joshi
* '''4 clickthrough rate prediction''' by Lingjuan Peng, Yijia Zhang
* '''5 Tencent Weibo link prediction''' by Rui Du, Tianle Huang
* '''6 Tencent Weibo social recommendation''' by Yifu Diao
* '''7 TwitLDA''' by Supreeth Selvaraj
* '''8 running PageRank on Twitter hashtags - GraphLab and MR''' by Yogesh Dalal, Dongzhen Piao
* '''9 Twitter spam''' by Elmer Garduno Hernandez

''' May 3 --- other topics '''

* '''1 SVO relation classification''' by Mridul Gupta, Anuroop Sriram, Mahaveer
* '''2 WSD trained on WP sense-page links''' by Andrew Rodriguez
* '''3 GPS smoothing for bus data''' by Eliot Knudsen
* '''4 memory-caching HDFS''' by Philip Brown
* '''5 stacked LSH-based predictors''' by Benjamin Eckart
* '''6 Netflix - ensemble of matrix factorization and kNN''' by Peng Zhang
* '''7 hybrid parallel SGD coordinate descent''' by Chong Tat Chua, Dongyeop Kang
* '''8 RCV1 hierarchical classification''' by Tarun Sharma
* '''9 brains and nouns''' by Seshadri Sridharan

Latest revision as of 11:34, 1 May 2012
