Machine Learning with Large Datasets 10-605 in Spring 2014

From Cohen Courses
Revision as of 19:04, 13 January 2014

Instructor and Venue

  • Instructor: William Cohen, Machine Learning Dept and LTI
  • When/where: 1:30-2:50 MW Doherty Hall 1112 (not Hamburg Hall, as previously announced!)
  • Course Number: ML 10-605
  • Prerequisites:
    • a machine learning course (e.g., 10-701 or 10-601). You may take this concurrently with the instructor's permission.
    • Java programming skills, e.g., 15-210, or 15-214.
  • Course staff:
    • William Cohen
    • William Wang (ww@cmu.edu) - TA
    • Siddarth Varia (varias@cs.cmu.edu) - TA
    • Chun Chen (chunc@andrew.cmu.edu) - grader
    • Sandy Winkler (sandyw@cs.cmu.edu, GHC 8219) - course secretary
  • Syllabus: Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014
  • Office hours: TBA

Important virtual places

Description

Large datasets are difficult to work with for several reasons. They are difficult to visualize, and it is difficult to understand what sorts of errors and biases they contain. They are computationally expensive to process, and the cost of learning is often hard to predict - for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.

This course is intended to give students practical knowledge of, and experience with, the issues involved in working with large datasets. Among the issues considered are: scalable learning techniques, such as streaming machine learning; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements of learning methods, such as feature hashing and Bloom filters; and techniques for analyzing programs in terms of memory, disk usage, and (for parallel methods) communication complexity.
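To give a flavor of one of the techniques above: feature hashing maps string-valued features into a fixed-size vector by hashing, so a learner's memory use stays bounded no matter how large the vocabulary grows. Below is a minimal sketch in Java (the class name, bucket count, and use of String.hashCode are illustrative choices, not the course's reference implementation):

```java
public class HashedFeatures {
    // A fixed number of buckets bounds memory regardless of how many
    // distinct feature strings appear in the data stream.
    static final int DIM = 1 << 16;

    // Map a feature name to an index in [0, DIM).
    // Math.floorMod avoids negative indices from hashCode().
    static int bucket(String feature) {
        return Math.floorMod(feature.hashCode(), DIM);
    }

    // Accumulate a sparse example into a dense hashed count vector.
    // Colliding features simply add into the same bucket.
    static double[] hashExample(String[] features) {
        double[] x = new double[DIM];
        for (String f : features) {
            x[bucket(f)] += 1.0;
        }
        return x;
    }

    public static void main(String[] args) {
        double[] x = hashExample(
            new String[] {"word=large", "word=datasets", "word=large"});
        // Look up the hashed count for one feature.
        System.out.println(x[bucket("word=large")]);
    }
}
```

The tradeoff is that hash collisions merge unrelated features, but with enough buckets the effect on accuracy is typically small, while the memory bound is exact.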

The class will include frequent programming assignments and a short, one-month project chosen by the student. The project should be relevant to the course - e.g., comparing the scalability of variant learning algorithms on datasets.

Syllabus

I'm following the syllabus of previous versions of the class, below:

One difference: since this semester's class is so large, we will not have a course project component.

Prerequisites

An introductory course in machine learning, like 10-601 or 10-701, is a prerequisite or a co-requisite. If you plan to take this course and 10-601 concurrently please tell the instructor.

The course will include several substantial programming assignments, so an additional prerequisite is 15-210, or 15-214, or comparable familiarity with Java and good programming skills.

Self-assessment for students:

  • Students, especially graduate students, come to CMU with a variety of backgrounds, so formal course prereqs are hard to establish. There is a short self-assessment test covering the necessary background for 10-605. We recommend that all students take it before enrolling, to see whether they already have the background knowledge or whether they need to review and/or take additional courses. Note: this is the same self-assessment I used for 10-601 last fall, so you don't need to worry about it if you've taken 10-601 and done well. Also, Section 4 can be skipped.