Machine Learning with Large Datasets 10-605 in Spring 2015

From Cohen Courses

Instructor and Venue

  • Instructor: William Cohen, Machine Learning Dept and LTI
  • When/where: TR 10:30-11:50am in BH A51
  • Course Number: ML 10-605 and 10-805
  • Prerequisites:
    • a machine learning course (e.g., 10-701 or 10-601). You may take this concurrently with the instructor's permission.
    • Java programming skills, e.g., 15-210, or 15-214.
  • Course staff:
    • William Cohen - office hour TBA GHC 8217
    • TAs and graders.
      • Dai Wei (wdai@andrew)
      • Abhinav Maurya (amaurya@andrew.cmu.edu)
      • Rahul Goutam (rgoutam@cs.cmu.edu)
      • Yun Ni (yunn@andrew.cmu.edu)
      • Yipei Wang (yipeiw@andrew.cmu.edu)
    • Sandy Winkler (sandyw@cs.cmu.edu, GHC 8219) - course secretary
  • Syllabus: not yet posted, but it will roughly follow the outline of last year's course, with a few adjustments.

10-605 vs 10-805 and 11-805

11-805 is just a cross-listing for 10-805: there's no difference in grading policies, etc. If you're an LTI student you probably want to use the LTI number.

10/11-805 will share lectures with 10-605, but 805 students need to make class presentations and complete a research project, and will do fewer programming assignments. So 805 students are expected to be capable of surveying recent literature and conducting research. Several lecture sessions for 10-605 will also be reserved for 805 students' presentations. To attend 10/11-805 you must be a PhD student, or be in the MLD's MS program, or get permission from the instructor.

Essentially 10/11-805 will be conducted as a graduate seminar that will share lectures (and some assignments) with 10-605. The major grade for students in 805 will be based on a research project, to be presented as a conference-length paper at the end of the semester. Students in 805 will also be responsible for selecting recent research results (after consultation with the instructor) and presenting them in class.

If there is sufficient interest we will introduce a mechanism for 10-605 students to collaborate with 805 students on projects.

The grading policy for 10-605 is

  • 70% assignments. There will be eight assignments, and you must do at least seven of the eight (i.e., you can drop one).
    • Note that several of the assignments are very cumulative, and that the final assignment is due close to the time for the in-class final.
  • 25% in-class exam.
  • 5% class participation.

The grading policy for 10/11-805 is

  • 40% assignments. There will be eight assignments, and you must do four of the eight (i.e., you can drop four).
  • 20% in-class exam.
  • 10% for presenting a recent technical paper to the class, with an (approximately) 20-minute talk. This paper should be approved by William in advance.
  • 30% for a final project.
    • The project will be an open-ended project, and should be done in teams of two. The final deliverable is a conference-length paper.

Important virtual places/account information

  • Autolab page: for programming assignments.
  • Piazza page for class: for questions and discussion.
    • Unlike the lower-level CS courses, the staff for this class is small, so do NOT expect immediate answers to questions.
    • For questions NOT of general interest (e.g., "My girlfriend is in town, can I have an extension?") you can also email the instructors' mailing list, 10605-Instructors@cs.cmu.edu
  • MediaTech page for lectures: TBA
    • Lectures should be posted within 24 hours.
  • AFS data repository /afs/cs.cmu.edu/project/bigML
  • Hadoop cluster information.
  • Guide for Happy Hadoop Hacking - Malcolm's advice on effectively writing Hadoop programs.
  • For TAs/instructors only:
    • Autolab AFS dir /afs/cs.cmu.edu/academic/class/10605-s15/autolab - students cannot read this
    • Planning gDoc: - students cannot read this

Description

Large datasets are difficult to work with for several reasons. They are difficult to visualize, and it is difficult to understand what sort of errors and biases are present in them. They are computationally expensive to process, and often the cost of learning is hard to predict - for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.

This course is intended to provide students with practical knowledge of, and experience with, the issues involving large datasets. Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements for learning methods, such as feature hashing and Bloom filters; and techniques for analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity.
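
As a rough illustration of one memory-reduction technique named above, here is a minimal sketch of feature hashing. It is not taken from the course materials: the function name hash_features, the default of 1024 buckets, and the use of CRC32 as the hash function are all assumptions made for this example. The point is that the feature vector's size is fixed by the number of buckets, so no vocabulary dictionary has to be kept in memory.

```python
import zlib
from collections import Counter


def hash_features(tokens, num_buckets=1024):
    """Map a bag of string tokens to a sparse count vector with at most
    num_buckets dimensions (hypothetical helper, for illustration only)."""
    vec = Counter()
    for tok in tokens:
        # CRC32 is deterministic across runs, unlike Python's built-in
        # hash() for strings, which is randomized per process.
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] += 1
    return dict(vec)


if __name__ == "__main__":
    doc = "the quick brown fox jumps over the lazy dog".split()
    print(hash_features(doc, num_buckets=16))
```

Distinct tokens can collide in the same bucket. Choosing num_buckets trades memory against the number of collisions.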

Syllabus

I'm following previous versions of the class, as posted below:

Some lectures (probably 4-6) will be reserved for presentations by students in 10-805.

Prerequisites for 10-605

An introductory course in machine learning, like 10-601 or 10-701, is a prerequisite or a co-requisite. If you plan to take this course and 10-601 concurrently please tell the instructor.

The course will include several substantial programming assignments, so an additional prerequisite is 15-210, or 15-214, or comparable familiarity with Java and good programming skills.

Self-assessment for students:

  • Students, especially graduate students, come to CMU with a variety of different backgrounds, so formal course prereqs are hard to establish. There is a short self-assessment test to see if you have the necessary background for 10-605. We recommend that all students take this before enrolling in 10-605 to see if they have the necessary background knowledge already, or if they need to review and/or take additional courses. Note: this is the same self-assessment I used for 10-601 last fall - you don't need to worry about it if you've had 10-601 and done well. Also, Section 4 can be skipped.

Prerequisites for 10-805

In addition to the prerequisites for 10-605, students in 10-805 will be expected to have additional mathematical maturity and research skills.

Policies and FAQ

  • Can I take the class pass/fail? Or, can I audit? My policy is to give priority to students who are taking the class for a grade, so you cannot sign up for the class pass/fail or as an audit unless the waitlist clears.
  • Can I get an extension on ...? No, but you can get 50% credit for up to 48 hours after the assignment is due, and you can drop your lowest assignment grade.

Policy on Collaboration among Students

The purpose of student collaboration is to facilitate learning, not to circumvent it. Studying the material in groups is strongly encouraged. You may also seek help from other students in understanding the material needed to solve a particular homework problem, provided no written notes are shared, or are taken at that time, and provided learning is facilitated, not circumvented. The actual solution must be done by each student alone, and the student should be ready to reproduce their solution upon request.

The presence or absence of any form of help or collaboration, whether given or received, must be explicitly stated and disclosed in full by all involved, on the first page of their assignment. Specifically, each assignment solution must start by answering the following questions:

(1) Did you receive any help whatsoever from anyone in solving this assignment? Yes / No.
If you answered 'yes', give full details: _______________ (e.g. "Jane explained to me what is asked in Question 3.4")
(2) Did you give any help whatsoever to anyone in solving this assignment? Yes / No.
If you answered 'yes', give full details: _______________ (e.g. "I pointed Joe to section 2.3 to help him with Question 2")

Collaboration without full disclosure will be handled severely, in compliance with CMU's Policy on Cheating and Plagiarism. Except in unusual extenuating circumstances, my policy is to fail the student(s) for the entire course.

As a related point, many of the homework assignments used in this class may have been used in prior versions of this class, or in classes at other institutions. Avoiding the use of heavily tested assignments will detract from the main purpose of these assignments, which is to reinforce the material and stimulate thinking. Because some of these assignments may have been used before, solutions to them may be (or may have been) available online, or from other people. It is explicitly forbidden to use any such sources, or to consult people who have solved these problems before. You must solve the homework assignments completely on your own. I will mostly rely on your wisdom and honor to follow this rule, but if a violation is detected it will be dealt with harshly. Collaboration with other students who have previously taken the class is allowed, but only under the conditions stated above.

These policies are the same as those used in Dr. Rosenfeld's previous version of 10-601, and in my version of 10-601 in fall 2014.