Machine Learning with Large Datasets 10-605 in Spring 2015

Instructor and Venue

  • Instructor: William Cohen, Machine Learning Dept and LTI
  • When/where: TR 10:30-11:50am in BH A51
  • Course Number: ML 10-605 and 10-805
  • Prerequisites:
    • a machine learning course (e.g., 10-701 or 10-601). You may take this concurrently with the instructor's permission.
    • Java programming skills, e.g., 15-210, or 15-214.
  • Course staff:
    • William Cohen
    • TAs and graders.
      • Dai Wei (wdai@andrew)
      • Abhinav Maurya (ahmaurya+10605@gmail.com)
      • Rahul Goutam (rgoutam@cs.cmu.edu)
      • Yun Ni (yunn@andrew.cmu.edu)
      • Yipei Wang (yipeiw@andrew.cmu.edu)
    • Sandy Winkler (sandyw@cs.cmu.edu, GHC 8219) - course secretary
  • Syllabus: Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2015 (subject to changes)

Office hours

Instructor/TA | Day | Time | Location
William Cohen | Friday | 3-4pm | GHC 8217
Dai Wei | Tuesday | 1:30 - 2:30pm | GHC 8011
Abhinav Maurya | Saturday | 10:30am - 11:30am* | Hamburg Hall 3030
Rahul Goutam | Thursday | 4pm - 5pm | GHC 5508
Yun Ni | Monday | 3pm - 4pm | Multi-purpose Room at Hunt Library Basement
Yipei Wang | Wednesday | 4:30pm - 5:30pm | GHC 6405

* Please email by 10am if you plan to come to office hours.

10-605 vs 10-805 and 11-805

10-605 will be more-or-less the same as in previous years (modulo the usual course updates).

10-805 is a new course: basically, it is a graduate seminar that will share lectures (and some assignments) with 10-605.

11-805 is just a cross-listing for 10-805: there's no difference in grading policies, etc. If you're an LTI student you probably want to use the LTI number.

The major grade for students in 805 will be based on a research project, to be presented as a poster, and a conference-length paper at the end of the semester. Students in 805 will also be responsible for selecting recent research results (after consultation with the instructor) and presenting them in class. I've reserved six class sessions for presentations of this sort, and we will have three presentations per session, which is why enrollment for 805 is capped at 18. (These replace the guest lectures I've had in past versions of 605, which were usually recent research results and/or presentations from industry.)

10/11-805 will share lectures with 10-605, but grading and expectations are quite different. 605 grades are mainly programming assignments and an exam. 10-805 students will do some programming assignments, but the main grades are for their presentations and the research project, which is open-ended. They are not required to take the final exam. Since 805 students are expected to be capable of surveying recent literature and conducting research, to enroll in 10/11-805 you must be a PhD student, or get permission from the instructor.

If there is sufficient interest we may introduce a mechanism for 10-605 students to collaborate with 805 students on projects.

Grading Policies

The grading policy for 10-605 for students not participating in a project is:

  • 70% assignments. There will be eight assignments, and you must do at least seven of the eight. (I.e., you can drop one).
    • Note that several of the assignments are very cumulative, and that the final assignment is due close to the time for the in-class final.
  • 25% in-class exam.
  • 5% class participation and in-class quizzes.

The grading policy for 10-605 for students participating in a project is:

  • 60% assignments. There will be eight assignments, and you must do at least six of the eight. (I.e., you can drop two).
  • 35% for your part in the final project.
  • 5% class participation and in-class quizzes.

The grading policy for 10/11-805 is:

  • 40% assignments. There will be eight assignments, and you must do four of the eight. (I.e., you can drop four).
  • 20% for presenting a recent technical paper to the class, with an (approximately) 20-minute talk. This paper should be approved by William in advance.
  • 40% for a final project.
    • The project will be an open-ended project, and should be done in teams of two. The final deliverable is a conference-length paper.

Students participating in a project need not take the final exam.

The exam will be an in-class final on the last day of class (April 30th, 2015).

Project Info

For students wishing to participate in a project

There are two projects that are looking for other participants. If you're interested you should contact the lead students.

For students initiating a project

Project suggestions from previous years:

Sample projects from previous years:

Your project writeup will be submitted in ACM format and should be between 6 and 12 pages, including a bibliography. You should include at least these parts (not necessarily in this order).

  • An introduction, motivating the work from a technical point of view, and summarizing what was done.
  • Related work, outlining your contributions and intended contributions.
  • A detailed description of algorithms you've used (and especially, ones you implemented).
  • Clearly described, reproducible experiments, which typically will compare some new method/approach to a baseline, or describe a new application of an existing set of techniques.
  • A conclusion, where you summarize what was done. If your new ideas proved to be unsuccessful (which does happen for class projects) you should also discuss plausible reasons why the technique didn't work, and what would seem to be the most promising approach to continuing this work if you had time.
  • A bibliography of complete citations, which will primarily be textbooks or technical papers, not secondary sources (e.g., Wikipedia).

Important virtual places/account information

  • Autolab page: for programming assignments.
  • Piazza page for class: for questions and discussion.
    • Unlike the lower-level CS courses, the staff for this class is small, so do NOT expect immediate answers to questions.
    • For mailing the instructors questions NOT of general interest (eg, "My girlfriend is in town, can I have an extension?") you can also use the emailing list, 10605-Instructors@cs.cmu.edu
  • MediaTech page: for lecture videos
    • Lectures should be posted within 24 hours.
    • Last year's lectures (spring 2014) are here
  • AFS data repository /afs/cs.cmu.edu/project/bigML
  • Hadoop cluster information
  • Guide for Happy Hadoop Hacking - Malcolm's advice on effectively writing Hadoop programs.
  • For TAs/instructors only:
    • Configuring Autolab [1]
    • Autolab AFS dir /afs/cs.cmu.edu/academic/class/10605-s15/autolab - students cannot read this
    • Planning gDoc: - students cannot read this

Description

Large datasets are difficult to work with for several reasons. They are difficult to visualize, and it is difficult to understand what sort of errors and biases are present in them. They are computationally expensive to process, and often the cost of learning is hard to predict: for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.

This course is intended to provide students with practical knowledge of, and experience with, the issues involving large datasets. Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements for learning methods, such as feature hashing and Bloom filters; and techniques for analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity.
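
As a rough illustration of one of the memory-reduction techniques named above, feature hashing can be sketched as follows. This is a minimal sketch, not course material: the function name `hash_features`, the bucket count, and the choice of CRC32 as the hash function are all assumptions made for the example. The idea is that tokens are mapped into a fixed number of buckets, so memory use no longer grows with vocabulary size, at the cost of occasional collisions.

```python
import zlib


def hash_features(tokens, num_buckets=16):
    """Map a sequence of tokens to a fixed-length count vector.

    Hypothetical helper for illustration: each token is hashed to one of
    num_buckets positions, so no vocabulary dictionary is kept in memory.
    Distinct tokens may collide in the same bucket.
    """
    vec = [0] * num_buckets
    for tok in tokens:
        # CRC32 is used here (an assumption) because it is deterministic
        # across runs, unlike Python's built-in hash() for strings.
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] += 1
    return vec


if __name__ == "__main__":
    doc = "large datasets are large and datasets are expensive".split()
    print(hash_features(doc, num_buckets=8))
```

The vector length depends only on `num_buckets`, so the same function can be applied to a stream of documents one at a time without ever building a vocabulary.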

Syllabus

I'm following previous versions of the class, as posted below:

Some lectures (probably 4-6) will be reserved for presentation from students in 10-805.

Recitations

TA | Day | Time | Location | Topic | Presentation
Abhinav | 3 Feb | 5:30-7pm | Doherty Hall 2315 | AWS | PDF
Dai Wei | 4 Feb | 4:30-5:30pm | Doherty Hall 1212 | AWS EMR | Google Doc (Comment allowed)
Dai Wei | 20 Mar | 4:30-5:30pm | Doherty Hall 2315 | Gephi | Google Doc (Comment allowed)
Abhinav | 3 April | 12pm-1:30pm | Hamburg Hall 1000 | Spark | PDF

Prerequisites for 10-605

An introductory course in machine learning, like 10-601 or 10-701, is a prerequisite or a co-requisite. If you plan to take this course and 10-601 concurrently please tell the instructor.

The course will include several substantial programming assignments, so an additional prerequisite is 15-210, or 15-214, or comparable familiarity with Java and good programming skills.

Self-assessment for students:

  • Students, especially graduate students, come to CMU with a variety of different backgrounds, so formal course prereqs are hard to establish. There is a short self-assessment test to see if you have the necessary background for 10-605. We recommend that all students take this before enrolling in 10-605 to see if they have the necessary background knowledge already, or if they need to review and/or take additional courses. Note: this is the same self-assessment I used for 10-601 last fall - you don't need to worry about it if you've had 10-601 and done well. Also, Section 4 can be skipped.

Prerequisites for 10-805

In addition to the prerequisites for 10-605, students in 10-805 will be expected to have additional mathematical maturity and research skills.

Policies and FAQ

  • Can I take the class pass/fail? Or, can I audit? My policy is to give priority to students that are taking the class for a grade, so you cannot sign up for the class pass/fail or as an audit unless the waitlist clears.
  • Can I get an extension on ....? No, but you can get 50% credit for up to 48 hrs after the assignment is due, and you can drop your lowest assignment grade.
  • Will I get off the waitlist? I don't know. It's not possible to enroll more people in the class than there are seats in the room, I don't personally administer the list, and I don't know of a fairer way to allocate spots than FIFO, so I probably will not change your position on the list. The waitlist cleared in 2012, 2013, and 2014, if that's a comfort. We will also offer the course this fall, so you may have a chance then. Also, the lectures and assignments will be made available to everyone, even if you aren't officially enrolled.

Policy on Collaboration among Students

The purpose of student collaboration is to facilitate learning, not to circumvent it. Studying the material in groups is strongly encouraged. It is also allowed to seek help from other students in understanding the material needed to solve a particular homework problem, provided no written notes are shared, or are taken at that time, and provided learning is facilitated, not circumvented. The actual solution must be done by each student alone, and the student should be ready to reproduce their solution upon request.

The presence or absence of any form of help or collaboration, whether given or received, must be explicitly stated and disclosed in full by all involved, on the first page of their assignment. Specifically, each assignment solution must start by answering the following questions:

(1) Did you receive any help whatsoever from anyone in solving this assignment? Yes / No.
If you answered 'yes', give full details: _______________ (e.g. "Jane explained to me what is asked in Question 3.4")
(2) Did you give any help whatsoever to anyone in solving this assignment? Yes / No.
If you answered 'yes', give full details: _______________ (e.g. "I pointed Joe to section 2.3 to help him with Question 2")

Collaboration without full disclosure will be handled severely, in compliance with CMU's Policy on Cheating and Plagiarism. Except in unusual extenuating circumstances, my policy is to fail the student(s) for the entire course.

As a related point, many of the homework assignments used in this class may have been used in prior versions of this class, or in classes at other institutions. Avoiding the use of heavily tested assignments will detract from the main purpose of these assignments, which is to reinforce the material and stimulate thinking. Because some of these assignments may have been used before, solutions to them may be (or may have been) available online, or from other people. It is explicitly forbidden to use any such sources, or to consult people who have solved these problems before. You must solve the homework assignments completely on your own. I will mostly rely on your wisdom and honor to follow this rule, but if a violation is detected it will be dealt with harshly. Collaboration with other students who have previously taken the class is allowed, but only under the conditions stated above.

These policies are the same as were used in Dr. Rosenfeld's previous version of 601, and my version of 10-601 in fall 2014.