Machine Learning with Large Datasets 10-605 in Fall 2015

From Cohen Courses

Instructor and Venue

  • Instructor: William Cohen, Machine Learning Dept and LTI
  • When/where: Tu-Thu 4:30-5:50pm in DH 2210
  • Course Number: ML 10-605 and 10-805
  • Prerequisites:
    • a CMU intro machine learning course (e.g., 10-701, 10-715 or 10-601).
      • You may take this concurrently with 601/701 with the instructor's permission.
    • Good programming skills, e.g., 15-210, or 15-214 or equivalent.
  • Course staff:
    • William Cohen
    • TAs:
      • Ankit Agarwal ankitaga@andrew.cmu.edu (CSD MS program)
      • Aurick Qiao aqiao@cs.cmu.edu (CSD PhD program)
      • Dheeru Dua ddua@andrew.cmu.edu (MIIS program)
      • Iosef Kaver ikaveror@andrew.cmu.edu (CSD MS program)
      • Kavya Srinet ksrinet@cs.cmu.edu (MCDS program)
      • Suraj Dharmapuram sdharmap@cs.cmu.edu (MCDS)
      • Tian Jin tjin1@andrew.cmu.edu (CSD junior)
      • Vincy Binbin Xiong bxiong@andrew.cmu.edu (INI program)
    • Sandy Winkler (sandyw@cs.cmu.edu, GHC 8219) - course secretary
  • Syllabus: Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015 (subject to changes)

Office hours

Dheeru Dua Thursday 12:00 - 1:00pm GHC 5417

See Piazza

Important virtual places/account information

  • Autolab page: for assignments
  • Blackboard page: grades for quizzes, finals and other non-autolab tests will be posted here.
  • Piazza page for class: home for questions and discussion.
    • Sign up
    • Piazza home for this class
    • Unlike the lower-level CS courses, the staff for this class is small, so do NOT expect immediate answers to questions.
    • To email the instructors questions NOT of general interest (e.g., "My girlfriend is in town, can I have an extension?"), use the mailing list 10605-Instructors@cs.cmu.edu
  • MediaTech page for on-line versions of lectures:
    • Lectures should be posted within 24 hours.
    • Notes:
      • The lecture on 9/1 was not recorded.
      • The lecture on 9/17 didn't have a slide feed; it's just video.
  • AFS data repository /afs/cs.cmu.edu/project/bigML
  • Hadoop cluster information - students should have an account on the OpenCloud cluster.
    • The location of the streaming JAR for this cluster is: /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar
  • Amazon Elastic MapReduce information
  • Guide for Happy Hadoop Hacking - Malcolm's advice on effectively writing Hadoop programs.
  • For TAs/instructors only:
    • Configuring Hadoop jobs on Autolab [1]
    • Autolab AFS dir /afs/cs.cmu.edu/academic/class/10605-f15 - students cannot read this
    • Planning gDoc: - students cannot read this

Description

Large datasets are difficult to work with for several reasons. They are difficult to visualize, and it is difficult to understand what sort of errors and biases are present in them. They are computationally expensive to process, and often the cost of learning is hard to predict - for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.

This course is intended to provide students with practical knowledge of, and experience with, the issues involving large datasets. Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements for learning methods, such as feature hashing and Bloom filters; and techniques for analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity.
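
As a rough illustration of one memory-reduction technique named above, feature hashing, here is a minimal Python sketch. It is not taken from the course materials: the function name hash_features, the bucket count, and the choice of zlib.crc32 as the hash function are all assumptions made for this example. The idea is to map each token to one of a fixed number of buckets, so the feature vector has a fixed length no matter how large the vocabulary grows.

```python
import zlib


def hash_features(tokens, num_buckets=16):
    """Map a bag of tokens to a fixed-length count vector (the hashing trick).

    Hypothetical helper for illustration only; unsigned variant, so
    colliding tokens simply add their counts together.
    """
    vec = [0] * num_buckets
    for tok in tokens:
        # crc32 is stable across runs, unlike Python's built-in hash() on str.
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] += 1
    return vec


doc = "large datasets need large memory".split()
vec = hash_features(doc, num_buckets=8)
print(vec)
```

Memory use depends only on num_buckets, not on the number of distinct tokens; the price is that unrelated tokens may collide in the same bucket.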

Syllabus

I'm mostly following previous versions of the class, as posted below:

Recitations

TBA

Prerequisites for 10-605

An introductory course in machine learning, like 10-601 or 10-701, is a prerequisite or a co-requisite. If you plan to take this course and 10-601 concurrently, please tell the instructor.

The course will include several substantial programming assignments, so an additional prerequisite is 15-210, or 15-214, or comparable familiarity with Python and Java, and good programming skills.

Self-assessment for students:

  • Students, especially graduate students, come to CMU with a variety of different backgrounds, so formal course prereqs are hard to establish. There is a short self-assessment test to see if you have the necessary background for 10-605. We recommend that all students take this before enrolling in 10-605 to see if they have the necessary background knowledge already, or if they need to review and/or take additional courses. Note: this is the same self-assessment I used for 10-601 last fall - you don't need to worry about it if you've had 10-601 and done well. Also, Section 4 can be skipped.

10-605 vs 10-805 and 11-805

10-605 will be similar to previous years. Several assignments will be substantially updated, however. 10-605 students are graded based mainly on programming assignments and a final.

10-805 is a graduate seminar that will share lectures (and some assignments) with 10-605. 10-805 has a restricted enrollment: to enroll you must be enrolled in an SCS PhD program or have the consent of the instructor. 10-805 is a project course: the main grade is based on a student-defined open-ended research project, which will be presented to the class, and also written up in a conference-paper format.

10-805 will share lectures with 10-605, but grading and expectations are quite different.

10-605 students also have an option to collaborate with 805 students on projects.

Grading Policies

The grading policy for 10-605 for students not participating in a project is:

  • 60% assignments. There will be seven assignments, and you must do at least six of the seven. (I.e., you can drop one).
  • 15% for an in-class midterm exam.
  • 20% for an in-class final exam.
  • 5% class participation and in-class quizzes.

The grading policy for 10-605 for students participating in a project is:

  • 50% assignments. There will be seven assignments, and you must do at least five of the seven. (I.e., you can drop two).
  • 15% for an in-class midterm exam.
  • 30% for your part in the final project (you need not take the final exam).
  • 5% class participation and in-class quizzes.

The grading policy for 10-805 is:

  • 40% assignments. There will be seven assignments, and you must do at least four of the seven. (I.e., you can drop three).
  • 15% for an in-class midterm exam.
  • 40% for a final project (you need not take the final exam).
    • The project will be an open-ended project, and should be done in teams. The final deliverable is a conference-length paper.
  • 5% class participation and in-class quizzes.

Project Info

Project suggestions from previous years:

Sample projects from previous years:

Your project writeup will be submitted in ACM format and should be between 6 and 12 pages, including a bibliography. You should include at least these parts (not necessarily in this order).

  • An introduction, motivating the work from a technical point of view, and summarizing what was done.
  • Related work, outlining your contributions and intended contributions.
  • A detailed description of algorithms you've used (and especially, ones you implemented).
  • Clearly described, reproducible experiments, which typically will compare some new method/approach to a baseline, or describe a new application of an existing set of techniques.
  • A conclusion, where you summarize what was done. If your new ideas proved to be unsuccessful (which does happen for class projects) you should also discuss plausible reasons why the technique didn't work, and what would seem to be the most promising approach to continuing this work if you had time.
  • A bibliography of complete citations, which will primarily be textbooks or technical papers, not secondary sources (e.g., Wikipedia).

Prerequisites for 10-805

In addition to the prerequisites for 10-605, students in 10-805 will be expected to have additional mathematical maturity and research skills.

Policies and FAQ

  • Can I take the class pass/fail? Or, can I audit? My policy is to give priority to students that are taking the class for a grade, so you cannot sign up for the class pass/fail or as an audit unless the waitlist clears. However, I'm hopeful that this spring there will be no waitlist - we have a large room, and the class was offered quite recently.
  • Can I get an extension on ....? Generally no, but you can get 50% credit for up to 48 hours after the assignment is due, and you can drop your lowest assignment grade. If you have a documented medical issue or something similar, email William.
  • I'm in 605 and I want to switch to 805 - what do I do? If you're in a PhD program, just sign up. Otherwise, email William your cv and anything else you think is relevant and ask him. The main criterion is experience in doing research projects.
  • I'm in 605 and I want to participate in a project with an 805 student - what do I do? Wait until the project proposals are due - they will be posted and there will be a chance to volunteer for projects then.
  • What do I need to do if I want to audit? Attend the lectures and sit for the mid-term, final, and quizzes. You don't need to study for the exams - mainly I'm interested to know how much you've absorbed in an audit.
  • Will 605 be offered in spring 2016? No, the next time it will be offered is fall 2016.

Policy on Collaboration among Students

The purpose of student collaboration is to facilitate learning, not to circumvent it. Studying the material in groups is strongly encouraged. It is also allowed to seek help from other students in understanding the material needed to solve a particular homework problem, provided no written notes are shared, or are taken at that time, and provided learning is facilitated, not circumvented. The actual solution must be done by each student alone, and the student should be ready to reproduce their solution upon request.

The presence or absence of any form of help or collaboration, whether given or received, must be explicitly stated and disclosed in full by all involved, on the first page of their assignment. Specifically, each assignment solution must start by answering the following questions:

(1) Did you receive any help whatsoever from anyone in solving this assignment? Yes / No.
If you answered 'yes', give full details: _______________ (e.g. "Jane explained to me what is asked in Question 3.4")
(2) Did you give any help whatsoever to anyone in solving this assignment? Yes / No.
If you answered 'yes', give full details: _______________ (e.g. "I pointed Joe to section 2.3 to help him with Question 2")

Additionally, if you share any material or collaborate in any way between the time the assignment is due and the last time when the assignment can be handed in for partial credit, you must notify the instructor of this help in writing (e.g., via email).

Collaboration without full disclosure will be handled severely, in compliance with CMU's Policy on Cheating and Plagiarism. Except in unusual extenuating circumstances, my policy is to fail the student(s) for the entire course.

As a related point, many of the homework assignments used in this class may have been used in prior versions of this class, or in classes at other institutions. Avoiding the use of heavily tested assignments will detract from the main purpose of these assignments, which is to reinforce the material and stimulate thinking. Because some of these assignments may have been used before, solutions to them may be (or may have been) available online, or from other people. It is explicitly forbidden to use any such sources, or to consult people who have solved these problems before. You must solve the homework assignments completely on your own. I will mostly rely on your wisdom and honor to follow this rule, but if a violation is detected it will be dealt with harshly. Collaboration with other students who have previously taken the class is allowed, but only under the conditions stated above.

These policies are the same as those used in Dr. Rosenfeld's previous version of 601, and my version of 10-601 in fall 2014.