Machine Learning with Large Datasets 10-605 in Fall 2017
Instructor and Venue
- Instructor: William Cohen, Machine Learning Dept and LTI
- When/where: Porter Hall 100, 1:30-2:50pm, Tuesdays and Thursdays
- Course Number: ML 10-605 and 10-805
- Prerequisites:
- a CMU intro machine learning course (e.g., 10-701, 10-715, 10-601, 10-401).
- You may take this concurrently with 401/601/701 with the instructor's permission.
- Good programming skills, e.g., 15-210, or 15-214 or equivalent.
- Course staff:
- William Cohen
- TAs:
- Rose Catherine Kanjirathinkal rosecatherinek -at- cs.cmu.edu
- Anant Subramanian assubram -at- andrew.cmu.edu
- Bo Chen bo.chen.mt -at- gmail.com
- Hu Chen chenh1 -at- cs.cmu.edu
- Minxing Liu minxing1 -at- andrew.cmu.edu
- Ning Dong ndong1 -at- cs.cmu.edu
- Sarguna Padmanabhan (Janani) sjpadman -at- andrew.cmu.edu
- Tao Lin tao.lin -at- cs.cmu.edu
- Yifan “Nick” Yang yang1fan2 -at- gmail.com
- Yuhan Mao yuhanm -at- cs.cmu.edu
- Dorothy Holland-Minkley (dfh@cs.cmu.edu, GHC xxxx) - course admin
- Syllabus: Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2017
Office hours
- See the link here: [1]
Important virtual places/account information
- Autolab page: for assignments
- Piazza page for class: home for questions and discussion. You should sign up for the class here with your Andrew email.
- For mailing the instructors questions NOT of general interest (e.g., "My girlfriend is in town, can I have an extension?") use the mailing list, 10605-Instructors@cs.cmu.edu
- Recorded lectures: Panopto will handle video recording this year.
- Last year's YouTube channel for lectures: Machine Learning CMU 10-605 Fall 2016
- AFS data repository /afs/cs.cmu.edu/project/bigML
- Hadoop cluster information - students should have an account on the OpenCloud cluster.
- The location of the streaming JAR for this cluster is: /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar
- Amazon Elastic MapReduce information
- Guide for Happy Hadoop Hacking - Malcolm's advice on effectively writing Hadoop programs.
- For TAs/instructors only:
- Configuring Hadoop jobs on Autolab [2]
- Autolab AFS dir /afs/cs.cmu.edu/academic/class/10605-f17/ - students cannot read this. TAs, you may need to use 'aklog cs.cmu.edu' before accessing it.
- Planning gDoc: - students cannot read this
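For students new to Hadoop streaming, here is a minimal sketch of the kind of mapper/reducer pair the streaming JAR listed above expects. It is only an illustration, not course-provided code: the file name wc.py, the function names, and the "map"/"reduce" command-line convention are all hypothetical. The streaming contract it relies on is that both stages read lines from stdin and write tab-separated key/value lines to stdout, and that the reducer sees its input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each key; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def main(argv):
    # Hypothetical convention: the first argument ("map" or "reduce") picks the stage
    stage = map_lines if argv[1] == "map" else reduce_lines
    for record in stage(sys.stdin):
        print(record)


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Locally, the shuffle can be simulated with sorted(map_lines(...)) before calling reduce_lines. On the cluster, a script like this would typically be shipped with the generic -files option and named in the -mapper and -reducer options of the streaming JAR, together with -input and -output paths of your own; check the cluster documentation for the exact invocation.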
Description
Large datasets are difficult to work with for several reasons. They are difficult to visualize, and it is difficult to understand what sort of errors and biases are present in them. They are computationally expensive to process, and often the cost of learning is hard to predict - for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.
This course is intended to provide students with practical knowledge of, and experience with, the issues involving large datasets. Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements for learning methods, such as feature hashing and Bloom filters; and techniques for analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity.
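To give a flavor of the memory-reduction techniques mentioned above, here is a small sketch of feature hashing. It is an illustration under assumptions, not the version used in the assignments: the function name hash_features, the bucket count, and the choice of zlib.crc32 as the hash (picked because it is deterministic across runs) are all mine. The idea is that string features are mapped straight to indices in a fixed-size vector, so no feature dictionary has to be kept in memory.

```python
import zlib
from collections import defaultdict


def hash_features(tokens, num_buckets=2**18):
    """Map a bag of string features to a sparse vector with num_buckets dimensions."""
    vec = defaultdict(float)
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))  # unsigned 32-bit hash
        index = h % num_buckets
        # Use the top hash bit as a sign so that colliding features tend to cancel
        sign = 1.0 if (h >> 31) & 1 == 0 else -1.0
        vec[index] += sign
    return dict(vec)
```

The memory used by a model over these vectors is bounded by num_buckets regardless of how many distinct features appear in the stream; the price is that unrelated features occasionally collide.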
Course Outcomes
In summary, students who successfully complete the course should be able to:
- Implement non-trivial algorithms that make effective use of map-reduce infrastructure, like Hadoop, to work with text corpora and large graphs.
- Analyze the time and communication complexity of map-reduce algorithms.
- Discuss coherently the differences between different dataflow languages.
- Implement algorithms using dataflow languages.
- Implement streaming learning algorithms that make use of sparsity to efficiently process examples from a very large feature set.
- Understand and explain the algorithms used for automatic differentiation in deep-learning platforms such as Theano, Tensorflow, Torch, and PyTorch.
- Implement learning algorithms that make use of parameter servers.
- Explain the scalability issues in probabilistic models estimated by sampling methods, and discuss approaches to fast sampling.
- Understand the potential applications of large graphs to semisupervised and unsupervised learning problems.
- Implement a general framework for developing gradient-descent optimizers for machine learning applications.
- Explain the differences between, and recognize potential applications of, randomized data structures such as locality sensitive hashing, Bloom filters, random projections, and count-min sketches.
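As a concrete example of one of the randomized data structures named in the last outcome, here is a minimal Bloom filter sketch. The class name, the default sizes, and the use of salted SHA-256 digests to derive the bit positions are illustrative assumptions, not a reference implementation from the course. The structure answers set-membership queries in fixed memory: it never reports a false negative, but may report a false positive.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with the hash index
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

After bf = BloomFilter() and bf.add("hadoop"), the test "hadoop" in bf is always true; a word that was never added is usually, but not always, reported as absent.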
Syllabus
I'm mostly following previous versions of the class, as posted below:
- Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2016
- Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015
- Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2015
- Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014
- Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2013
- Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012
Recitations
There will be no recitations in fall 2017.
Prerequisites for 10-605/805
An introductory course in machine learning (one of 10-401, 10-601, 10-701, or 10-715) is a prerequisite or a co-requisite. Except in very rare cases I will NOT approve substitutions---it's just not feasible for me to check equivalence of a different intro ML class, or set of classes, for a large number of students.
The course will include several substantial programming assignments, so an additional prerequisite is 15-210, or 15-214, or comparable familiarity with Python and Java, and good programming skills.
In addition to the prerequisites for 10-605, students in 10-805 will be expected to have additional mathematical maturity and research skills.
Self-assessment for students:
- Students, especially graduate students, come to CMU with a variety of different backgrounds, so formal course prereqs are hard to establish. There is a short self-assessment test to see if you have the necessary background for 10-605. We recommend that all students take this before enrolling in 10-605 to see if they have the necessary background knowledge already, or if they need to review and/or take additional courses. Note: this is the same self-assessment I used for 10-601 last fall - you don't need to worry about it if you've had 10-601 and done well. Also, Section 4 can be skipped.
10-605 vs 10-805 and 10-405
10-605 will be similar to previous years. Several assignments will be substantially updated, however. 10-605 students are graded based mainly on programming assignments and a final.
10-805 is a graduate seminar that will share lectures (and some assignments) with 10-605. 10-805 has a restricted enrollment: to enroll you must be enrolled in an SCS PhD program or have the consent of the instructor. 10-805 is a project course: the main grade is based on a student-defined open-ended research project, which will be presented to the class, and also written up in a conference-paper format.
10-805 will share lectures with 10-605, but grading and expectations are quite different.
10-605 students also have an option to collaborate with 805 students on projects.
In spring 2018 I will be teaching a course similar to 10-605 but aimed at undergraduates. (The course number hasn't been assigned yet but I'm hoping for 10-405.) If you're an undergrad you might want to wait for this.
Grading Policies
The grading policy for 10-605 for students not participating in a project is:
- 60% assignments. There will be seven assignments, and you must do at least six of the seven. (I.e., you can drop one).
- 15% for an in-class midterm exam.
- 20% for an in-class final exam.
- 5% class participation and in-class quizzes.
The grading policy for 10-605 for students participating in a project is:
- 50% assignments. There will be seven assignments, and you must do at least five of the seven. (I.e., you can drop two).
- 15% for an in-class midterm exam.
- 30% for your part in the final project (you need not take the final exam).
- 5% class participation and in-class quizzes.
The grading policy for 10-805 is:
- 40% assignments. There will be seven assignments, and you must do at least four of the seven. (I.e., you can drop three).
- 15% for an in-class midterm exam.
- 40% for a final project (you need not take the final exam).
- The project will be an open-ended project, and should be done in teams. The final deliverable is a conference-length paper.
- 5% class participation and in-class quizzes.
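The three weightings above can be sketched as a small calculation. This is an unofficial illustration: the policy names, the 0-100 score scale, equal weighting of assignments, and the rule that the lowest-scoring assignments are the ones dropped are all assumptions, since the policy only says how many assignments must be done.

```python
# Weights copied from the three policies; "keep" is the minimum number of the
# seven assignments that must be done under each policy.
POLICIES = {
    "605": {"assignments": 0.60, "midterm": 0.15, "final": 0.20,
            "participation": 0.05, "keep": 6},
    "605-project": {"assignments": 0.50, "midterm": 0.15, "final": 0.30,
                    "participation": 0.05, "keep": 5},
    "805": {"assignments": 0.40, "midterm": 0.15, "final": 0.40,
            "participation": 0.05, "keep": 4},
}


def course_score(policy, assignments, midterm, final, participation):
    """Weighted score on a 0-100 scale.

    `final` is the final-exam score for "605" and the project score otherwise.
    Assumption: assignments count equally and the lowest scores are dropped;
    if fewer than `keep` are submitted, the missing ones count as zero.
    """
    p = POLICIES[policy]
    kept = sorted(assignments, reverse=True)[: p["keep"]]
    assignment_avg = sum(kept) / p["keep"]
    return (p["assignments"] * assignment_avg + p["midterm"] * midterm
            + p["final"] * final + p["participation"] * participation)
```

For example, a 10-605 student without a project who scores 100 on six assignments, skips the seventh, and gets 80 on the midterm, 90 on the final, and full participation credit would come out at 60 + 12 + 18 + 5 = 95 under these assumptions.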
Project Info
Project suggestions from previous years:
- Projects_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2012
- Projects_for_Machine_Learning_with_Large_Datasets_10-605_in_Spring_2014
Sample projects from previous years:
You should do your project on a team of 2-3 people. Ask William if you want to work alone, or in a team of 4. Your project writeup will be submitted in ACM format and should be between 6 and 12 pages, including a bibliography. It should include at least these parts (not necessarily in this order):
- An introduction, motivating the work from a technical point of view, and summarizing what was done.
- Related work, outlining your contributions and intended contributions.
- A detailed description of algorithms you've used (and especially, ones you implemented).
- Clearly described, reproducible experiments, which typically will compare some new method/approach to a baseline, or describe a new application of an existing set of techniques.
- A conclusion, where you summarize what was done. If your new ideas proved to be unsuccessful (which does happen for class projects) you should also discuss plausible reasons why the technique didn't work, and what would seem to be the most promising approach to continuing this work if you had time.
- A bibliography of complete citations, which will primarily be textbooks or technical papers, not secondary sources (e.g., Wikipedia).
Policies and FAQ
- When should I email William about co-reqs/805/waitlist issues?
- If you are a master's student or undergrad and want to take 805, send William your cv (see below) and ask.
- If you're taking 10-605 and 10-401/601/701/715 concurrently, you don't need to ask for permission - we can now check co-reqs automatically. Just sign up.
- Is there a textbook? No: but I do have written notes for some topics (e.g., naive bayes and streaming) which are linked to from the wiki.
- I forgot to take a quiz - can I make it up? There are no makeups for the quizzes. Their purpose is to ensure that you follow along with the class and review the lectures promptly, and makeups would defeat that goal.
- Can I get off the waitlist? We're in the process of setting up an on-line section for 10-605, for students that are waitlisted.
- Can I take the class pass/fail? Or, can I audit? We will allow audits for the on-line section only. If you audit, you should take at least 50% of the quizzes (on-time, which means within 24 hours of the lecture) and sit for the exams (but you don't need to pass them).
- Can I get an extension on ....? Generally no, but you can get 50% credit for up to 48 hrs after the assignment is due, and you can drop your lowest assignment grade. If you have a documented medical issue or something similar email William.
- I'm in 605 and I want to switch to 805 - what do I do? If you're in a PhD program, just sign up. Otherwise, email William your cv and anything else you think is relevant and ask him. The main criterion is experience in doing research projects.
- Will 605/805 be offered in spring 2018? No, the next time it will be offered is fall 2018. I will be teaching an undergrad version of 10-605 in the spring, though, so if you're an undergrad you might wait for that.
- I'm completely stressed out about all my classes, and/or having a terrible time because... What should I do? Here's some common-sense advice (in this context "common-sense" means "nobody is surprised to hear this but nobody actually acts on it either"):
- Take care of yourself. Eat well, make exercise a priority, get enough sleep, and take some time to relax.
- Stay organized. Spend some time (say 60 min) every week reviewing your long- and short-term goals, and then spend the rest of the time not obsessing about what's due, but working efficiently on your priorities. I personally recommend the GTD approach.
- Get help when you need it. All of us benefit from support sometimes. You are not alone. As a CMU student you have access to many resources, and asking for support sooner rather than later is often helpful. If you're experiencing any academic stress, difficult life events, or feelings like anxiety or depression, Counseling and Psychological Services (CaPS) is available: call 412-268-2922 or visit their website at http://www.cmu.edu/counseling/. Or consider reaching out to a friend, faculty or family member you trust.
- Look out for each other. If your friends or classmates are stressed out or depressed, talk to them and remind them that support is available. Importantly, if you or someone you know is feeling suicidal or in danger of self-harm, call someone immediately, day or night. You can call CaPS (412-268-2922), the Re:solve Crisis Network (888-796-8226) or, if the situation is life threatening, the police (on campus: CMU Police: 412-268-2323, and off campus: 911).
Policy on Collaboration among Students
The purpose of student collaboration is to facilitate learning, not to circumvent it. Studying the material in groups is strongly encouraged. It is also allowed to seek help from other students in understanding the material needed to solve a particular homework problem, provided no written notes are shared, or are taken at that time, and provided learning is facilitated, not circumvented. The actual solution must be done by each student alone, and the student should be ready to reproduce their solution upon request.
The presence or absence of any form of help or collaboration, whether given or received, must be explicitly stated and disclosed in full by all involved, on the first page of their assignment. Specifically, each assignment solution must start by answering the following questions:
(1) Did you receive any help whatsoever from anyone in solving this assignment? Yes / No. If you answered 'yes', give full details: _______________ (e.g. "Jane explained to me what is asked in Question 3.4")
(2) Did you give any help whatsoever to anyone in solving this assignment? Yes / No. If you answered 'yes', give full details: _______________ (e.g. "I pointed Joe to section 2.3 to help him with Question 2")
Additionally, if you share any material or collaborate in any way between the time the assignment is due and the last time when the assignment can be handed in for partial credit, you must notify the instructor of this help in writing (e.g., via email).
Collaboration without full disclosure will be handled severely, in compliance with CMU's Policy on Cheating and Plagiarism. Except in unusual extenuating circumstances, my policy is to fail the student(s) for the entire course.
As a related point, many of the homework assignments used in this class may have been used in prior versions of this class, or in classes at other institutions. Avoiding the use of heavily tested assignments will detract from the main purpose of these assignments, which is to reinforce the material and stimulate thinking. Because some of these assignments may have been used before, solutions to them may be (or may have been) available online, or from other people. It is explicitly forbidden to use any such sources, or to consult people who have solved these problems before. You must solve the homework assignments completely on your own. I will mostly rely on your wisdom and honor to follow this rule, but if a violation is detected it will be dealt with harshly. Collaboration with other students who have previously taken the class is allowed, but only under the conditions stated above.
These policies are the same as were used in Dr. Rosenfeld's previous version of 601, and my version of 10-601 in fall 2014.