Difference between revisions of "Machine Learning with Large Datasets 10-405 in Spring 2018"

From Cohen Courses
== Instructor and Venue ==

* Instructor: [http://www.cs.cmu.edu/~wcohen William Cohen], Machine Learning Dept and LTI
* When/where: GHC 4307, 3-4:30pm, Mondays and Wednesdays
* Course Number: ML 10-405
* Prerequisites:
** A CMU intro machine learning course (e.g., 10-701, 10-715, 10-601, 10-401).
*** You may take the intro course (401/601/701) concurrently with 405.
** Good programming skills, e.g., 15-210, 15-214, or equivalent.
* Course staff:
** William Cohen
** TAs:
*** Vivek Shankar vshanka1@andrew.cmu.edu
*** Nitish Kumar Kulkarni nitishkk@andrew.cmu.edu
*** Vidhan Agarwal vidhana@andrew.cmu.edu
*** Sarthak Garg sarthakg@andrew.cmu.edu
** Dorothy Holland-Minkley (dfh@cs.cmu.edu, GHC 8001) - ''course admin''
* Syllabus: [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018]]

== Office hours ==

* William: Monday 11am-12pm, 8217 Gates Hillman
* Vidhan: Tuesday 1:00pm-2:00pm, GHC 6th floor commons (opposite 6105)
* Sarthak: Wednesday 4:30pm-5:30pm, GHC 6th floor commons (opposite 6105)
* Vivek: Thursday 4:30pm-5:30pm, GHC 5th floor commons
* Nitish: Friday 3:00pm-4:00pm, GHC 6th floor commons (opposite 6105)

== Important virtual places/account information ==

''Mostly these are TBA''

* [https://autolab.andrew.cmu.edu/courses/10405-s18/assessments Autolab page]: for assignments
* [http://piazza.com/cmu/spring2018/10405/home Piazza page for class]: home for questions and discussion. You should [http://piazza.com/cmu/spring2018/10405 sign up] for the class with your Andrew email.
** For mailing the instructors questions NOT of general interest (e.g., "My girlfriend is in town, can I have an extension?"), use the mailing list, 10405-Instructors@cs.cmu.edu
* Recorded lectures: Lectures are not recorded for 10-405. However, many of the lectures overlap with 10-605, and recordings of those lectures are still available.
** Previous lectures from 10-605: [https://scs.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx#folderID=7ae48b11-88af-4f02-b179-28f65ea796a4 Fall 2017]
* AFS data repository: ''/afs/cs.cmu.edu/project/bigML''
* [https://wiki.pdl.cmu.edu/Stoat Hadoop cluster information] - students will receive an account on the OpenCloud cluster.
** The location of the streaming JAR for this cluster is: /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar
* [[Amazon Elastic MapReduce information]]
* [[Guide for Happy Hadoop Hacking]] - Malcolm's advice on effectively writing Hadoop programs.
* For TAs/instructors ''only'':
** [http://curtis.ml.cmu.edu/w/courses/index.php/Guide_for_Configuring_Autolab Configuring Hadoop jobs on Autolab]
** Autolab AFS dir ''/afs/cs.cmu.edu/academic/class/10405-s18/'' - students cannot read this. TAs, you may need to use 'aklog cs.cmu.edu' before accessing it.
** [https://docs.google.com/document/d/13uGcV-Alqe0Y_WWqRrdGs7VTn8CeZSvxrG9GK4nx5BU/edit#heading=h.my65mtwmsx4h Planning gDoc] - students cannot read this
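
The streaming JAR listed above lets Hadoop run ordinary scripts as mappers and reducers that exchange tab-separated key/value lines. As a rough illustration (not taken from the course materials), here is a minimal word-count sketch in Python. The function names and the <code>run_locally</code> helper are hypothetical, and the in-process sort only stands in for Hadoop's shuffle phase.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'word<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each key.

    The input must already be sorted by key, which is what the
    shuffle phase provides between map and reduce.
    """
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort (shuffle) -> reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))
```

On a real cluster the two functions would typically live in separate scripts that read standard input and print to standard output, and would be handed to the streaming JAR through its <code>-mapper</code>, <code>-reducer</code>, <code>-input</code>, and <code>-output</code> options. The exact invocation for OpenCloud should be taken from the cluster documentation.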
== Description ==

Large datasets are difficult to work with for several reasons. They are difficult to visualize, and it is difficult to understand what sort of errors and biases are present in them. They are computationally expensive to process, and often the cost of learning is hard to predict - for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.

This course is intended to provide students with practical knowledge of, and experience with, the issues involving large datasets. Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements for learning methods, such as feature hashing and Bloom filters; and techniques for analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity.
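
To make the combination of streaming learning and feature hashing a little more concrete, the sketch below hashes string features into a fixed number of buckets and applies a perceptron update one example at a time. It is only an illustration under stated assumptions: the bucket count, the choice of CRC32 as the hash function, and the function names are all hypothetical rather than anything prescribed by the course.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 2 ** 18  # assumed table size; memory stays fixed as the vocabulary grows


def hash_features(tokens, num_buckets=NUM_BUCKETS):
    """Map string tokens to a sparse {bucket_index: count} vector."""
    vec = defaultdict(float)
    for tok in tokens:
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] += 1.0
    return dict(vec)


def perceptron_update(weights, tokens, label, num_buckets=NUM_BUCKETS):
    """One streaming perceptron step; label is +1 or -1.

    Only the buckets touched by this example are read or written,
    so the cost depends on the example's sparsity, not the table size.
    Returns True if the example was misclassified and weights changed.
    """
    x = hash_features(tokens, num_buckets)
    score = sum(weights.get(i, 0.0) * v for i, v in x.items())
    if label * score <= 0:
        for i, v in x.items():
            weights[i] = weights.get(i, 0.0) + label * v
        return True
    return False
```

Distinct features can collide in the same bucket, which trades a small amount of accuracy for a hard bound on memory; no dictionary from feature names to indices ever has to be stored.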
=== Course Outcomes ===

In summary, students who successfully complete the course should be able to:

* Implement non-trivial algorithms that make effective use of map-reduce infrastructure, like Hadoop, to work with text corpora and large graphs.
* Analyze the time and communication complexity of map-reduce algorithms.
* Discuss coherently the differences between different dataflow languages.
* Implement algorithms using dataflow languages.
* Implement streaming learning algorithms that make use of sparsity to efficiently process examples from a very large feature set.
* Understand and explain the algorithms used for automatic differentiation in deep-learning platforms such as Theano, TensorFlow, Torch, and PyTorch.
* Implement learning algorithms that make use of parameter servers.
* Explain the scalability issues in probabilistic models estimated by sampling methods, and discuss approaches to fast sampling.
* Understand the potential applications of large graphs to semi-supervised and unsupervised learning problems.
* Implement a general framework for developing gradient-descent optimizers for machine learning applications.
* Explain the differences between, and recognize potential applications of, randomized data structures such as locality-sensitive hashing, Bloom filters, random projections, and count-min sketches.
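
As one small example of the randomized data structures named in the last outcome, here is a minimal count-min sketch in Python. It is a sketch under assumptions, not course code: the class name, default sizes, and the use of row-salted CRC32 hashes are illustrative choices, and salted CRC32 does not give the pairwise-independent hash family that the formal error bounds assume.

```python
import zlib


class CountMinSketch:
    """Approximate counter using a fixed depth x width table of integers."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salting the item with the row number yields a different hash per row.
        data = f"{row}:{item}".encode("utf-8")
        return zlib.crc32(data) % self.width

    def add(self, item, count=1):
        """Increment one cell in every row."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        """Return the minimum over rows.

        With non-negative updates, collisions can only inflate a cell,
        so the estimate never falls below the true count.
        """
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```

Memory use is fixed by <code>width * depth</code> regardless of how many distinct items are seen. By contrast, a Bloom filter answers only set-membership queries, while this structure answers approximate frequency queries.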
== Syllabus ==

* [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018]]

While 10-405 is new, it covers similar material to 10-605. Here are some previous syllabi.

* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2017]]
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2016]]
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015]]
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2015]]
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014]]
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2013]]
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012]]

== Recitations ==

TBA

== Prerequisites  ==

An introductory course in machine learning (one of 10-401, 10-601, 10-701, or 10-715) is a prerequisite or a co-requisite. Except in very rare cases I will NOT approve substitutions: it's just not feasible for me to check equivalence of a different intro ML class, or set of classes, for a large number of students.

The course will include several substantial programming assignments, so an additional prerequisite is 15-210 or 15-214, or else familiarity with Python and Unix together with good programming skills.

== 10-605 vs 10-405 ==

10-405 will be similar in scope to 10-605 as offered in previous years. Students are graded based mainly on programming assignments, a midterm, and a final.

=== Grading Policies ===

* 60% assignments. There will be six assignments.
** For most assignments there will be a "checkpoint" deliverable partway through the assignment:
*** The goal of the checkpoint is to ensure you have started the real assignment.
*** It will usually be worth 10 points out of 100.
*** It will not be Autolab-graded, so it doesn't need to execute. But if you turn in nothing, or something that clearly doesn't come close to the checkpoint goals, you will be penalized.
* 15% for an in-class midterm exam.
* 20% for an in-class final exam.
* 5% class participation and in-class quizzes.

* You have the option of replacing one assignment with an "open-ended extension".
** Roughly, this would be an extension to an existing assignment that would increase the programming effort by at least 50%.
** Your deliverable is a handout for the extension, like a TA would produce, and a solution key, including code.
** You should send the instructors' email list a rough draft as early as you can, in which you sketch out the technical approach you will follow and mention any research or exploration that you've done to reduce technical risk.
*** ''For example,'' if you decide to implement IPM, will you use All-Reduce or not? If you use it, will you implement it or use an existing implementation, and if it's an existing implementation, have you verified that it works properly on your target cluster? What strategy will you use to ensure mappers don't reload the data from the network?
*** We will try to give you feedback promptly, but you should allocate a few days for us to look over these and approve them.
** Example: for HW1B, implement the Rocchio algorithm as well as naive Bayes and compare them.

== Policies and FAQ ==

* If you're taking 10-405 and 10-401/601/701/715 concurrently, you don't need to ask for permission - we can now check co-reqs automatically. Just sign up.
* '''Is there a textbook?''' No, but I do have written notes for some topics (e.g., [http://www.cs.cmu.edu/~wcohen/10-605/notes/scalable-nb-notes.pdf naive Bayes and streaming]) which are linked to from the wiki.
* '''I forgot to take a quiz - can I make it up?''' There are no makeups for the quizzes. Their purpose is to ensure that you follow along with the class and review the lectures promptly, and makeups would defeat that goal.
* '''Can I get an extension on ....?''' Generally no, but you can get 50% credit for up to 48 hrs after the assignment is due, and there are a fixed number of grace days. If you have a documented medical issue or something similar, email William.
* '''I'm completely stressed out about all my classes, and/or having a terrible time because... What should I do?''' Here's some common-sense advice (in this context "common-sense" means "nobody is surprised to hear this but nobody actually acts on it either"):
** ''Take care of yourself.'' Eat well, make exercise a priority, get enough sleep, and take some time to relax.
** ''Stay organized.'' Spend some time (say 60min) every week reviewing your long- and short-term goals, and then spend the rest of the time not obsessing about what's due, but working efficiently on your priorities. I personally recommend the [https://en.wikipedia.org/wiki/Getting_Things_Done GTD] approach.
** ''Get help when you need it.'' All of us benefit from support sometimes. You are not alone. As a CMU student you have access to many resources, and asking for support sooner rather than later is often helpful. If you're experiencing any academic stress, difficult life events, or feelings like anxiety or depression, Counseling and Psychological Services (CaPS) is available: call 412-268-2922 or visit their website at http://www.cmu.edu/counseling/. Or consider reaching out to a friend, faculty member, or family member you trust.
** ''Look out for each other.'' If your friends or classmates are stressed out or depressed, talk to them and remind them that support is available. Importantly, if you or someone you know is feeling suicidal or in danger of self-harm, call someone immediately, day or night. You can call CaPS (412-268-2922), the Re:solve Crisis Network (888-796-8226) or, if the situation is life threatening, the police (on campus: CMU Police: 412-268-2323, and off campus: 911).

=== Policy on Collaboration among Students  ===

The purpose of student collaboration is to facilitate learning, not to circumvent it. Studying the material in groups is strongly encouraged. It is also allowed to seek help from other students in understanding the material needed to solve a particular homework problem, provided '''no written notes''' are shared, or are taken at that time, and provided learning is facilitated, not circumvented. The actual solution must be done by each student alone, and the student should be ready to reproduce their solution upon request.

'''The presence or absence of any form of help or collaboration, whether given or received, must be explicitly stated and disclosed in full by all involved''', on the first page of their assignment. Specifically, each assignment solution must start by answering the following questions:

(1) Did you receive any help whatsoever from anyone in solving this assignment? Yes / No.
If you answered 'yes', give full details: _______________ (e.g. "Jane explained to me what is asked in Question 3.4")
(2) Did you give any help whatsoever to anyone in solving this assignment? Yes / No.
If you answered 'yes', give full details: _______________ (e.g. "I pointed Joe to section 2.3 to help him with Question 2")

Additionally, if you share any material or collaborate in any way between the time the assignment is due and the last time when the assignment can be handed in for partial credit, you must notify the instructor of this help in writing (e.g., via email).

Collaboration without full disclosure will be handled severely, in compliance with CMU's Policy on Cheating and Plagiarism. Except in unusual extenuating circumstances, my policy is to '''fail the student(s) for the entire course'''.

As a related point, many of the homework assignments used in this class may have been used in prior versions of this class, or in classes at other institutions. Avoiding the use of heavily tested assignments would detract from the main purpose of these assignments, which is to reinforce the material and stimulate thinking. Because some of these assignments may have been used before, solutions to them may be (or may have been) available online, or from other people. It is explicitly forbidden to use any such sources, or to consult people who have solved these problems before. '''You must solve the homework assignments completely on your own'''. I will mostly rely on your wisdom and honor to follow this rule, but if a violation is detected it will be dealt with harshly. Collaboration with other students who have previously taken the class is allowed, but only under the conditions stated above.

These policies are the same as those used in [http://www.cs.cmu.edu/~roni/10601/ Dr. Rosenfeld's previous version of 601], and my version of [[Machine_Learning_10-601_in_Fall_2014#Policy_on_Collaboration_among_Students|10-601 in Fall 2014]].

Latest revision as of 15:18, 4 May 2018
