Difference between revisions of "Machine Learning with Large Datasets 10-605 in Spring 2014"

From Cohen Courses
 
(30 intermediate revisions by 3 users not shown)
Line 2: Line 2:
  
 
* Instructor: [http://www.cs.cmu.edu/~wcohen William Cohen], Machine Learning Dept and LTI
 
* Instructor: [http://www.cs.cmu.edu/~wcohen William Cohen], Machine Learning Dept and LTI
* Course secretary: Sandy Winkler, sandyw@cs.cmu.edu, GHC 8219
+
* When/where: 1:30-2:50 MW '''Doherty Hall 1112''' (''not'' Hamburg Hall, as previously announced!)
* When/where: 1:30-2:50 MW Hamburg Hall 131
 
 
* Course Number: ML 10-605
 
* Course Number: ML 10-605
 
* Prerequisites:  
 
* Prerequisites:  
 
** a machine learning course (e.g., 10-701 or 10-601).  You may take this concurrently with the instructor's permission.   
 
** a machine learning course (e.g., 10-701 or 10-601).  You may take this concurrently with the instructor's permission.   
 
** Java programming skills, e.g., 15-210, or 15-214.
 
** Java programming skills, e.g., 15-210, or 15-214.
* TAs:  
+
* Course staff:  
 +
** William Cohen - office hour 4-5pm Tuesday, GHC 8217
 +
** William Wang (ww@cmu.edu) - ''TA'', office hour: Tue 10-11 am, GHC 5511.
 +
** Siddarth Varia (varias@cs.cmu.edu) - ''TA'', office hour: Thu 10-11 am, GHC 8114.
 
** Chun Chen (chunc@andrew.cmu.edu) - ''grader''
 
** Chun Chen (chunc@andrew.cmu.edu) - ''grader''
** Siddarth Varia (varias@cs.cmu.edu)
+
** Sandy Winkler (sandyw@cs.cmu.edu, GHC 8219) - ''course secretary''
** William Wang (ww@cmu.edu)
 
 
* Syllabus: [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014]]
 
* Syllabus: [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014]]
* Office hours: TBA
 
  
== Important virtual places ==
+
== Important virtual places/account information ==
  
* Autolab page: [https://autolab.cs.cmu.edu/10605-s14]
+
* Autolab page: [https://autolab.cs.cmu.edu/10605-s14], for programming assignments.
* Piazza page for class ''(to add)''
+
* Piazza page for class: [https://piazza.com/cmu/spring2014/10605/home], for questions and discussion.
 +
** Unlike the lower-level CS courses, the staff for this class is small, so do NOT expect immediate answers to questions.
 +
** For mailing the instructors questions NOT of general interest (e.g., "My girlfriend is in town, can I have an extension?") you can also use the mailing list, 10605-Instructors@cs.cmu.edu
 +
* MediaTech page for lectures: [https://mediatech-stream.andrew.cmu.edu/Mediasite/Catalog/Full/c79cdb5f7ede415ab4ed8dbd28a6e18321]
 +
** Lectures should be posted within 24 hours.
 
* AFS data repository  ''/afs/cs.cmu.edu/project/bigML''
 
* AFS data repository  ''/afs/cs.cmu.edu/project/bigML''
* For TAs/instructors only:
+
* [[Hadoop cluster information]].
** Autolab AFS dir ''/afs/cs.cmu.edu/academic/class/10605-s14/autolab''  
+
* [[Guide for Happy Hadoop Hacking]] - Malcolm's advice on effectively writing Hadoop programs.
** Planning gDoc:  https://docs.google.com/spreadsheet/ccc?key=0AqbWt5nnjNrYdFc4UlAyZmhYNGQxNFVTWWVob2pPU0E#gid=0
+
* For TAs/instructors ''only'':
 +
** Autolab AFS dir ''/afs/cs.cmu.edu/academic/class/10605-s14/autolab'' - students cannot read this
 +
** Planning gDoc:  https://docs.google.com/spreadsheet/ccc?key=0AqbWt5nnjNrYdFc4UlAyZmhYNGQxNFVTWWVob2pPU0E#gid=0 - students cannot read this
  
 
== Description ==
 
== Description ==
Line 28: Line 34:
 
Large datasets are difficult to work with for several reasons.  They are difficult to visualize, and it is difficult to understand what sort of errors and biases are present in them.  They are computationally expensive to process, and often the cost of learning is hard to predict - for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory.  Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.
 
Large datasets are difficult to work with for several reasons.  They are difficult to visualize, and it is difficult to understand what sort of errors and biases are present in them.  They are computationally expensive to process, and often the cost of learning is hard to predict - for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory.  Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.
  
This course is intended to provide a student practical knowledge of, and experience with, the issues involving large datasets.  Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements for learning methods, such as feature hashing and Bloom filters; and techniques for analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity
+
This course is intended to provide a student practical knowledge of, and experience with, the issues involving large datasets.  Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements for learning methods, such as feature hashing and Bloom filters; and techniques for analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity.
 
 
The class will include frequent programming assignments, and a one-month short project chosen by the student.  The project should be relevant to the course - e.g.,  to compare the scalability of variant learning algorithms on datasets.
 
  
 
== Syllabus ==
 
== Syllabus ==
  
* Tenatative [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014]]
+
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014]]
  
 
I'm following previous versions of the class, below:
 
I'm following previous versions of the class, below:
Line 50: Line 54:
  
 
Self-assessment for students:
 
Self-assessment for students:
* Students, especially graduate students, come to CMU with a variety of different backgrounds, so formal course prereqs are hard to establish.  There is a short  [http://www.cs.cmu.edu/~wcohen/10-601/Intro_ML_Self_Evaluation.pdf self-assessment test] to see if you have the necessary background for 10-605.  We recommend that all students take this before enrolling in 10-605 to see if they have the necessary background knowledge already, or if they need to review and/or take additional courses.
+
* Students, especially graduate students, come to CMU with a variety of different backgrounds, so formal course prereqs are hard to establish.  There is a short  [http://www.cs.cmu.edu/~wcohen/10-601/self-assessment/Intro_ML_Self_Evaluation.pdf self-assessment test] to see if you have the necessary background for 10-605.  We recommend that all students take this before enrolling in 10-605 to see if they have the necessary background knowledge already, or if they need to review and/or take additional courses.  '''Note''': this is the ''same self-assessment I used for 10-601'' last fall - you don't need to worry about it if you've had 10-601 and done well.  Also, Section 4 can be skipped.
 +
 
 +
== Policies and FAQ ==
 +
 
 +
* '''Can I take the class pass/fail? Or, can I audit?'''  My policy is to give priority to students that are taking the class for a grade, so you cannot sign up for the class pass/fail or as an audit unless the waitlist clears.
 +
* '''Can I get an extension on ....?''' No, but you can get 50% credit for up to 48 hrs after the assignment is due, and you can drop your lowest assignment grade.
 +
 
 +
=== Policy on Collaboration among Students  ===
 +
 
 +
The purpose of student collaboration is to facilitate learning, not to circumvent it. Studying the material in groups is strongly encouraged. It is also allowed to seek help from other students in understanding the material needed to solve a particular homework problem, provided '''no written notes''' are shared, or are taken at that time, and provided learning is facilitated, not circumvented. The actual solution must be done by each student alone, and the student should be ready to reproduce their solution upon request.
 +
 
 +
'''The presence or absence of any form of help or collaboration, whether given or received, must be explicitly stated and disclosed in full by all involved''', on the first page of their assignment. Specifically, each assignment solution must start by answering the following questions:
 +
 
 +
(1) Did you receive any help whatsoever from anyone in solving this assignment? Yes / No.
 +
If you answered 'yes', give full details: _______________ (e.g. "Jane explained to me what is asked in Question 3.4")
 +
(2) Did you give any help whatsoever to anyone in solving this assignment? Yes / No.
 +
If you answered 'yes', give full details: _______________ (e.g. "I pointed Joe to section 2.3 to help him with Question 2")
 +
 
 +
Collaboration without full disclosure will be handled severely, in compliance with CMU's Policy on Cheating and Plagiarism.  Except in unusual extenuating circumstances, my policy is to '''fail the student(s) for the entire course'''.
 +
 
 +
As a related point, many of the homework assignments used in this class may have been used in prior versions of this class, or in classes at other institutions.  Avoiding the use of heavily tested assignments will detract from the main purpose of these assignments, which is to reinforce the material and stimulate thinking.  Because some of these assignments may have been used before, solutions to them may be (or may have been) available online, or from other people.  It is explicitly forbidden to use any such sources, or to consult people who have solved these problems before.  '''You must solve the homework assignments completely on your own'''. I will mostly rely on your wisdom and honor to follow this rule, but if a violation is detected it will be dealt with harshly.  Collaboration with other students who have previously taken the class is allowed, but only under the conditions stated above.
 +
 
 +
These policies are the same as were used in [http://www.cs.cmu.edu/~roni/10601/ Dr. Rosenfeld's previous version of 601], and my version of [[Machine_Learning_10-601_in_Fall_2013#Policy_on_Collaboration_among_Students|10-601 in fall 2013]].

Latest revision as of 23:02, 18 February 2014

Instructor and Venue

  • Instructor: William Cohen, Machine Learning Dept and LTI
  • When/where: 1:30-2:50 MW Doherty Hall 1112 (not Hamburg Hall, as previously announced!)
  • Course Number: ML 10-605
  • Prerequisites:
    • a machine learning course (e.g., 10-701 or 10-601). You may take this concurrently with the instructor's permission.
    • Java programming skills, e.g., 15-210, or 15-214.
  • Course staff:
    • William Cohen - office hour 4-5pm Tuesday, GHC 8217
    • William Wang (ww@cmu.edu) - TA, office hour: Tue 10-11 am, GHC 5511.
    • Siddarth Varia (varias@cs.cmu.edu) - TA, office hour: Thu 10-11 am, GHC 8114.
    • Chun Chen (chunc@andrew.cmu.edu) - grader
    • Sandy Winkler (sandyw@cs.cmu.edu, GHC 8219) - course secretary
  • Syllabus: Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014

Important virtual places/account information

  • Autolab page: https://autolab.cs.cmu.edu/10605-s14, for programming assignments.
  • Piazza page for class: https://piazza.com/cmu/spring2014/10605/home, for questions and discussion.
    • Unlike the lower-level CS courses, the staff for this class is small, so do NOT expect immediate answers to questions.
    • For mailing the instructors questions NOT of general interest (e.g., "My girlfriend is in town, can I have an extension?") you can also use the mailing list, 10605-Instructors@cs.cmu.edu
  • MediaTech page for lectures: https://mediatech-stream.andrew.cmu.edu/Mediasite/Catalog/Full/c79cdb5f7ede415ab4ed8dbd28a6e18321
    • Lectures should be posted within 24 hours.
  • AFS data repository /afs/cs.cmu.edu/project/bigML
  • Hadoop cluster information.
  • Guide for Happy Hadoop Hacking - Malcolm's advice on effectively writing Hadoop programs.
  • For TAs/instructors only:

Description

Large datasets are difficult to work with for several reasons. They are difficult to visualize, and it is difficult to understand what sort of errors and biases are present in them. They are computationally expensive to process, and often the cost of learning is hard to predict - for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.

This course is intended to provide a student practical knowledge of, and experience with, the issues involving large datasets. Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements for learning methods, such as feature hashing and Bloom filters; and techniques for analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity.
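
As a rough illustration of one of the memory-reduction techniques named above, here is a minimal sketch of feature hashing. It is not taken from the course materials: the function names (bucket, hash_features), the choice of MD5 as the hash, and the 16-bucket table size are all assumptions made for the example, and Python is used only for brevity (the course itself assumes Java).

```python
import hashlib


def bucket(token: str, num_buckets: int) -> int:
    """Map a token to a bucket index with a hash that is stable across runs.

    MD5 is an illustrative choice; any well-mixed, deterministic hash works.
    """
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hash_features(tokens, num_buckets=16):
    """Return a fixed-length count vector.

    Memory use depends on num_buckets, not on vocabulary size; tokens that
    collide simply share a bucket.
    """
    vec = [0] * num_buckets
    for tok in tokens:
        vec[bucket(tok, num_buckets)] += 1
    return vec


# Hypothetical toy document, 8 tokens long.
doc = "large datasets need scalable learning for large datasets".split()
vec = hash_features(doc, num_buckets=16)
print(len(vec), sum(vec))  # prints "16 8"
```

The vector length stays at 16 no matter how many distinct tokens appear, which is the point of the technique: the vocabulary never has to be stored. The price is that unrelated tokens may collide in one bucket.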

Syllabus

I'm following previous versions of the class, below:

One difference: since this semester's class is so large, we will not have a course project component.

Prerequisites

An introductory course in machine learning, like 10-601 or 10-701, is a prerequisite or a co-requisite. If you plan to take this course and 10-601 concurrently please tell the instructor.

The course will include several substantial programming assignments, so an additional prerequisite is 15-210, or 15-214, or comparable familiarity with Java and good programming skills.

Self-assessment for students:

  • Students, especially graduate students, come to CMU with a variety of different backgrounds, so formal course prereqs are hard to establish. There is a short self-assessment test to see if you have the necessary background for 10-605. We recommend that all students take this before enrolling in 10-605 to see if they have the necessary background knowledge already, or if they need to review and/or take additional courses. Note: this is the same self-assessment I used for 10-601 last fall - you don't need to worry about it if you've had 10-601 and done well. Also, Section 4 can be skipped.

Policies and FAQ

  • Can I take the class pass/fail? Or, can I audit? My policy is to give priority to students that are taking the class for a grade, so you cannot sign up for the class pass/fail or as an audit unless the waitlist clears.
  • Can I get an extension on ....? No, but you can get 50% credit for up to 48 hrs after the assignment is due, and you can drop your lowest assignment grade.

Policy on Collaboration among Students

The purpose of student collaboration is to facilitate learning, not to circumvent it. Studying the material in groups is strongly encouraged. It is also allowed to seek help from other students in understanding the material needed to solve a particular homework problem, provided no written notes are shared, or are taken at that time, and provided learning is facilitated, not circumvented. The actual solution must be done by each student alone, and the student should be ready to reproduce their solution upon request.

The presence or absence of any form of help or collaboration, whether given or received, must be explicitly stated and disclosed in full by all involved, on the first page of their assignment. Specifically, each assignment solution must start by answering the following questions:

(1) Did you receive any help whatsoever from anyone in solving this assignment? Yes / No.
If you answered 'yes', give full details: _______________ (e.g. "Jane explained to me what is asked in Question 3.4")
(2) Did you give any help whatsoever to anyone in solving this assignment? Yes / No.
If you answered 'yes', give full details: _______________ (e.g. "I pointed Joe to section 2.3 to help him with Question 2")

Collaboration without full disclosure will be handled severely, in compliance with CMU's Policy on Cheating and Plagiarism. Except in unusual extenuating circumstances, my policy is to fail the student(s) for the entire course.

As a related point, many of the homework assignments used in this class may have been used in prior versions of this class, or in classes at other institutions. Avoiding the use of heavily tested assignments will detract from the main purpose of these assignments, which is to reinforce the material and stimulate thinking. Because some of these assignments may have been used before, solutions to them may be (or may have been) available online, or from other people. It is explicitly forbidden to use any such sources, or to consult people who have solved these problems before. You must solve the homework assignments completely on your own. I will mostly rely on your wisdom and honor to follow this rule, but if a violation is detected it will be dealt with harshly. Collaboration with other students who have previously taken the class is allowed, but only under the conditions stated above.

These policies are the same as were used in Dr. Rosenfeld's previous version of 601, and my version of 10-601 in fall 2013.