http://curtis.ml.cmu.edu/w/courses/api.php?action=feedcontributions&user=Wcohen&feedformat=atomCohen Courses - User contributions [en]2018-10-15T18:16:30ZUser contributionsMediaWiki 1.21.1http://curtis.ml.cmu.edu/w/courses/index.php/Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018Machine Learning with Large Datasets 10-405 in Spring 20182018-05-04T19:18:51Z<p>Wcohen: /* Instructor and Venue */</p>
<hr />
<div>== Instructor and Venue ==<br />
<br />
* Instructor: [http://www.cs.cmu.edu/~wcohen William Cohen], Machine Learning Dept and LTI<br />
* When/where: GHC 4307, 3-4:30pm, Mondays and Wednesdays<br />
* Course Number: ML 10-405<br />
* Prerequisites: <br />
** a CMU intro machine learning course (e.g., 10-701, 10-715, 10-601, 10-401). <br />
*** You may take the intro ML course (10-401/601/701) concurrently with 10-405. <br />
** Good programming skills, e.g., 15-210, or 15-214 or equivalent.<br />
* Course staff: <br />
** William Cohen<br />
** TAs:<br />
*** Vivek Shankar vshanka1@andrew.cmu.edu<br />
*** Nitish Kumar Kulkarni nitishkk@andrew.cmu.edu<br />
*** Vidhan Agarwal vidhana@andrew.cmu.edu<br />
*** Sarthak Garg sarthakg@andrew.cmu.edu<br />
<br />
** Dorothy Holland-Minkley (dfh@cs.cmu.edu, GHC 8001) - ''course admin''<br />
* Syllabus: [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018]]<br />
<br />
== Office hours ==<br />
<br />
* William: Monday 11am-12, 8217 Gates Hillman<br />
* Vidhan: Tuesday 1:00pm-2:00pm, GHC 6th floor commons (opposite 6105)<br />
* Sarthak: Wednesday 4:30pm-5:30pm, GHC 6th floor commons (opposite 6105)<br />
* Vivek: Thursday 4:30pm-5:30pm, GHC 5th floor commons<br />
* Nitish: Friday 3:00pm-4:00pm, GHC 6th floor commons (opposite 6105)<br />
<br />
== Important virtual places/account information ==<br />
<br />
''Mostly these are TBA''<br />
<br />
* [https://autolab.andrew.cmu.edu/courses/10405-s18/assessments Autolab page]: for assignments<br />
* [http://piazza.com/cmu/spring2018/10405/home Piazza page for class]: home for questions and discussion. You should [http://piazza.com/cmu/spring2018/10405 sign up] for the class with your andrew email.<br />
** For mailing the instructors questions NOT of general interest (e.g., "My girlfriend is in town, can I have an extension?"), use the mailing list, 10405-Instructors@cs.cmu.edu<br />
* Recorded lectures: Lectures are not recorded for 10-405. However, many of the lectures overlap with 10-605, and recordings of those lectures are still available.<br />
** Previous lectures from 10-605: [https://scs.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx#folderID=7ae48b11-88af-4f02-b179-28f65ea796a4 Fall 2017]<br />
* AFS data repository ''/afs/cs.cmu.edu/project/bigML''<br />
* [https://wiki.pdl.cmu.edu/Stoat Hadoop cluster information] - students will receive an account on the OpenCloud cluster.<br />
** The location of the streaming JAR for this cluster is: /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar<br />
* [[Amazon Elastic MapReduce information]]<br />
* [[Guide for Happy Hadoop Hacking]] - Malcolm's advice on effectively writing Hadoop programs.<br />
* For TAs/instructors ''only'':<br />
** Configuring Hadoop jobs on Autolab [http://curtis.ml.cmu.edu/w/courses/index.php/Guide_for_Configuring_Autolab]<br />
** Autolab AFS dir ''/afs/cs.cmu.edu/academic/class/10405-s18/'' - students cannot read this. TAs, you may need to use 'aklog cs.cmu.edu' before accessing it.<br />
** [https://docs.google.com/document/d/13uGcV-Alqe0Y_WWqRrdGs7VTn8CeZSvxrG9GK4nx5BU/edit#heading=h.my65mtwmsx4h Planning gDoc]: - students cannot read this<br />
<br />
== Description ==<br />
<br />
Large datasets are difficult to work with for several reasons. They are difficult to visualize, and it is difficult to understand what sort of errors and biases are present in them. They are computationally expensive to process, and often the cost of learning is hard to predict - for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.<br />
<br />
This course is intended to provide students with practical knowledge of, and experience with, the issues involving large datasets. Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements for learning methods, such as feature hashing and Bloom filters; and techniques for analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity.<br />
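<br />
As a rough illustration of how feature hashing reduces memory requirements, the sketch below maps an arbitrary vocabulary onto a fixed number of buckets. The function name <code>hash_features</code>, the bucket count, and the choice of <code>zlib.crc32</code> as the hash are assumptions made for this example, not part of the course materials.<br />
<br />
```python
import zlib


def hash_features(tokens, num_buckets=16):
    """Map a bag of tokens to a fixed-length count vector (the hashing trick).

    Hypothetical helper for illustration: memory use depends only on
    num_buckets, not on the number of distinct tokens seen.
    """
    vec = [0] * num_buckets
    for tok in tokens:
        # zlib.crc32 is deterministic across runs, unlike Python's built-in
        # hash() on strings, which is salted per process by default.
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] += 1
    return vec
```

Distinct tokens may collide in the same bucket; the trade-off is a small loss of accuracy in exchange for never having to store a feature dictionary.<br />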
<br />
=== Course Outcomes ===<br />
<br />
In summary, students who successfully complete the course should be able to:<br />
<br />
* Implement non-trivial algorithms that make effective use of map-reduce infrastructure, like Hadoop, to work with text corpora and large graphs. <br />
* Analyze the time and communication complexity of map-reduce algorithms.<br />
* Discuss coherently the differences between different dataflow languages.<br />
* Implement algorithms using dataflow languages.<br />
* Implement streaming learning algorithms that make use of sparsity to efficiently process examples from a very large feature set.<br />
* Understand and explain the algorithms used for automatic differentiation in deep-learning platforms such as Theano, Tensorflow, Torch, and PyTorch.<br />
* Implement learning algorithms that make use of parameter servers.<br />
* Explain the scalability issues in probabilistic models estimated by sampling methods, and discuss approaches to fast sampling.<br />
* Understand the potential applications of large graphs to semi-supervised and unsupervised learning problems.<br />
* Implement a general framework for developing gradient-descent optimizers for machine learning applications.<br />
* Explain the differences between, and recognize potential applications of, randomized data structures such as locality sensitive hashing, Bloom filters, random projections, and count-min sketches.<br />
<br />
== Syllabus ==<br />
<br />
* [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018]]<br />
<br />
While 10-405 is new, it covers similar material to 10-605. Here are some previous syllabi.<br />
<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2017]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2016]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2015]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2013]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012]]<br />
<br />
== Recitations ==<br />
<br />
TBA<br />
<br />
== Prerequisites ==<br />
<br />
An introductory course in machine learning (one of 10-401, 10-601, 10-701, or 10-715) is a prerequisite or a co-requisite. Except in very rare cases I will NOT approve substitutions---it's just not feasible for me to check equivalence of a different intro ML class, or set of classes, for a large number of students.<br />
<br />
The course will include several substantial programming assignments, so an additional prerequisite is 15-210, or 15-214, or familiarity with Python, Unix, and good programming skills.<br />
<br />
== 10-605 vs 10-405 == <br />
<br />
10-405 will be similar in scope to 10-605 as offered in previous years. Students are graded based mainly on programming assignments, a midterm, and a final. <br />
<br />
=== Grading Policies ===<br />
<br />
* 60% assignments. There will be six assignments.<br />
** For most assignments there will be a "checkpoint" deliverable partway through the assignment:<br />
*** The goal of the checkpoint is to ensure you have started the real assignment. <br />
*** It will usually be worth 10 points out of 100.<br />
*** It will not be autolab-graded, so it doesn't need to execute. But if you turn in nothing, or something that clearly doesn't come close to the checkpoint goals, you will be penalized.<br />
* 15% for an in-class midterm exam.<br />
* 20% for an in-class final exam.<br />
* 5% class participation and in-class quizzes.<br />
<br />
* You have the option of replacing one assignment with an "open-ended extension"<br />
** Roughly, this would be an extension to an existing assignment that would increase the programming effort by at least 50%.<br />
** Your deliverable is a handout for the extension, like a TA would produce, and a solution key, including code.<br />
** You should send the instructors email list a rough draft as early as you can, where you sketch out the technical approach you will follow, and mention any research or exploration that you've done to reduce technical risk. <br />
*** ''For example,'' if you decide to implement IPM, will you use All-Reduce or not? If you use it, will you implement it or use an existing implementation, and if it's an existing implementation, have you verified that it works properly on your target cluster? What's the strategy that you will use to ensure mappers don't reload the data from network?<br />
*** We will try and give you feedback promptly, but you should allocate a few days for us to look over these and approve them.<br />
** Example: for HW1B, implement the Rocchio algorithm as well as naive Bayes and compare them.<br />
<br />
== Policies and FAQ ==<br />
<br />
* If you're taking 10-405 and 10-401/601/701/715 concurrently, you don't need to ask for permission - we can now check co-reqs automatically. Just sign up.<br />
* '''Is there a textbook?''' No: but I do have written notes for some topics (e.g., [http://www.cs.cmu.edu/~wcohen/10-605/notes/scalable-nb-notes.pdf naive bayes and streaming]) which are linked to from the wiki.<br />
* '''I forgot to take a quiz - can I make it up?''' There are no makeups for the quizzes. Their purpose is to ensure that you follow along with the class and review the lectures promptly, and makeups would defeat that goal.<br />
* '''Can I get an extension on ....?''' Generally no, but you can get 50% credit for up to 48 hrs after the assignment is due, and there are a fixed number of grace days. If you have a documented medical issue or something similar, email William.<br />
* '''I'm completely stressed out about all my classes, and/or having a terrible time because... What should I do?''' Here's some common-sense advice (in this context "common-sense" means "nobody is surprised to hear this but nobody actually acts on it either"):<br />
** ''Take care of yourself.'' Eat well, make exercise a priority, get enough sleep, and take some time to relax. <br />
** ''Stay organized.'' Spend some time (say 60min) every week reviewing your long and short term goals, and then spend the rest of the time not obsessing about what's due, but working efficiently on your priorities. I personally recommend the [https://en.wikipedia.org/wiki/Getting_Things_Done gtd] approach.<br />
** ''Get help when you need it.'' All of us benefit from support sometimes. You are not alone. As a CMU student you have access to many resources, and asking for support sooner rather than later is often helpful. If you're experiencing any academic stress, difficult life events, or feelings like anxiety or depression, Counseling and Psychological Services (CaPS) is available: call 412-268-2922 or visit their website at http://www.cmu.edu/counseling/. Or consider reaching out to a friend, faculty or family member you trust.<br />
** ''Look out for each other.'' If your friends or classmates are stressed out or depressed, talk to them and remind them that support is available. Importantly, if you or someone you know is feeling suicidal or in danger of self-harm, call someone immediately, day or night. You can call CaPS (412-268-2922), the Re:solve Crisis Network (888-796-8226) or, if the situation is life threatening, the police (on campus: CMU Police: 412-268-2323, and off campus: 911).<br />
<br />
=== Policy on Collaboration among Students ===<br />
<br />
The purpose of student collaboration is to facilitate learning, not to circumvent it. Studying the material in groups is strongly encouraged. It is also allowed to seek help from other students in understanding the material needed to solve a particular homework problem, provided '''no written notes''' are shared, or are taken at that time, and provided learning is facilitated, not circumvented. The actual solution must be done by each student alone, and the student should be ready to reproduce their solution upon request.<br />
<br />
'''The presence or absence of any form of help or collaboration, whether given or received, must be explicitly stated and disclosed in full by all involved''', on the first page of their assignment. Specifically, each assignment solution must start by answering the following questions:<br />
<br />
(1) Did you receive any help whatsoever from anyone in solving this assignment? Yes / No.<br />
If you answered 'yes', give full details: _______________ (e.g. "Jane explained to me what is asked in Question 3.4")<br />
(2) Did you give any help whatsoever to anyone in solving this assignment? Yes / No.<br />
If you answered 'yes', give full details: _______________ (e.g. "I pointed Joe to section 2.3 to help him with Question 2")<br />
<br />
Additionally, if you share any material or collaborate in any way between the time the assignment is due and the last time when the assignment can be handed in for partial credit, you must notify the instructor of this help in writing (e.g., via email).<br />
<br />
Collaboration without full disclosure will be handled severely, in compliance with CMU's Policy on Cheating and Plagiarism. Except in unusual extenuating circumstances, my policy is to '''fail the student(s) for the entire course'''.<br />
<br />
As a related point, many of the homework assignments used in this class may have been used in prior versions of this class, or in classes at other institutions. Avoiding the use of heavily tested assignments will detract from the main purpose of these assignments, which is to reinforce the material and stimulate thinking. Because some of these assignments may have been used before, solutions to them may be (or may have been) available online, or from other people. It is explicitly forbidden to use any such sources, or to consult people who have solved these problems before. '''You must solve the homework assignments completely on your own'''. I will mostly rely on your wisdom and honor to follow this rule, but if a violation is detected it will be dealt with harshly. Collaboration with other students who have previously taken the class is allowed, but only under the conditions stated above.<br />
<br />
These policies are the same as were used in [http://www.cs.cmu.edu/~roni/10601/ Dr. Rosenfeld's previous version of 601], and my version of [[Machine_Learning_10-601_in_Fall_2014#Policy_on_Collaboration_among_Students|10-601 in fall 2014]].</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Review_session_for_finalClass meeting for 10-405 Review session for final2018-05-04T18:49:43Z<p>Wcohen: /* Information on the final */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Information on the final ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-405/final-review.pptx Slides in PPT], [http://www.cs.cmu.edu/~wcohen/10-405/final-review.pdf Slides in PDF]<br />
<br />
For general tips, you should look over my slides from the midterm review session. The exam is closed book but you can take in '''two''' sheets of 8.5x11" or A4 paper (front and back). The exam is 80 minutes long, at the usual class time and location.<br />
<br />
Practice questions from 10-605 (some questions removed, so these don't exactly reflect the length of the exam):<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/s2014-final.pdf practice questions for final, 2014].<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/s2015-final.pdf practice questions for final, 2015].<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2016-final.pdf practice questions for final, 2016].<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2017-final.pdf practice questions for final, 2017] (answer key).<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/s2018-final.pdf practice questions for final, 2018] (answer key).<br />
<br />
The final is cumulative (but about 80% will be from after the midterm) so some of the questions on the midterms are also relevant:<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2015-midterm.pdf practice questions for midterm from 2015]. <br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2016-midterm.pdf practice questions for midterm from 2016].<br />
<br />
It's also good to review the quizzes, and the review points on the web pages for the lectures.</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Syllabus_for_Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018Syllabus for Machine Learning with Large Datasets 10-405 in Spring 20182018-04-30T17:46:15Z<p>Wcohen: /* Ideas for open-ended extensions to the HW assignments */</p>
<hr />
<div>This is the syllabus for [[Machine Learning with Large Datasets 10-405 in Spring 2018]]. <br />
<br />
== Ideas for open-ended extensions to the HW assignments ==<br />
<br />
This is not a complete list! You can use any of these as a starting point, but feel free to think up your own extensions.<br />
Any open-ended extensions must be submitted no later than '''midnight May 6''' to be considered for grading.<br />
<br />
HW2 (NB in GuineaPig):<br />
<br />
* The assignment proposes one particular scheme for parallelizing the training/testing algorithm. Consider another parallelization algorithm.<br />
* Implement a similarly scalable Rocchio algorithm and compare it with NB.<br />
* Reimplement the same algorithm in Spark (or some other dataflow language) and compare.<br />
* One of the extensions to GuineaPig not discussed in class is an in-memory map-reduce system. Design an experiment that makes use of this constructively.<br />
<br />
HW3 (Logistic regression and SGD)<br />
* Evaluate the hash trick for Naive Bayes systematically on a series of datasets.<br />
* Implement a parameter-mixing version of logistic regression and evaluate it.<br />
* A [https://www.aclweb.org/anthology/P12-2018 recent paper] proposes (roughly) using SVM with NB-transformed features. Implement this and compare.<br />
* The personalization method described in class is based on [https://www.umiacs.umd.edu/~hal/docs/daume07easyadapt.pdf a transfer learning method] which works similarly. Many Wikipedia pages are available in multiple languages, and words in related languages tend to be lexically similar (e.g., "astrónomo" is Spanish for "astronomer"). Suppose features were character n-grams (e.g., "astr", "stro", "tron", ...) - does domain transfer work for the task of classifying Wikipedia pages? Construct a dataset and experiment to test this hypothesis.<br />
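<br />
To make the character n-gram idea in the last item concrete, here is a minimal sketch of extracting 4-grams and checking which ones two cognate words share. The helper names (<code>strip_accents</code>, <code>char_ngrams</code>, <code>shared_ngrams</code>) and the decision to strip accents and lowercase before extraction are assumptions for illustration; the assignment does not prescribe any particular preprocessing.<br />
<br />
```python
import unicodedata


def strip_accents(text):
    """Remove combining marks so that e.g. 'ó' is treated as 'o' (an assumption)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def char_ngrams(word, n=4):
    """Return the list of overlapping character n-grams of a word."""
    word = strip_accents(word.lower())
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shared_ngrams(a, b, n=4):
    """Return the sorted n-grams that two words have in common."""
    return sorted(set(char_ngrams(a, n)) & set(char_ngrams(b, n)))
```

Calling <code>shared_ngrams("astrónomo", "astronomer")</code> lists the features that a classifier trained on one language could reuse on the other, which is the overlap the transfer hypothesis depends on.<br />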
<br />
HW4/5 (Autodiff)<br />
* Find a recent paper which includes data and an implementation for a non-trivial neural model in a standard framework (e.g., PyTorch or TensorFlow), and port the model to your autodiff framework.<br />
* On a machine with multiple CPUs, use the <code>multiprocessing</code> and <code>multiprocessing.pool</code> framework to parallelize gradient computation on CPUs. The architecture you build should stream through the data, and construct multiple tasks which require a worker to perform a minibatch update, then broadcast the parameter updates to another worker that will accumulate them. (So this system would be doing delayed SGD on minibatches). Perform some experiments to see how performance (time and accuracy!) changes as you increase the number of processors. What would be the advantages of this sort of architecture over a GPU-based one?<br />
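<br />
The delayed-SGD architecture in the last item could be prototyped along the following lines. This is only a sketch under stated assumptions: it fits a toy one-dimensional linear model, the function names are hypothetical, and it uses <code>multiprocessing.pool.ThreadPool</code> so that it runs in any environment. Real CPU parallelism would need <code>multiprocessing.Pool</code> instead, with the worker function defined at module level and the driver under an <code>if __name__ == "__main__":</code> guard.<br />
<br />
```python
from multiprocessing.pool import ThreadPool


def minibatch_gradient(task):
    """Worker: gradient of mean squared error for y ~ w*x + b on one minibatch."""
    (w, b), batch = task
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


def delayed_sgd(data, batch_size=4, workers=2, lr=0.2, epochs=500):
    """Accumulator: apply gradients that workers computed on stale parameters."""
    w, b = 0.0, 0.0
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    with ThreadPool(workers) as pool:
        for _ in range(epochs):
            for start in range(0, len(batches), workers):
                group = batches[start:start + workers]
                # Every task in the group sees the same (soon stale) parameters.
                tasks = [((w, b), batch) for batch in group]
                for grad_w, grad_b in pool.map(minibatch_gradient, tasks):
                    # Updates are applied after the fact: this is the "delay".
                    w -= lr * grad_w
                    b -= lr * grad_b
    return w, b
```

Timing this loop for increasing values of <code>workers</code>, and tracking the final loss, is one way to start the time-versus-accuracy experiment the item asks for.<br />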
<br />
HW6 (SSL):<br />
* Implement the optimization for modified adsorption (MAD) and compare<br />
* Implement the sketch-based approach for SSL described in the paper below and compare: Partha Pratim Talukdar and William W. Cohen (2014): Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch in AI-Stats 2014.<br />
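<br />
For orientation, the count-min sketch data structure that the sketch-based approach builds on can be written in a few lines. This is a generic illustration, not the method from the paper: the class name, the default width and depth, and the use of row-salted <code>zlib.crc32</code> as a stand-in for independent hash functions are all assumptions.<br />
<br />
```python
import zlib


class CountMinSketch:
    """Approximate counter with fixed memory: depth rows of width counters each."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salting with the row number gives a different hash per row; a real
        # implementation would use a pairwise-independent hash family.
        return zlib.crc32(f"{row}:{item}".encode("utf-8")) % self.width

    def add(self, item, count=1):
        """Add a non-negative count for item to one counter in every row."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        """Return the minimum over rows, which never underestimates the true count."""
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

Because collisions can only inflate a counter, the estimate is an upper bound on the true count; taking the minimum over rows keeps the overestimate small.<br />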
<br />
=== Notes ===<br />
<br />
* Homeworks, unless otherwise posted, will be due when the next HW comes out.<br />
* Lecture notes and/or slides will be (re)posted around the time of the lectures.<br />
<br />
=== Schedule ===<br />
<br />
* Wed Jan 17, 2018 [[Class meeting for 10-405 Overview|Overview]]. Grading policies etc., History of Big Data, Complexity theory and cost of important operations<br />
* Mon Jan 22, 2018 [[Class meeting for 10-405 Probability Review|Probability Review]]. Counting for big data and density estimation, streaming Naive Bayes, Rocchio and TFIDF<br />
** '''Start work on''' Assignment 1a: Streaming NB; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1a.pdf <br />
* Wed Jan 24, 2018 [[Class meeting for 10-405 Streaming Naive Bayes|Streaming Naive Bayes]]. Notes on scalable naive bayes, Alternatives to stream and sort, Local counting in stream and sort, Stream and sort examples<br />
* Mon Jan 29, 2018 [[Class meeting for 10-405 Hadoop Overview|Hadoop Overview]]. Intro to Hadoop, Hadoop Streaming, Debugging Hadoop, Combiners<br />
** '''Start work on''' Assignment 1b: Streaming NB on Hadoop; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1b.pdf <br />
* Wed Jan 31, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 1]]. Scalable classification, Abstracts for map-reduce algorithms, Joins in Hadoop<br />
* Mon Feb 5, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 2]]. Guinea Pig intro, Similarity joins, Similarity joins with TFIDF, Parallel simjoins<br />
** '''Start work on''' Assignment 2a: Naive bayes training in Guinea Pig; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2a.pdf <br />
* Wed Feb 7, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 3]]. PageRank, PageRank in Pig and Guinea Pig, K-means in Pig, Spark, Systems built on top of Hadoop<br />
* Mon Feb 12, 2018 [[Class meeting for 10-405 SGD and Hash Kernels|SGD and Hash Kernels]]. Learning as optimization, Logistic regression with SGD, Regularized SGD, Efficient regularized SGD, Hash kernels for logistic regression<br />
** '''Start work on''' Assignment 2b: Naive bayes testing in Guinea Pig. http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2b.pdf <br />
* Wed Feb 14, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 1]]. The "delta trick", Averaged perceptrons, Debugging ML algorithms<br />
* Mon Feb 19, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 2]]. Hash kernels, Ranking perceptrons, Structured perceptrons<br />
** '''Start work on''' Assignment 3: scalable SGD; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw3.pdf<br />
* Wed Feb 21, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 3]]. Iterative parameter mixing paper, Parallel SGD via Param Mixing<br />
* Mon Feb 26, 2018 [[Class meeting for 10-405 SGD for MF|SGD for MF]]. Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD<br />
* Wed Feb 28, 2018 [[Class meeting for 10-405 Guest lecture - tentative|Guest lecture]] - [http://www.cs.cmu.edu/~kijungs/ Kijung Shin]<br />
* Mon Mar 5, 2018 [[Class meeting for 10-405 Midterm review and catchup|Midterm review and catchup]]. Midterm review<br />
** '''Previous assignment due'''<br />
* Wed Mar 7, 2018 [[Class meeting for 10-405 Midterm|Midterm]]. <br />
* Mon Mar 19, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 1]]. BackProp following Nielsen, Deep learning and GPUs, Reverse-mode differentiation (autodiff)<br />
* Wed Mar 21, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 2]]. Reverse-mode differentiation (autodiff), Some systems using autodiff, Details on Wengert lists, Breakdown of xman.py, Inputs, parameters, updates<br />
** '''Start work on''' Assignment 4: Autodiff with IPM part 1/2; Draft at http://www.cs.cmu.edu/~wcohen/10-605/assignments/2017-fall/hw-4+5-autodiff/hw4/main.pdf<br />
* Mon Mar 26, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 3]]. Expressiveness of MLPs, Exploding and vanishing gradients, Modern deep learning models, Word2vec and GloVe, Recursive ANNs, Convolutional ANNs, Architectures using RNNs<br />
* Wed Mar 28, 2018 [[Class meeting for 10-405 Computing with GPUs|Computing with GPUs]]. Introduction to GPUs, CUDA<br />
* Fri Mar 30, 2018<br />
** '''Start work on''' Assignment 5: Autodiff with IPM part 2/2<br />
* Mon Apr 2, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 1]]. Bloom filters, The countmin sketch, CM Sketches in Deep Learning<br />
** '''HW 4 is due'''<br />
* Wed Apr 4, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 2]]. Review of Bloom filters, Locality sensitive hashing, Online LSH<br />
* Mon Apr 9, 2018 [[Class meeting for 10-405 Graph Architectures for ML|Graph Architectures for ML]]. Graph-based ML architectures, Pregel, Signal-collect, GraphLab, PowerGraph, GraphChi, GraphX<br />
* Wed Apr 11, 2018 [[Class meeting for 10-405 SSL on Graphs|SSL on Graphs]]. Semi-supervised learning intro, Multirank-walk SSL method, Harmonic fields, Modified Adsorption SSL method, MAD with countmin sketches<br />
** '''Start work on''' Assignment 6: SSL on a graph in Spark maybe using NELL data?<br />
* Mon Apr 16, 2018 [[Class meeting for 10-405 LDA|LDA 1]]. DGMs for naive Bayes, Gibbs sampling for LDA<br />
* Wed Apr 18, 2018 [[Class meeting for 10-405 LDA|LDA 2]]. Parallelizing LDA, Fast sampling for LDA, DGMs for graphs<br />
* Mon Apr 23, 2018 [[Class meeting for 10-405 Unsupervised Learning On Graphs|Unsupervised Learning On Graphs]]. Spectral clustering, Power iteration clustering, Label propagation for clustering non-graph data, Label propagation for SSL on non-graph data<br />
* Wed Apr 25, 2018 [[Class meeting for 10-405 Parameter Servers|Parameter Servers]]. Parameter servers, PS vs Hadoop, Stale Synchronous Parallel (SSP) model, Managed Communication in PS, LDA Sampler with PS<br />
** '''Last assignment due'''<br />
* Mon Apr 30, 2018 [[Class meeting for 10-405 Review session for final|Review session for final]]. <br />
* Wed May 2, 2018 [[Class meeting for 10-405 Final Exam|Final Exam]].</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Review_session_for_finalClass meeting for 10-405 Review session for final2018-04-27T21:30:00Z<p>Wcohen: /* Information on the final */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Information on the final ===<br />
<br />
For general tips, you should look over my slides from the midterm review session. The exam is closed book, but you can take in '''two''' sheets of 8.5x11" or A4 paper (front and back). The exam is 80 minutes long, at the usual class time and location.<br />
<br />
Practice questions from 10-605 (some questions removed, so these don't exactly reflect the length of the exam):<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/s2014-final.pdf practice questions for final, 2014].<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/s2015-final.pdf practice questions for final, 2015].<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2016-final.pdf practice questions for final, 2016].<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2017-final.pdf practice questions for final, 2017] (answer key).<br />
<br />
The final is cumulative (but about 80% will be from after the midterm) so some of the questions on the midterms are also relevant:<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2015-midterm.pdf practice questions for midterm from 2015]. <br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2016-midterm.pdf practice questions for midterm from 2016].<br />
<br />
It's also good to review the quizzes, and the review points on the web pages for the lectures.</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Review_session_for_finalClass meeting for 10-405 Review session for final2018-04-25T20:44:18Z<p>Wcohen: Created page with "This is one of the class meetings on the schedule for the course Machine Learning with Large Data..."</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Information on the final ===<br />
<br />
For general tips, you should look over my slides from the midterm review session. The exam is closed book, but you can take in '''two''' sheets of 8.5x11" or A4 paper (front and back). The exam is 80 minutes long, at the usual class time and location.<br />
<br />
Practice questions (some questions removed, so these don't exactly reflect the length of the exam):<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/s2014-final.pdf practice questions for final, 2014].<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/s2015-final.pdf practice questions for final, 2015].<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2016-final.pdf practice questions for final, 2016].<br />
<br />
The final is cumulative (but about 80% will be from after the midterm) so some of the questions on the midterms are also relevant:<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2015-midterm.pdf practice questions for midterm from 2015]. <br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2016-midterm.pdf practice questions for midterm from 2016].<br />
<br />
It's also good to review the quizzes, and the review points on the web pages for the lectures.</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Unsupervised_Learning_On_GraphsClass meeting for 10-405 Unsupervised Learning On Graphs2018-04-23T18:04:05Z<p>Wcohen: /* Quiz */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-405/unsup-on-graphs.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/unsup-on-graphs.pdf PDF].<br />
<br />
=== Quiz ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/248 Quiz for this lecture]<br />
<br />
=== Optional Readings ===<br />
<br />
* Von Luxburg, Ulrike. "A tutorial on spectral clustering." Statistics and Computing 17.4 (2007): 395-416.<br />
* Frank Lin and William W. Cohen (2010): Power Iteration Clustering in ICML-2010.<br />
* Frank Lin and William W. Cohen (2010): A Very Fast Method for Clustering Big Text Datasets in ECAI-2010.<br />
* Frank Lin and William W. Cohen (2011): Adaptation of Graph-Based Semi-Supervised Methods to Large-Scale Text Data in MLG-2011.<br />
* Ramnath Balasubramanyan, Frank Lin, and William W. Cohen (2010): Node Clustering in Graphs: An Empirical Study in NIPS-2010 Workshop on Networks Across Disciplines.<br />
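As a rough illustration of the ideas in this lecture, the sketch below builds the graph Laplacian D-A and the row-normalized matrix W = D^-1 A from an affinity matrix A, then runs a few steps of power iteration on W to get a one-dimensional embedding of the nodes. This is a simplified sketch, not the code from the papers above: the function names, the fixed iteration count, and the midpoint split (standing in for k-means on the embedding) are all assumptions made for this example.

```python
def graph_laplacian(A):
    """Unnormalized graph Laplacian L = D - A, with D the diagonal degree matrix."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]


def row_normalize(A):
    """W = D^-1 A: each row of A divided by its row sum (assumes no zero rows)."""
    return [[a / sum(row) for a in row] for row in A]


def power_iteration_embedding(A, v0, num_iters=10):
    """Run a small, fixed number of steps v <- W v, rescaling by the L1 norm.

    Stopping early is the point: the vector is not yet the trivial constant
    eigenvector, but entries within a well-connected cluster are already close.
    """
    W = row_normalize(A)
    v = list(v0)
    for _ in range(num_iters):
        v = [sum(w * x for w, x in zip(row, v)) for row in W]
        norm = sum(abs(x) for x in v)
        v = [x / norm for x in v]
    return v


def two_way_split(v):
    """Hypothetical stand-in for k-means with K=2: split at the midpoint of the range."""
    mid = (min(v) + max(v)) / 2.0
    return [0 if x < mid else 1 for x in v]
```

For example, on two triangles joined by one weak edge, <code>two_way_split(power_iteration_embedding(A, v0))</code> should put each triangle in its own cluster for a reasonable starting vector <code>v0</code>. Note that only matrix-vector products with W are needed, which is what makes this style of method cheap on large sparse graphs.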
<br />
=== Things To Remember ===<br />
<br />
* The definitions of the graph Laplacian (D-A) and normalized Laplacian (I-W)<br />
* What the largest eigenvectors of W look like for a block-stochastic matrix<br />
* What spectral clustering is: clustering after mapping nodes in a graph to points defined by the largest K non-trivial eigenvectors of W.<br />
* What power iteration clustering is.<br />
* How to implement the "manifold trick" for PIC and SSL.<br />
* Why the "manifold trick" improves computational efficiency, relative to computing a K-NN graph.</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Syllabus_for_Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018Syllabus for Machine Learning with Large Datasets 10-405 in Spring 20182018-04-23T14:50:41Z<p>Wcohen: /* Schedule */</p>
<hr />
<div>This is the syllabus for [[Machine Learning with Large Datasets 10-405 in Spring 2018]]. <br />
<br />
== Ideas for open-ended extensions to the HW assignments ==<br />
<br />
This is not a complete list! You can use any of these as a starting point, but feel free to think up your own extensions.<br />
<br />
HW2 (NB in GuineaPig):<br />
<br />
* The assignment proposes one particular scheme for parallelizing the training/testing algorithm. Consider another parallelization algorithm.<br />
* Implement a similarly scalable Rocchio algorithm and compare it with NB.<br />
* Reimplement the same algorithm in Spark (or some other dataflow language) and compare.<br />
* One of the extensions to GuineaPig not discussed in class is an in-memory map-reduce system. Design an experiment that makes use of this constructively.<br />
<br />
HW3 (Logistic regression and SGD)<br />
* Evaluate the hash trick for Naive Bayes systematically on a series of datasets.<br />
* Implement a parameter-mixing version of logistic regression and evaluate it.<br />
* A [https://www.aclweb.org/anthology/P12-2018 recent paper] proposes (roughly) using SVM with NB-transformed features. Implement this and compare.<br />
* The personalization method described in class is based on [https://www.umiacs.umd.edu/~hal/docs/daume07easyadapt.pdf a transfer learning method] which works similarly. Many Wikipedia pages are available in multiple languages, and words in related languages tend to be lexically similar (e.g., "astrónomo" is Spanish for "astronomer"). Suppose features were character n-grams (e.g., "astr", "stro", "tron", ...) - does domain transfer work for the task of classifying Wikipedia pages? Construct a dataset and experiment to test this hypothesis.<br />
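To make the last idea concrete, here is a small sketch of the feature construction it suggests: character n-grams for each word, with every feature copied into a shared version and a language-specific version, in the style of the linked transfer learning method. The function names, the <code>shared:</code> prefix, and the use of a language code as the domain name are assumptions made for this example.

```python
def char_ngrams(word, n=4):
    """All contiguous character n-grams of a word, in order."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def easyadapt_features(ngrams, domain):
    """Feature counts with each n-gram duplicated into a shared and a per-domain copy.

    The shared copy lets a linear model learn weights that transfer across
    domains (here, languages); the per-domain copy lets it learn exceptions.
    """
    feats = {}
    for g in ngrams:
        for key in ("shared:" + g, domain + ":" + g):
            feats[key] = feats.get(key, 0) + 1
    return feats
```

For instance, <code>char_ngrams("astronomer")</code> begins with "astr", "stro", "tron", and <code>easyadapt_features(char_ngrams("astronomer"), "en")</code> yields both a <code>shared:astr</code> and an <code>en:astr</code> feature. Whether the shared n-grams carry enough signal across languages is exactly what the proposed experiment would test.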
<br />
HW4/5 (Autodiff)<br />
* Find a recent paper which includes data and an implementation for a non-trivial neural model in a standard framework (eg pyTorch or TensorFlow), and port the model to your autodiff framework.<br />
* On a machine with multiple CPUs, use the <code>multiprocessing</code> and <code>multiprocessing.pool</code> framework to parallelize gradient computation on CPUs. The architecture you build should stream through the data and construct multiple tasks, each of which requires a worker to perform a minibatch update and then broadcast the parameter updates to another worker that will accumulate them. (So this system would be doing delayed SGD on minibatches.) Perform some experiments to see how performance (time and accuracy!) changes as you increase the number of processors. What would be the advantages of this sort of architecture over a GPU-based one?<br />
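One possible starting point for the multiprocessing idea is sketched below. It is a toy, not a reference design: the model is a one-parameter linear regression, the parent process plays the role of the accumulating worker, and all function names and default values are assumptions made for this example. Each round hands the same (stale) parameter to several workers, which is what makes the updates delayed.

```python
import multiprocessing
import random


def minibatch_gradient(task):
    """Worker task: gradient of mean squared error for the model y ~ w * x."""
    w, batch = task
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def make_batches(data, batch_size):
    """Split the data stream into consecutive minibatches."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]


def delayed_sgd(data, num_workers=2, batch_size=8, lr=0.05, epochs=20):
    """Delayed minibatch SGD: gradients in a round are all computed at the same w."""
    w = 0.0
    batches = make_batches(data, batch_size)
    with multiprocessing.Pool(num_workers) as pool:
        for _ in range(epochs):
            for start in range(0, len(batches), num_workers):
                round_batches = batches[start:start + num_workers]
                # Every task in this round sees the same, possibly stale, parameter.
                grads = pool.map(minibatch_gradient, [(w, b) for b in round_batches])
                # The parent accumulates the updates (stand-in for a separate worker).
                for g in grads:
                    w -= lr * g
    return w


def demo(num_workers=2):
    """Fit y = 3x on synthetic data; call this under an `if __name__ == "__main__":` guard."""
    rng = random.Random(0)
    data = [(x, 3.0 * x) for x in (rng.uniform(-1, 1) for _ in range(64))]
    return delayed_sgd(data, num_workers=num_workers)
```

Timing <code>demo(num_workers=k)</code> for several values of k is a crude version of the experiment described above; for a model this small, the cost of shipping tasks between processes will likely dominate, which is itself a useful thing to measure.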
<br />
HW6 (SSL):<br />
* Implement the optimization for modified adsorption (MAD) and compare<br />
* Implement the sketch-based approach for SSL described in the paper below and compare: Partha Pratim Talukdar and William W. Cohen (2014): Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch in AISTATS 2014.<br />
<br />
=== Notes ===<br />
<br />
* Homeworks, unless otherwise posted, will be due when the next HW comes out.<br />
* Lecture notes and/or slides will be (re)posted around the time of the lectures.<br />
<br />
=== Schedule ===<br />
<br />
* Wed Jan 17, 2018 [[Class meeting for 10-405 Overview|Overview]]. Grading policies, etc., History of Big Data, Complexity theory and cost of important operations<br />
* Mon Jan 22, 2018 [[Class meeting for 10-405 Probability Review|Probability Review]]. Counting for big data and density estimation, streaming Naive Bayes, Rocchio and TFIDF<br />
** '''Start work on''' Assignment 1a: Streaming NB; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1a.pdf <br />
* Wed Jan 24, 2018 [[Class meeting for 10-405 Streaming Naive Bayes|Streaming Naive Bayes]]. Notes on scalable naive bayes, Alternatives to stream and sort, Local counting in stream and sort, Stream and sort examples<br />
* Mon Jan 29, 2018 [[Class meeting for 10-405 Hadoop Overview|Hadoop Overview]]. Intro to Hadoop, Hadoop Streaming, Debugging Hadoop, Combiners<br />
** '''Start work on''' Assignment 1b: Streaming NB on Hadoop; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1b.pdf <br />
* Wed Jan 31, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 1]]. Scalable classification, Abstracts for map-reduce algorithms, Joins in Hadoop<br />
* Mon Feb 5, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 2]]. Guinea Pig intro, Similarity joins, Similarity joins with TFIDF, Parallel simjoins<br />
** '''Start work on''' Assignment 2a: Naive bayes training in Guinea Pig; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2a.pdf <br />
* Wed Feb 7, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 3]]. PageRank, PageRank in Pig and Guinea Pig, K-means in Pig, Spark, Systems built on top of Hadoop<br />
* Mon Feb 12, 2018 [[Class meeting for 10-405 SGD and Hash Kernels|SGD and Hash Kernels]]. Learning as optimization, Logistic regression with SGD, Regularized SGD, Efficient regularized SGD, Hash kernels for logistic regression<br />
** '''Start work on''' Assignment 2b: Naive bayes testing in Guinea Pig. http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2b.pdf <br />
* Wed Feb 14, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 1]]. The "delta trick", Averaged perceptrons, Debugging ML algorithms<br />
* Mon Feb 19, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 2]]. Hash kernels, Ranking perceptrons, Structured perceptrons<br />
** '''Start work on''' Assignment 3: scalable SGD; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw3.pdf<br />
* Wed Feb 21, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 3]]. Iterative parameter mixing paper, Parallel SGD via Param Mixing<br />
* Mon Feb 26, 2018 [[Class meeting for 10-405 SGD for MF|SGD for MF]]. Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD<br />
* Wed Feb 28, 2018 [[Class meeting for 10-405 Guest lecture - tentative|Guest lecture]] - [http://www.cs.cmu.edu/~kijungs/ Kijung Shin]<br />
* Mon Mar 5, 2018 [[Class meeting for 10-405 Midterm review and catchup|Midterm review and catchup]]. Midterm review<br />
** '''Previous assignment due'''<br />
* Wed Mar 7, 2018 [[Class meeting for 10-405 Midterm|Midterm]]. <br />
* Mon Mar 19, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 1]]. BackProp following Nielsen, Deep learning and GPUs, Reverse-mode differentiation (autodiff)<br />
* Wed Mar 21, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 2]]. Reverse-mode differentiation (autodiff), Some systems using autodiff, Details on Wengert lists, Breakdown of xman.py, Inputs, parameters, updates<br />
** '''Start work on''' Assignment 4: Autodiff with IPM part 1/2; Draft at http://www.cs.cmu.edu/~wcohen/10-605/assignments/2017-fall/hw-4+5-autodiff/hw4/main.pdf<br />
* Mon Mar 26, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 3]]. Expressiveness of MLPs, Exploding and vanishing gradients, Modern deep learning models, Word2vec and GloVe, Recursive ANNs, Convolutional ANNs, Architectures using RNNs<br />
* Wed Mar 28, 2018 [[Class meeting for 10-405 Computing with GPUs|Computing with GPUs]]. Introduction to GPUs, CUDA<br />
* Fri Mar 30, 2018<br />
** '''Start work on''' Assignment 5: Autodiff with IPM part 2/2<br />
* Mon Apr 2, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 1]]. Bloom filters, The countmin sketch, CM Sketches in Deep Learning<br />
** '''HW 4 is due'''<br />
* Wed Apr 4, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 2]]. Review of Bloom filters, Locality sensitive hashing, Online LSH<br />
* Mon Apr 9, 2018 [[Class meeting for 10-405 Graph Architectures for ML|Graph Architectures for ML]]. Graph-based ML architectures, Pregel, Signal-collect, GraphLab, PowerGraph, GraphChi, GraphX<br />
* Wed Apr 11, 2018 [[Class meeting for 10-405 SSL on Graphs|SSL on Graphs]]. Semi-supervised learning intro, Multirank-walk SSL method, Harmonic fields, Modified Adsorption SSL method, MAD with countmin sketches<br />
** '''Start work on''' Assignment 6: SSL on a graph in Spark maybe using NELL data?<br />
* Mon Apr 16, 2018 [[Class meeting for 10-405 LDA|LDA 1]]. DGMs for naive Bayes, Gibbs sampling for LDA<br />
* Wed Apr 18, 2018 [[Class meeting for 10-405 LDA|LDA 2]]. Parallelizing LDA, Fast sampling for LDA, DGMs for graphs<br />
* Mon Apr 23, 2018 [[Class meeting for 10-405 Unsupervised Learning On Graphs|Unsupervised Learning On Graphs]]. Spectral clustering, Power iteration clustering, Label propagation for clustering non-graph data, Label propagation for SSL on non-graph data<br />
* Wed Apr 25, 2018 [[Class meeting for 10-405 Parameter Servers|Parameter Servers]]. Parameter servers, PS vs Hadoop, Stale Synchronous Parallel (SSP) model, Managed Communication in PS, LDA Sampler with PS<br />
** '''Last assignment due'''<br />
* Mon Apr 30, 2018 [[Class meeting for 10-405 Review session for final|Review session for final]]. <br />
* Wed May 2, 2018 [[Class meeting for 10-405 Final Exam|Final Exam]].</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Randomized_AlgorithmsClass meeting for 10-405 Randomized Algorithms2018-04-23T14:43:15Z<p>Wcohen: /* Also discussed */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1 [http://www.cs.cmu.edu/~wcohen/10-405/randomized-1.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/randomized-1.pdf PDF]. <br />
* Lecture 2 [http://www.cs.cmu.edu/~wcohen/10-405/randomized-2.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/randomized-2.pdf PDF].<br />
<br />
=== Quizzes ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/83 Quiz for lecture 1]<br />
* [https://qna.cs.cmu.edu/#/pages/view/217 Quiz for lecture 2]<br />
<br />
=== Sample Code ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/bloomfilter.py Python demo code for Bloom filter]<br />
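For a self-contained illustration alongside the demo code linked above, here is a minimal Bloom filter sketch. It is not the linked file: the class name, the default sizes, and the use of salted SHA-256 digests as the hash functions are assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        """One bit position per hash function, derived by salting the item."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True only if every hashed position is set; collisions cause false positives.
        return all(self.bits[pos] for pos in self._positions(item))
```

After <code>bf.add("hadoop")</code>, the test <code>"hadoop" in bf</code> is always true, while a test for an item that was never added is usually, but not always, false. Using more bits lowers the false positive rate at the cost of space.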
<br />
=== Readings ===<br />
<br />
* William's [http://www.cs.cmu.edu/~wcohen/10-605/notes/randomized-algs.pdf lecture notes on randomized algorithms] (covering Bloom filters and countmin sketches).<br />
* [http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallACL10.pdf Online Generation of Locality Sensitive Hash Signatures]. Benjamin Van Durme and Ashwin Lall. ACL Short. 2010<br />
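As a companion to the lecture notes listed above, the following is a minimal count-min sketch. It is an illustrative sketch only: the class name, the default width and depth, and the salted SHA-256 hashing are assumptions made for this example, not taken from the notes.

```python
import hashlib


class CountMinSketch:
    """Approximate counts in fixed space; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        """Column for this item in the given row; each row uses its own salt."""
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the best guess.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.depth))

    def merge(self, other):
        """Add another sketch of the same shape cell by cell (sketches are additive)."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same shape to be merged")
        for row in range(self.depth):
            for col in range(self.width):
                self.table[row][col] += other.table[row][col]
```

With non-negative counts, <code>estimate</code> can overestimate but never underestimate, and two sketches built on different shards of the data can be combined with <code>merge</code> as long as they share the same shape and hash functions.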
<br />
<br />
=== Optional Readings ===<br />
<br />
* [http://dl.acm.org/citation.cfm?id=1219840.1219917 Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering] Deepak Ravichandran, Patrick Pantel, and Eduard Hovy<br />
* [http://www.umiacs.umd.edu/~amit/Papers/goyalPointQueryEMNLP12.pdf Sketch Algorithms for Estimating Point Queries in NLP.] Amit Goyal, Hal Daume III, and Graham Cormode, EMNLP 2012<br />
<br />
=== Also discussed ===<br />
<br />
* [https://openreview.net/forum?id=r1br_2Kge Short and Deep: Sketching and Neural Networks: Amit Daniely, Nevena Lazic, Yoram Singer, Kunal Talwar, ICLR 2017]<br />
* [https://openreview.net/pdf?id=rkKCdAdgx Compact Embedding of Binary-coded Inputs and Outputs using Bloom Filters, Serra & Alexandros Karatzoglou 2017]<br />
* [https://dl.acm.org/citation.cfm?id=3078983 Lin, Jie, et al. "DeepHash for Image Instance Retrieval: Getting Regularization, Depth and Fine-Tuning Right." Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval. ACM, 2017.]<br />
<br />
=== Key things to remember ===<br />
<br />
* The API for the randomized methods we studied: Bloom filters, CM sketches, and LSH.<br />
* The benefits of the online LSH method.<br />
* The key algorithmic ideas behind these methods: random projections, hashing and allowing collisions, controlling probability of collisions with multiple hashes, and use of pooling to avoid storing many randomly-created objects.<br />
* When you would use which technique.<br />
* The relationship between hash kernels and CM sketches.<br />
* What are the key tradeoffs associated with these methods, in terms of space/time efficiency and accuracy, and what sorts of errors are made by which algorithms (e.g., if they give over/under estimates, false positives/false negatives, etc).<br />
* What guarantees are possible, and how space grows as you require more accuracy.<br />
* Which algorithms allow one to combine sketches easily (i.e., when are the sketches additive).</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_LDAClass meeting for 10-405 LDA2018-04-16T14:23:51Z<p>Wcohen: /* Readings */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/lda-1.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/lda-1.pdf PDF].<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/lda-2.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/lda-2.pdf PDF].<br />
<br />
=== Quiz ===<br />
<br />
* No quiz for lecture 1<br />
* [https://qna.cs.cmu.edu/#/pages/view/105 Quiz for lecture 2]<br />
<br />
=== Readings ===<br />
<br />
Basic LDA:<br />
<br />
* Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of Machine Learning Research 3.Jan (2003): 993-1022.<br />
<br />
<br />
Speedups for LDA:<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/notes/lda.pdf William's notes on fast sampling for LDA]<br />
<br />
=== Optional Readings ===<br />
<br />
* [http://jmlr.csail.mit.edu/papers/volume10/newman09a/newman09a.pdf Distributed Algorithms for Topic Models], Newman et al, JMLR 2009.<br />
* [http://people.cs.umass.edu/~mimno/papers/fast-topic-model.pdf Efficient Methods for Topic Model Inference on Streaming Document Collections], Yao, Mimno, McCallum KDD 2009.<br />
* [http://dl.acm.org/citation.cfm?id=2623756 Reducing the sampling complexity of topic models], Li, Ahmed, Ravi, & Smola, KDD 2014<br />
* [https://dl.acm.org/citation.cfm?id=2741682 A Scalable Asynchronous Distributed Algorithm for Topic Modeling], Yu, Hsieh, Yun, Vishwanathan, Dillon, WWW 2015<br />
<br />
=== Things to remember ===<br />
<br />
* How Gibbs sampling is used to sample from a model.<br />
* The "generative story" associated with key models like LDA, naive Bayes, and stochastic block models.<br />
* What a "mixed membership" generative model is.<br />
* The time complexity and storage requirements of Gibbs sampling for LDAs.<br />
* How LDA learning can be sped up using IPM approaches.<br />
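To make the Gibbs sampling points above more concrete, here is a minimal collapsed Gibbs sampler for LDA. It is a plain, unoptimized sketch under stated assumptions: documents are lists of integer word ids, the hyperparameter defaults are arbitrary, and the function name and return values are choices made for this example rather than anything from the readings.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              num_sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # D x K counts
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # K x V counts
    topic_total = [0] * num_topics                               # tokens per topic
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_d)

    for _ in range(num_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional over topics: O(K) work per token.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return assignments, doc_topic, topic_word
```

Each sweep costs time proportional to the number of tokens times the number of topics, and the storage is the D x K and K x V count tables plus one topic assignment per token. The per-token loop over all K topics is the step that faster samplers try to avoid.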
<br />
* Why efficient sampling is important for LDAs<br />
* How sampling can be sped up for many topics by preprocessing the parameters of the distribution<br />
* How the storage used for LDA can be reduced by exploiting the fact that many words are rare.</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Randomized_AlgorithmsClass meeting for 10-405 Randomized Algorithms2018-04-02T18:35:22Z<p>Wcohen: </p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1 [http://www.cs.cmu.edu/~wcohen/10-405/randomized-1.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/randomized-1.pdf PDF]. <br />
* Lecture 2 [http://www.cs.cmu.edu/~wcohen/10-405/randomized-2.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/randomized-2.pdf PDF].<br />
<br />
=== Quizzes ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/83 Quiz for lecture 1]<br />
* [https://qna.cs.cmu.edu/#/pages/view/217 Quiz for lecture 2]<br />
<br />
=== Sample Code ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/bloomfilter.py Python demo code for Bloom filter]<br />
<br />
=== Readings ===<br />
<br />
* William's [http://www.cs.cmu.edu/~wcohen/10-605/notes/randomized-algs.pdf lecture notes on randomized algorithms] (covering Bloom filters and countmin sketches).<br />
* [http://www.cs.jhu.edu/~vandurme/papers/VanDurmeLallACL10.pdf Online Generation of Locality Sensitive Hash Signatures]. Benjamin Van Durme and Ashwin Lall. ACL Short. 2010<br />
<br />
<br />
=== Optional Readings ===<br />
<br />
* [http://dl.acm.org/citation.cfm?id=1219840.1219917 Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering] Deepak Ravichandran, Patrick Pantel, and Eduard Hovy<br />
* [http://www.umiacs.umd.edu/~amit/Papers/goyalPointQueryEMNLP12.pdf Sketch Algorithms for Estimating Point Queries in NLP.] Amit Goyal, Hal Daume III, and Graham Cormode, EMNLP 2012<br />
<br />
=== Also discussed ===<br />
<br />
* [https://openreview.net/forum?id=r1br_2Kge Short and Deep: Sketching and Neural Networks: Amit Daniely, Nevena Lazic, Yoram Singer, Kunal Talwar, ICLR 2017]<br />
* [https://dl.acm.org/citation.cfm?id=3078983 Lin, Jie, et al. "DeepHash for Image Instance Retrieval: Getting Regularization, Depth and Fine-Tuning Right." Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval. ACM, 2017.]<br />
<br />
=== Key things to remember ===<br />
<br />
* The API for the randomized methods we studied: Bloom filters, CM sketches, and LSH.<br />
* The benefits of the online LSH method.<br />
* The key algorithmic ideas behind these methods: random projections, hashing and allowing collisions, controlling probability of collisions with multiple hashes, and use of pooling to avoid storing many randomly-created objects.<br />
* When you would use which technique.<br />
* The relationship between hash kernels and CM sketches.<br />
* What are the key tradeoffs associated with these methods, in terms of space/time efficiency and accuracy, and what sorts of errors are made by which algorithms (e.g., if they give over/under estimates, false positives/false negatives, etc).<br />
* What guarantees are possible, and how space grows as you require more accuracy.<br />
* Which algorithms allow one to combine sketches easily (i.e., when are the sketches additive).</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Syllabus_for_Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018Syllabus for Machine Learning with Large Datasets 10-405 in Spring 20182018-04-02T15:33:05Z<p>Wcohen: /* Ideas for open-ended extensions to the HW assignments */</p>
<hr />
<div>This is the syllabus for [[Machine Learning with Large Datasets 10-405 in Spring 2018]]. <br />
<br />
== Ideas for open-ended extensions to the HW assignments ==<br />
<br />
This is not a complete list! You can use any of these as a starting point, but feel free to think up your own extensions.<br />
<br />
HW2 (NB in GuineaPig):<br />
<br />
* The assignment proposes one particular scheme for parallelizing the training/testing algorithm. Consider another parallelization algorithm.<br />
* Implement a similarly scalable Rocchio algorithm and compare it with NB.<br />
* Reimplement the same algorithm in Spark (or some other dataflow language) and compare.<br />
* One of the extensions to GuineaPig not discussed in class is an in-memory map-reduce system. Design an experiment that makes use of this constructively.<br />
<br />
HW3 (Logistic regression and SGD)<br />
* Evaluate the hash trick for Naive Bayes systematically on a series of datasets.<br />
* Implement a parameter-mixing version of logistic regression and evaluate it.<br />
* A [https://www.aclweb.org/anthology/P12-2018 recent paper] proposes (roughly) using SVM with NB-transformed features. Implement this and compare.<br />
* The personalization method described in class is based on [https://www.umiacs.umd.edu/~hal/docs/daume07easyadapt.pdf a transfer learning method] which works similarly. Many Wikipedia pages are available in multiple languages, and words in related languages tend to be lexically similar (e.g., "astrónomo" is Spanish for "astronomer"). Suppose features were character n-grams (e.g., "astr", "stro", "tron", ...) - does domain transfer work for the task of classifying Wikipedia pages? Construct a dataset and experiment to test this hypothesis.<br />
<br />
HW4/5 (Autodiff)<br />
* Find a recent paper which includes data and an implementation for a non-trivial neural model in a standard framework (eg pyTorch or TensorFlow), and port the model to your autodiff framework.<br />
* On a machine with multiple CPUs, use the <code>multiprocessing</code> and <code>multiprocessing.pool</code> framework to parallelize gradient computation on CPUs. The architecture you build should stream through the data and construct multiple tasks, each of which requires a worker to perform a minibatch update and then broadcast the parameter updates to another worker that will accumulate them. (So this system would be doing delayed SGD on minibatches.) Perform some experiments to see how performance (time and accuracy!) changes as you increase the number of processors. What would be the advantages of this sort of architecture over a GPU-based one?<br />
<br />
HW6 (SSL):<br />
* Implement the optimization for modified adsorption (MAD) and compare<br />
* Implement the sketch-based approach for SSL described in the paper below and compare: Partha Pratim Talukdar and William W. Cohen (2014): Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch in AISTATS 2014.<br />
<br />
=== Notes ===<br />
<br />
* Homeworks, unless otherwise posted, will be due when the next HW comes out.<br />
* Lecture notes and/or slides will be (re)posted around the time of the lectures.<br />
<br />
=== Schedule ===<br />
<br />
* Wed Jan 17, 2018 [[Class meeting for 10-405 Overview|Overview]]. Grading policies, etc., History of Big Data, Complexity theory and cost of important operations<br />
* Mon Jan 22, 2018 [[Class meeting for 10-405 Probability Review|Probability Review]]. Counting for big data and density estimation, streaming Naive Bayes, Rocchio and TFIDF<br />
** '''Start work on''' Assignment 1a: Streaming NB; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1a.pdf <br />
* Wed Jan 24, 2018 [[Class meeting for 10-405 Streaming Naive Bayes|Streaming Naive Bayes]]. Notes on scalable naive bayes, Alternatives to stream and sort, Local counting in stream and sort, Stream and sort examples<br />
* Mon Jan 29, 2018 [[Class meeting for 10-405 Hadoop Overview|Hadoop Overview]]. Intro to Hadoop, Hadoop Streaming, Debugging Hadoop, Combiners<br />
** '''Start work on''' Assignment 1b: Streaming NB on Hadoop; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1b.pdf <br />
* Wed Jan 31, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 1]]. Scalable classification, Abstracts for map-reduce algorithms, Joins in Hadoop<br />
* Mon Feb 5, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 2]]. Guinea Pig intro, Similarity joins, Similarity joins with TFIDF, Parallel simjoins<br />
** '''Start work on''' Assignment 2a: Naive bayes training in Guinea Pig; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2a.pdf <br />
* Wed Feb 7, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 3]]. PageRank, PageRank in Pig and Guinea Pig, K-means in Pig, Spark, Systems built on top of Hadoop<br />
* Mon Feb 12, 2018 [[Class meeting for 10-405 SGD and Hash Kernels|SGD and Hash Kernels]]. Learning as optimization, Logistic regression with SGD, Regularized SGD, Efficient regularized SGD, Hash kernels for logistic regression<br />
** '''Start work on''' Assignment 2b: Naive bayes testing in Guinea Pig. http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2b.pdf <br />
* Wed Feb 14, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 1]]. The "delta trick", Averaged perceptrons, Debugging ML algorithms<br />
* Mon Feb 19, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 2]]. Hash kernels, Ranking perceptrons, Structured perceptrons<br />
** '''Start work on''' Assignment 3: scalable SGD; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw3.pdf<br />
* Wed Feb 21, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 3]]. Iterative parameter mixing paper, Parallel SGD via Param Mixing<br />
* Mon Feb 26, 2018 [[Class meeting for 10-405 SGD for MF|SGD for MF]]. Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD<br />
* Wed Feb 28, 2018 [[Class meeting for 10-405 Guest lecture - tentative|Guest lecture]] - [http://www.cs.cmu.edu/~kijungs/ Kijung Shin]<br />
* Mon Mar 5, 2018 [[Class meeting for 10-405 Midterm review and catchup|Midterm review and catchup]]. Midterm review<br />
** '''Previous assignment due'''<br />
* Wed Mar 7, 2018 [[Class meeting for 10-405 Midterm|Midterm]]. <br />
* Mon Mar 19, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 1]]. BackProp following Nielsen, Deep learning and GPUs, Reverse-mode differentiation (autodiff)<br />
* Wed Mar 21, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 2]]. Reverse-mode differentiation (autodiff), Some systems using autodiff, Details on Wengert lists, Breakdown of xman.py, Inputs, parameters, updates<br />
** '''Start work on''' Assignment 4: Autodiff with IPM part 1/2; Draft at http://www.cs.cmu.edu/~wcohen/10-605/assignments/2017-fall/hw-4+5-autodiff/hw4/main.pdf<br />
* Mon Mar 26, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 3]]. Expressiveness of MLPs, Exploding and vanishing gradients, Modern deep learning models, Word2vec and GloVe, Recursive ANNs, Convolutional ANNs, Architectures using RNNs<br />
* Wed Mar 28, 2018 [[Class meeting for 10-405 Computing with GPUs|Computing with GPUs]]. Introduction to GPUs, CUDA<br />
* Fri Mar 30, 2018<br />
** '''Start work on''' Assignment 5: Autodiff with IPM part 2/2<br />
* Mon Apr 2, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 1]]. Bloom filters, The countmin sketch, CM Sketches in Deep Learning<br />
** '''HW 4 is due'''<br />
* Wed Apr 4, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 2]]. Review of Bloom filters, Locality sensitive hashing, Online LSH<br />
* Mon Apr 9, 2018 [[Class meeting for 10-405 Graph Architectures for ML|Graph Architectures for ML]]. Graph-based ML architectures, Pregel, Signal-collect, GraphLab, PowerGraph, GraphChi, GraphX<br />
* Wed Apr 11, 2018 [[Class meeting for 10-405 SSL on Graphs|SSL on Graphs]]. Semi-supervised learning intro, Multirank-walk SSL method, Harmonic fields, Modified Adsorption SSL method, MAD with countmin sketches<br />
** '''Start work on''' Assignment 6: SSL on a graph in Spark maybe using NELL data?<br />
* Mon Apr 16, 2018 [[Class meeting for 10-405 LDA|LDA 1]]. DGMs for naive Bayes, Gibbs sampling for LDA<br />
* Wed Apr 18, 2018 [[Class meeting for 10-405 LDA|LDA 2]]. Parallelizing LDA, Fast sampling for LDA, DGMs for graphs<br />
* Mon Apr 23, 2018 [[Class meeting for 10-405 Parameter Servers|Parameter Servers]]. Parameter servers, PS vs Hadoop, Stale Synchronous Parallel (SSP) model, Managed Communication in PS, LDA Sampler with PS<br />
* Wed Apr 25, 2018 [[Class meeting for 10-405 Unsupervised Learning On Graphs|Unsupervised Learning On Graphs]]. Spectral clustering, Power iteration clustering, Label propagation for clustering non-graph data, Label propagation for SSL on non-graph data<br />
** '''Last assignment due'''<br />
* Mon Apr 30, 2018 [[Class meeting for 10-405 Review session for final|Review session for final]]. <br />
* Wed May 2, 2018 [[Class meeting for 10-405 Final Exam|Final Exam]].</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Syllabus_for_Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018Syllabus for Machine Learning with Large Datasets 10-405 in Spring 20182018-04-02T14:05:41Z<p>Wcohen: /* Ideas for open-ended extensions to the HW assignments */</p>
<hr />
<div>This is the syllabus for [[Machine Learning with Large Datasets 10-405 in Spring 2018]]. <br />
<br />
== Ideas for open-ended extensions to the HW assignments ==<br />
<br />
This is not a complete list! You can use any of these as a starting point, but feel free to think up your own extensions.<br />
<br />
HW2 (NB in GuineaPig):<br />
<br />
* The assignment proposes one particular scheme for parallelizing the training/testing algorithm. Consider another parallelization algorithm.<br />
* Implement a similarly scalable Rocchio algorithm and compare it with NB.<br />
* Reimplement the same algorithm in Spark (or some other dataflow language) and compare.<br />
* One of the extensions to GuineaPig not discussed in class is an in-memory map-reduce system. Design an experiment that makes constructive use of it.<br />
<br />
HW3 (Logistic regression and SGD)<br />
* Evaluate the hash trick for Naive Bayes systematically on a series of datasets.<br />
* Implement a parameter-mixing version of logistic regression and evaluate it.<br />
* A [https://www.aclweb.org/anthology/P12-2018 recent paper] proposes (roughly) using SVM with NB-transformed features. Implement this and compare.<br />
* The personalization method described in class is based on [https://www.umiacs.umd.edu/~hal/docs/daume07easyadapt.pdf a transfer learning method] which works similarly. Many Wikipedia pages are available in multiple languages, and words in related languages tend to be lexically similar (e.g., "astrónomo" is Spanish for "astronomer"). Suppose features were character n-grams (e.g., "astr", "stro", "tron", ...) - does domain transfer work for the task of classifying Wikipedia pages? Construct a dataset and an experiment to test this hypothesis.<br />
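<br />
The character n-gram features mentioned in the last HW3 idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (<code>strip_accents</code>, <code>char_ngrams</code>, <code>shared_ngrams</code>) are hypothetical, and folding accents before extracting n-grams is one possible design choice, not something the assignment prescribes.<br />
<br />
```python
import unicodedata


def strip_accents(text):
    """Drop combining marks so that e.g. an accented 'o' matches a plain 'o' (an assumed preprocessing step)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def char_ngrams(word, n=4, fold_accents=True):
    """Return the list of overlapping character n-grams of a word (empty if the word is shorter than n)."""
    if fold_accents:
        word = strip_accents(word)
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shared_ngrams(word_a, word_b, n=4):
    """Return the set of n-grams two words have in common, a rough proxy for lexical similarity."""
    return set(char_ngrams(word_a, n)) & set(char_ngrams(word_b, n))


if __name__ == "__main__":
    print(char_ngrams("astronomer"))
    print(sorted(shared_ngrams("astr\u00f3nomo", "astronomer")))
```
<br />
Features that are shared across languages in this way are what a transfer method could exploit; whether the overlap is large enough to help classification is the empirical question the project would have to answer.<br />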
<br />
HW4/5 (Autodiff)<br />
* Find a recent paper which includes data and an implementation for a non-trivial neural model in a standard framework (eg pyTorch or TensorFlow), and port the model to your autodiff framework.<br />
* On a machine with multiple CPUs, use the <code>multiprocessing</code> and <code>multiprocessing.pool</code> framework to parallelize gradient computation on CPUs. The architecture should stream through the data and construct multiple tasks, each of which requires a worker to perform a minibatch update and then broadcast the parameter updates to another worker that will accumulate them. (So this system would be doing delayed SGD on minibatches.) Perform some experiments to see how performance (time and accuracy!) changes as you increase the number of processors. What would be the advantages of this sort of architecture over a GPU-based one?<br />
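<br />
The multiprocessing idea above might be organized roughly as follows. This is a minimal sketch, not the required design: it uses plain logistic regression instead of your autodiff framework, the function names (<code>minibatch_gradient</code>, <code>delayed_sgd</code>) and hyperparameters are hypothetical, and for simplicity the parent process plays the role of the accumulating worker. Every task in a round is computed from the same weight snapshot, so all but the first update in a round use slightly stale weights, which is what makes this delayed SGD.<br />
<br />
```python
import math
from multiprocessing import Pool


def minibatch_gradient(task):
    """Worker: average logistic-loss gradient on one minibatch at a (possibly stale) weight snapshot."""
    weights, batch = task
    grad = [0.0] * len(weights)
    for x, y in batch:
        z = sum(w * xi for w, xi in zip(weights, x))
        p = 1.0 / (1.0 + math.exp(-z))
        for j, xi in enumerate(x):
            grad[j] += (p - y) * xi
    return [g / len(batch) for g in grad]


def delayed_sgd(data, n_workers=2, batch_size=4, epochs=20, lr=0.5):
    """Parent: hand out minibatch tasks in rounds and accumulate the returned gradients."""
    weights = [0.0] * len(data[0][0])
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    with Pool(n_workers) as pool:
        for _ in range(epochs):
            for start in range(0, len(batches), n_workers):
                # All tasks in this round share one snapshot of the weights.
                tasks = [(list(weights), b) for b in batches[start:start + n_workers]]
                for grad in pool.map(minibatch_gradient, tasks):
                    weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


if __name__ == "__main__":
    # Toy data: a bias feature plus one real feature; the label is 1 when the feature is positive.
    values = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
    toy = [([1.0, v], 1 if v > 0 else 0) for v in values]
    print(delayed_sgd(toy))
```
<br />
For the experiments, one could time <code>delayed_sgd</code> for increasing values of <code>n_workers</code> and compare held-out accuracy, since larger rounds mean staler gradients.<br />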
<br />
=== Notes ===<br />
<br />
* Homeworks, unless otherwise posted, will be due when the next HW comes out.<br />
* Lecture notes and/or slides will be (re)posted around the time of the lectures.<br />
<br />
=== Schedule ===<br />
<br />
* Wed Jan 17, 2018 [[Class meeting for 10-405 Overview|Overview]]. Grading policies and etc, History of Big Data, Complexity theory and cost of important operations<br />
* Mon Jan 22, 2018 [[Class meeting for 10-405 Probability Review|Probability Review]]. Counting for big data and density estimation, streaming Naive Bayes, Rocchio and TFIDF<br />
** '''Start work on''' Assignment 1a: Streaming NB; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1a.pdf <br />
* Wed Jan 24, 2018 [[Class meeting for 10-405 Streaming Naive Bayes|Streaming Naive Bayes]]. Notes on scalable naive bayes, Alternatives to stream and sort, Local counting in stream and sort, Stream and sort examples<br />
* Mon Jan 29, 2018 [[Class meeting for 10-405 Hadoop Overview|Hadoop Overview]]. Intro to Hadoop, Hadoop Streaming, Debugging Hadoop, Combiners<br />
** '''Start work on''' Assignment 1b: Streaming NB on Hadoop; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1b.pdf <br />
* Wed Jan 31, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 1]]. Scalable classification, Abstracts for map-reduce algorithms, Joins in Hadoop<br />
* Mon Feb 5, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 2]]. Guinea Pig intro, Similarity joins, Similarity joins with TFIDF, Parallel simjoins<br />
** '''Start work on''' Assignment 2a: Naive bayes training in Guinea Pig; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2a.pdf <br />
* Wed Feb 7, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 3]]. PageRank, PageRank in Pig and Guinea Pig, K-means in Pig, Spark, Systems built on top of Hadoop<br />
* Mon Feb 12, 2018 [[Class meeting for 10-405 SGD and Hash Kernels|SGD and Hash Kernels]]. Learning as optimization, Logistic regression with SGD, Regularized SGD, Efficient regularized SGD, Hash kernels for logistic regression<br />
** '''Start work on''' Assignment 2b: Naive bayes testing in Guinea Pig. http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2b.pdf <br />
* Wed Feb 14, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 1]]. The "delta trick", Averaged perceptrons, Debugging ML algorithms<br />
* Mon Feb 19, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 2]]. Hash kernels, Ranking perceptrons, Structured perceptrons<br />
** '''Start work on''' Assignment 3: scalable SGD; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw3.pdf<br />
* Wed Feb 21, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 3]]. Iterative parameter mixing paper, Parallel SGD via Param Mixing<br />
* Mon Feb 26, 2018 [[Class meeting for 10-405 SGD for MF|SGD for MF]]. Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD<br />
* Wed Feb 28, 2018 [[Class meeting for 10-405 Guest lecture - tentative|Guest lecture]] - [http://www.cs.cmu.edu/~kijungs/ Kijung Shin]<br />
* Mon Mar 5, 2018 [[Class meeting for 10-405 Midterm review and catchup|Midterm review and catchup]]. Midterm review<br />
** '''Previous assignment due'''<br />
* Wed Mar 7, 2018 [[Class meeting for 10-405 Midterm|Midterm]]. <br />
* Mon Mar 19, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 1]]. BackProp following Nielsen, Deep learning and GPUs, Reverse-mode differentiation (autodiff)<br />
* Wed Mar 21, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 2]]. Reverse-mode differentiation (autodiff), Some systems using autodiff, Details on Wengert lists, Breakdown of xman.py, Inputs, parameters, updates<br />
** '''Start work on''' Assignment 4: Autodiff with IPM part 1/2; Draft at http://www.cs.cmu.edu/~wcohen/10-605/assignments/2017-fall/hw-4+5-autodiff/hw4/main.pdf<br />
* Mon Mar 26, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 3]]. Expressiveness of MLPs, Exploding and vanishing gradients, Modern deep learning models, Word2vec and GloVe, Recursive ANNs, Convolutional ANNs, Architectures using RNNs<br />
* Wed Mar 28, 2018 [[Class meeting for 10-405 Computing with GPUs|Computing with GPUs]]. Introduction to GPUs, CUDA<br />
* Fri Mar 30, 2018<br />
** '''Start work on''' Assignment 5: Autodiff with IPM part 2/2<br />
* Mon Apr 2, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 1]]. Bloom filters, The countmin sketch, CM Sketches in Deep Learning<br />
** '''HW 4 is due'''<br />
* Wed Apr 4, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 2]]. Review of Bloom filters, Locality sensitive hashing, Online LSH<br />
* Mon Apr 9, 2018 [[Class meeting for 10-405 Graph Architectures for ML|Graph Architectures for ML]]. Graph-based ML architectures, Pregel, Signal-collect, GraphLab, PowerGraph, GraphChi, GraphX<br />
* Wed Apr 11, 2018 [[Class meeting for 10-405 SSL on Graphs|SSL on Graphs]]. Semi-supervised learning intro, Multirank-walk SSL method, Harmonic fields, Modified Adsorption SSL method, MAD with countmin sketches<br />
** '''Start work on''' Assignment 6: SSL on a graph in Spark maybe using NELL data?<br />
* Mon Apr 16, 2018 [[Class meeting for 10-405 LDA|LDA 1]]. DGMs for naive Bayes, Gibbs sampling for LDA<br />
* Wed Apr 18, 2018 [[Class meeting for 10-405 LDA|LDA 2]]. Parallelizing LDA, Fast sampling for LDA, DGMs for graphs<br />
* Mon Apr 23, 2018 [[Class meeting for 10-405 Parameter Servers|Parameter Servers]]. Parameter servers, PS vs Hadoop, Stale Synchronous Parallel (SSP) model, Managed Communication in PS, LDA Sampler with PS<br />
* Wed Apr 25, 2018 [[Class meeting for 10-405 Unsupervised Learning On Graphs|Unsupervised Learning On Graphs]]. Spectral clustering, Power iteration clustering, Label propagation for clustering non-graph data, Label propagation for SSL on non-graph data<br />
** '''Last assignment due'''<br />
* Mon Apr 30, 2018 [[Class meeting for 10-405 Review session for final|Review session for final]]. <br />
* Wed May 2, 2018 [[Class meeting for 10-405 Final Exam|Final Exam]].</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_SSL_on_GraphsClass meeting for 10-405 SSL on Graphs2018-03-30T14:50:02Z<p>Wcohen: /* Readings */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-405/ssl-on-graphs.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/ssl-on-graphs.pdf PDF].<br />
<br />
=== Quiz ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/92 Today's quiz]<br />
<br />
=== Readings ===<br />
<br />
* William's [http://www.cs.cmu.edu/~wcohen/10-605/notes/graph-ssl.pdf lecture notes on graph-based SSL].<br />
<br />
=== Optional Readings ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/postscript/asonam2010-final.pdf Frank Lin and William W. Cohen (2010)]: Semi-Supervised Classification of Network Data Using Very Few Labels in ASONAM-2010.<br />
* [https://server1.tepper.cmu.edu/seminars/docs/BinderPartha.pdf PP Talukdar, K Crammer (2009):] New regularized algorithms for transductive learning Machine Learning and Knowledge Discovery in Databases, 442-457<br />
* [http://www.cs.cmu.edu/~wcohen/postscript/ai-stats-2014.pdf Partha Pratim Talukdar and William W. Cohen (2014)]: Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch in AI-Stats 2014.<br />
* Sujith Ravi and Qiming Diao. "Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation." arXiv preprint arXiv:1512.01752 (2015).<br />
<br />
=== Key things to remember ===<br />
<br />
* What SSL is and when it is useful.<br />
* The harmonic fields and multi-rank walk SSL algorithms, and properties of these algorithms.<br />
* What is optimized by the MAD algorithm, and what the goal is of the various terms in the optimization.<br />
* The power iteration clustering algorithm.</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_SSL_on_GraphsClass meeting for 10-405 SSL on Graphs2018-03-30T14:49:03Z<p>Wcohen: /* Optional Readings */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-405/ssl-on-graphs.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/ssl-on-graphs.pdf PDF].<br />
<br />
=== Quiz ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/92 Today's quiz]<br />
<br />
=== Readings ===<br />
<br />
=== Optional Readings ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/postscript/asonam2010-final.pdf Frank Lin and William W. Cohen (2010)]: Semi-Supervised Classification of Network Data Using Very Few Labels in ASONAM-2010.<br />
* [https://server1.tepper.cmu.edu/seminars/docs/BinderPartha.pdf PP Talukdar, K Crammer (2009):] New regularized algorithms for transductive learning Machine Learning and Knowledge Discovery in Databases, 442-457<br />
* [http://www.cs.cmu.edu/~wcohen/postscript/ai-stats-2014.pdf Partha Pratim Talukdar and William W. Cohen (2014)]: Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch in AI-Stats 2014.<br />
* Sujith Ravi and Qiming Diao. "Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation." arXiv preprint arXiv:1512.01752 (2015).<br />
<br />
=== Key things to remember ===<br />
<br />
* What SSL is and when it is useful.<br />
* The harmonic fields and multi-rank walk SSL algorithms, and properties of these algorithms.<br />
* What is optimized by the MAD algorithm, and what the goal is of the various terms in the optimization.<br />
* The power iteration clustering algorithm.</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Deep_LearningClass meeting for 10-405 Deep Learning2018-03-26T16:04:59Z<p>Wcohen: /* Quizzes */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/deep-1.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-1.pdf PDF].<br />
<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/deep-2.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-2.pdf PDF] <br />
<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/deep-3.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-3.pdf PDF] (draft)<br />
<br />
=== Quizzes ===<br />
<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/79 Quiz for lecture 1]<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/246 Quiz for lecture 2]<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/212 Quiz for lecture 3]<br />
<br />
=== Sample code ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/code/xman.py Expression manager]<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/code/sample-use-of-xman.py Sample use of the expression manager]<br />
<br />
=== Readings ===<br />
<br />
* Automatic differentiation:<br />
** William's notes on [http://www.cs.cmu.edu/~wcohen/10-605/notes/autodiff.pdf automatic differentiation], and the Python code for a simple [http://www.cs.cmu.edu/~wcohen/10-605/code/xman.py Wengert list generator] and a [http://www.cs.cmu.edu/~wcohen/10-605/code/sample-use-of-xman.py sample use of one].<br />
** [https://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation/ Domke's blog post] - clear but not much detail - and [http://colah.github.io/posts/2015-08-Backprop/ another nice blog post].<br />
** The clearest paper I've found is [http://www.bcl.hamilton.ie/~barak/papers/toplas-reverse.pdf Reverse-Mode AD in a Functional Framework: Lambda the Ultimate Backpropagator]<br />
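<br />
To make the Wengert-list idea from these readings concrete, here is a small self-contained sketch of evaluating and differentiating one in reverse mode. It does not use the API of xman.py; the representation (a list of <code>(target, op, args)</code> triples) and the names <code>OPS</code>, <code>evaluate</code> and <code>differentiate</code> are assumptions made for illustration only.<br />
<br />
```python
# Each operator maps to (forward function, [partial derivative w.r.t. each argument]).
OPS = {
    "add": (lambda a, b: a + b, [lambda a, b: 1.0, lambda a, b: 1.0]),
    "mul": (lambda a, b: a * b, [lambda a, b: b, lambda a, b: a]),
    "square": (lambda a: a * a, [lambda a: 2.0 * a]),
}


def evaluate(wengert, inputs):
    """Forward pass: run the assignments in order and return every variable's value."""
    vals = dict(inputs)
    for target, op, args in wengert:
        forward, _ = OPS[op]
        vals[target] = forward(*(vals[a] for a in args))
    return vals


def differentiate(wengert, vals, output):
    """Backward pass: walk the list in reverse, accumulating d(output)/d(variable) by the chain rule."""
    grads = {name: 0.0 for name in vals}
    grads[output] = 1.0
    for target, op, args in reversed(wengert):
        _, partials = OPS[op]
        arg_vals = [vals[a] for a in args]
        for arg, partial in zip(args, partials):
            grads[arg] += grads[target] * partial(*arg_vals)
    return grads


# f(x, y) = (x*y + x)^2 written as a sequence of simple assignments.
WENGERT = [
    ("z1", "mul", ["x", "y"]),
    ("z2", "add", ["z1", "x"]),
    ("f", "square", ["z2"]),
]

if __name__ == "__main__":
    values = evaluate(WENGERT, {"x": 2.0, "y": 3.0})
    print(values["f"], differentiate(WENGERT, values, "f"))
```
<br />
Because gradients are accumulated with <code>+=</code>, a variable that is used in several assignments (such as <code>x</code> here) receives the sum of the contributions from each use.<br />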
<br />
* More general neural networks: <br />
** [http://neuralnetworksanddeeplearning.com/index.html Neural Networks and Deep Learning] An online book by Michael Nielsen, pitched at an appropriate level for 10-601, which has a bunch of exercises and on-line sample programs in Python.<br />
** For much much more detail, look at [http://www.deeplearningbook.org/ the MIT Press book (in preparation) from Bengio] - it's very complete but also fairly technical.<br />
<br />
=== Things to remember ===<br />
* The underlying reasons deep networks are hard to train<br />
** Exploding/vanishing gradients<br />
** Saturation<br />
* The importance of key recent advances in neural networks:<br />
** Matrix operations and GPU training<br />
** ReLU, cross-entropy, softmax<br />
* How backprop can be generalized to a sequence of assignment operations (autodiff)<br />
** Wengert lists<br />
** How to evaluate and differentiate a Wengert list<br />
* Common architectures<br />
** Multi-layer perceptron<br />
** Recurrent NNs (RNNs) and long short-term memory networks (LSTMs)<br />
** Convolutional Networks (CNNs)</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Deep_LearningClass meeting for 10-405 Deep Learning2018-03-22T01:02:20Z<p>Wcohen: /* Quizzes */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/deep-1.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-1.pdf PDF].<br />
<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/deep-2.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-2.pdf PDF] <br />
<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/deep-3.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-3.pdf PDF] (draft)<br />
<br />
=== Quizzes ===<br />
<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/79 Quiz for lecture 1]<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/246 Quiz for lecture 2]<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/212 Quiz for lecture 3] - draft<br />
<br />
=== Sample code ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/code/xman.py Expression manager]<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/code/sample-use-of-xman.py Sample use of the expression manager]<br />
<br />
=== Readings ===<br />
<br />
* Automatic differentiation:<br />
** William's notes on [http://www.cs.cmu.edu/~wcohen/10-605/notes/autodiff.pdf automatic differentiation], and the Python code for a simple [http://www.cs.cmu.edu/~wcohen/10-605/code/xman.py Wengert list generator] and a [http://www.cs.cmu.edu/~wcohen/10-605/code/sample-use-of-xman.py sample use of one].<br />
** [https://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation/ Domke's blog post] - clear but not much detail - and [http://colah.github.io/posts/2015-08-Backprop/ another nice blog post].<br />
** The clearest paper I've found is [http://www.bcl.hamilton.ie/~barak/papers/toplas-reverse.pdf Reverse-Mode AD in a Functional Framework: Lambda the Ultimate Backpropagator]<br />
<br />
* More general neural networks: <br />
** [http://neuralnetworksanddeeplearning.com/index.html Neural Networks and Deep Learning] An online book by Michael Nielsen, pitched at an appropriate level for 10-601, which has a bunch of exercises and on-line sample programs in Python.<br />
** For much much more detail, look at [http://www.deeplearningbook.org/ the MIT Press book (in preparation) from Bengio] - it's very complete but also fairly technical.<br />
<br />
=== Things to remember ===<br />
* The underlying reasons deep networks are hard to train<br />
** Exploding/vanishing gradients<br />
** Saturation<br />
* The importance of key recent advances in neural networks:<br />
** Matrix operations and GPU training<br />
** ReLU, cross-entropy, softmax<br />
* How backprop can be generalized to a sequence of assignment operations (autodiff)<br />
** Wengert lists<br />
** How to evaluate and differentiate a Wengert list<br />
* Common architectures<br />
** Multi-layer perceptron<br />
** Recurrent NNs (RNNs) and long short-term memory networks (LSTMs)<br />
** Convolutional Networks (CNNs)</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Deep_LearningClass meeting for 10-405 Deep Learning2018-03-22T01:02:04Z<p>Wcohen: /* Slides */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/deep-1.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-1.pdf PDF].<br />
<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/deep-2.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-2.pdf PDF] <br />
<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/deep-3.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-3.pdf PDF] (draft)<br />
<br />
=== Quizzes ===<br />
<br />
These are not updated yet --[[User:Wcohen|Wcohen]] ([[User talk:Wcohen|talk]]) 14:58, 19 March 2018 (EDT)<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/79 Quiz for lecture 1]<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/246 Quiz for lecture 2]<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/212 Quiz for lecture 3] - draft<br />
<br />
=== Sample code ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/code/xman.py Expression manager]<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/code/sample-use-of-xman.py Sample use of the expression manager]<br />
<br />
=== Readings ===<br />
<br />
* Automatic differentiation:<br />
** William's notes on [http://www.cs.cmu.edu/~wcohen/10-605/notes/autodiff.pdf automatic differentiation], and the Python code for a simple [http://www.cs.cmu.edu/~wcohen/10-605/code/xman.py Wengert list generator] and a [http://www.cs.cmu.edu/~wcohen/10-605/code/sample-use-of-xman.py sample use of one].<br />
** [https://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation/ Domke's blog post] - clear but not much detail - and [http://colah.github.io/posts/2015-08-Backprop/ another nice blog post].<br />
** The clearest paper I've found is [http://www.bcl.hamilton.ie/~barak/papers/toplas-reverse.pdf Reverse-Mode AD in a Functional Framework: Lambda the Ultimate Backpropagator]<br />
<br />
* More general neural networks: <br />
** [http://neuralnetworksanddeeplearning.com/index.html Neural Networks and Deep Learning] An online book by Michael Nielsen, pitched at an appropriate level for 10-601, which has a bunch of exercises and on-line sample programs in Python.<br />
** For much much more detail, look at [http://www.deeplearningbook.org/ the MIT Press book (in preparation) from Bengio] - it's very complete but also fairly technical.<br />
<br />
=== Things to remember ===<br />
* The underlying reasons deep networks are hard to train<br />
** Exploding/vanishing gradients<br />
** Saturation<br />
* The importance of key recent advances in neural networks:<br />
** Matrix operations and GPU training<br />
** ReLU, cross-entropy, softmax<br />
* How backprop can be generalized to a sequence of assignment operations (autodiff)<br />
** Wengert lists<br />
** How to evaluate and differentiate a Wengert list<br />
* Common architectures<br />
** Multi-layer perceptron<br />
** Recurrent NNs (RNNs) and long short-term memory networks (LSTMs)<br />
** Convolutional Networks (CNNs)</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Syllabus_for_Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018Syllabus for Machine Learning with Large Datasets 10-405 in Spring 20182018-03-20T19:16:53Z<p>Wcohen: /* Schedule */</p>
<hr />
<div>This is the syllabus for [[Machine Learning with Large Datasets 10-405 in Spring 2018]]. <br />
<br />
== Ideas for open-ended extensions to the HW assignments ==<br />
<br />
This is not a complete list! You can use any of these as a starting point, but feel free to think up your own extensions.<br />
<br />
HW2 (NB in GuineaPig):<br />
<br />
* The assignment proposes one particular scheme for parallelizing the training/testing algorithm. Consider another parallelization algorithm.<br />
* Implement a similarly scalable Rocchio algorithm and compare it with NB.<br />
* Reimplement the same algorithm in Spark (or some other dataflow language) and compare.<br />
* One of the extensions to GuineaPig not discussed in class is an in-memory map-reduce system. Design an experiment that makes constructive use of it.<br />
<br />
HW3 (Logistic regression and SGD)<br />
* Evaluate the hash trick for Naive Bayes systematically on a series of datasets.<br />
* Implement a parameter-mixing version of logistic regression and evaluate it.<br />
* A [https://www.aclweb.org/anthology/P12-2018 recent paper] proposes (roughly) using SVM with NB-transformed features. Implement this and compare.<br />
* The personalization method described in class is based on [https://www.umiacs.umd.edu/~hal/docs/daume07easyadapt.pdf a transfer learning method] which works similarly. Many Wikipedia pages are available in multiple languages, and words in related languages tend to be lexically similar (e.g., "astrónomo" is Spanish for "astronomer"). Suppose features were character n-grams (e.g., "astr", "stro", "tron", ...) - does domain transfer work for the task of classifying Wikipedia pages? Construct a dataset and an experiment to test this hypothesis.<br />
<br />
=== Notes ===<br />
<br />
* Homeworks, unless otherwise posted, will be due when the next HW comes out.<br />
* Lecture notes and/or slides will be (re)posted around the time of the lectures.<br />
<br />
=== Schedule ===<br />
<br />
* Wed Jan 17, 2018 [[Class meeting for 10-405 Overview|Overview]]. Grading policies and etc, History of Big Data, Complexity theory and cost of important operations<br />
* Mon Jan 22, 2018 [[Class meeting for 10-405 Probability Review|Probability Review]]. Counting for big data and density estimation, streaming Naive Bayes, Rocchio and TFIDF<br />
** '''Start work on''' Assignment 1a: Streaming NB; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1a.pdf <br />
* Wed Jan 24, 2018 [[Class meeting for 10-405 Streaming Naive Bayes|Streaming Naive Bayes]]. Notes on scalable naive bayes, Alternatives to stream and sort, Local counting in stream and sort, Stream and sort examples<br />
* Mon Jan 29, 2018 [[Class meeting for 10-405 Hadoop Overview|Hadoop Overview]]. Intro to Hadoop, Hadoop Streaming, Debugging Hadoop, Combiners<br />
** '''Start work on''' Assignment 1b: Streaming NB on Hadoop; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1b.pdf <br />
* Wed Jan 31, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 1]]. Scalable classification, Abstracts for map-reduce algorithms, Joins in Hadoop<br />
* Mon Feb 5, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 2]]. Guinea Pig intro, Similarity joins, Similarity joins with TFIDF, Parallel simjoins<br />
** '''Start work on''' Assignment 2a: Naive bayes training in Guinea Pig; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2a.pdf <br />
* Wed Feb 7, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 3]]. PageRank, PageRank in Pig and Guinea Pig, K-means in Pig, Spark, Systems built on top of Hadoop<br />
* Mon Feb 12, 2018 [[Class meeting for 10-405 SGD and Hash Kernels|SGD and Hash Kernels]]. Learning as optimization, Logistic regression with SGD, Regularized SGD, Efficient regularized SGD, Hash kernels for logistic regression<br />
** '''Start work on''' Assignment 2b: Naive bayes testing in Guinea Pig. http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2b.pdf <br />
* Wed Feb 14, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 1]]. The "delta trick", Averaged perceptrons, Debugging ML algorithms<br />
* Mon Feb 19, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 2]]. Hash kernels, Ranking perceptrons, Structured perceptrons<br />
** '''Start work on''' Assignment 3: scalable SGD; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw3.pdf<br />
* Wed Feb 21, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 3]]. Iterative parameter mixing paper, Parallel SGD via Param Mixing<br />
* Mon Feb 26, 2018 [[Class meeting for 10-405 SGD for MF|SGD for MF]]. Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD<br />
* Wed Feb 28, 2018 [[Class meeting for 10-405 Guest lecture - tentative|Guest lecture]] - [http://www.cs.cmu.edu/~kijungs/ Kijung Shin]<br />
* Mon Mar 5, 2018 [[Class meeting for 10-405 Midterm review and catchup|Midterm review and catchup]]. Midterm review<br />
** '''Previous assignment due'''<br />
* Wed Mar 7, 2018 [[Class meeting for 10-405 Midterm|Midterm]]. <br />
* Mon Mar 19, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 1]]. BackProp following Nielsen, Deep learning and GPUs, Reverse-mode differentiation (autodiff)<br />
* Wed Mar 21, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 2]]. Reverse-mode differentiation (autodiff), Some systems using autodiff, Details on Wengert lists, Breakdown of xman.py, Inputs, parameters, updates<br />
** '''Start work on''' Assignment 4: Autodiff with IPM part 1/2; Draft at http://www.cs.cmu.edu/~wcohen/10-605/assignments/2017-fall/hw-4+5-autodiff/hw4/main.pdf<br />
* Mon Mar 26, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 3]]. Expressiveness of MLPs, Exploding and vanishing gradients, Modern deep learning models, Word2vec and GloVe, Recursive ANNs, Convolutional ANNs, Architectures using RNNs<br />
* Wed Mar 28, 2018 [[Class meeting for 10-405 Computing with GPUs|Computing with GPUs]]. Introduction to GPUs, CUDA<br />
* Fri Mar 30, 2018<br />
** '''Start work on''' Assignment 5: Autodiff with IPM part 2/2<br />
* Mon Apr 2, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 1]]. Bloom filters, The countmin sketch, CM Sketches in Deep Learning<br />
** '''HW 4 is due'''<br />
* Wed Apr 4, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 2]]. Review of Bloom filters, Locality sensitive hashing, Online LSH<br />
* Mon Apr 9, 2018 [[Class meeting for 10-405 Graph Architectures for ML|Graph Architectures for ML]]. Graph-based ML architectures, Pregel, Signal-collect, GraphLab, PowerGraph, GraphChi, GraphX<br />
* Wed Apr 11, 2018 [[Class meeting for 10-405 SSL on Graphs|SSL on Graphs]]. Semi-supervised learning intro, Multirank-walk SSL method, Harmonic fields, Modified Adsorption SSL method, MAD with countmin sketches<br />
** '''Start work on''' Assignment 6: SSL on a graph in Spark maybe using NELL data?<br />
* Mon Apr 16, 2018 [[Class meeting for 10-405 LDA|LDA 1]]. DGMs for naive Bayes, Gibbs sampling for LDA<br />
* Wed Apr 18, 2018 [[Class meeting for 10-405 LDA|LDA 2]]. Parallelizing LDA, Fast sampling for LDA, DGMs for graphs<br />
* Mon Apr 23, 2018 [[Class meeting for 10-405 Parameter Servers|Parameter Servers]]. Parameter servers, PS vs Hadoop, Stale Synchronous Parallel (SSP) model, Managed Communication in PS, LDA Sampler with PS<br />
* Wed Apr 25, 2018 [[Class meeting for 10-405 Unsupervised Learning On Graphs|Unsupervised Learning On Graphs]]. Spectral clustering, Power iteration clustering, Label propagation for clustering non-graph data, Label propagation for SSL on non-graph data<br />
** '''Last assignment due'''<br />
* Mon Apr 30, 2018 [[Class meeting for 10-405 Review session for final|Review session for final]]. <br />
* Wed May 2, 2018 [[Class meeting for 10-405 Final Exam|Final Exam]].</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Deep_LearningClass meeting for 10-405 Deep Learning2018-03-19T22:17:43Z<p>Wcohen: /* Quizzes */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/deep-1.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-1.pdf PDF].<br />
<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/deep-2.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-2.pdf PDF] (draft)<br />
<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/deep-3.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-3.pdf PDF] (draft)<br />
<br />
=== Quizzes ===<br />
<br />
These are not updated yet --[[User:Wcohen|Wcohen]] ([[User talk:Wcohen|talk]]) 14:58, 19 March 2018 (EDT)<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/79 Quiz for lecture 1]<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/246 Quiz for lecture 2]<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/212 Quiz for lecture 3] - draft<br />
<br />
=== Sample code ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/code/xman.py Expression manager]<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/code/sample-use-of-xman.py Sample use of the expression manager]<br />
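<br />
As a rough illustration of what a Wengert list is, here is a minimal, self-contained sketch that is independent of xman.py and does not use its API. The tuple format, the <code>OPS</code> table, and the function names <code>evaluate</code> and <code>differentiate</code> are assumptions made up for this example. It evaluates f = (w*x + b)^2 as a sequence of assignments and then walks the list in reverse to accumulate derivatives.<br />

```python
# Hypothetical mini-format for illustration only; this is NOT xman.py's API.
# Each op maps to (forward function, list of partial derivatives per input).
OPS = {
    "add": (lambda a, b: a + b, [lambda a, b: 1.0, lambda a, b: 1.0]),
    "mul": (lambda a, b: a * b, [lambda a, b: b, lambda a, b: a]),
    "square": (lambda a: a * a, [lambda a: 2.0 * a]),
}

# Wengert list for f = (w*x + b)^2: one primitive operation per assignment.
WENGERT = [
    ("z1", "mul", ["w", "x"]),
    ("z2", "add", ["z1", "b"]),
    ("f", "square", ["z2"]),
]


def evaluate(wengert, inputs):
    """Forward pass: run the assignments in order, storing every value."""
    vals = dict(inputs)
    for target, op, args in wengert:
        forward, _ = OPS[op]
        vals[target] = forward(*[vals[a] for a in args])
    return vals


def differentiate(wengert, vals, output):
    """Reverse pass: accumulate d(output)/d(variable) for every variable."""
    grads = {name: 0.0 for name in vals}
    grads[output] = 1.0
    for target, op, args in reversed(wengert):
        _, partials = OPS[op]
        arg_vals = [vals[a] for a in args]
        for arg, partial in zip(args, partials):
            # Chain rule: upstream gradient times the local partial derivative.
            grads[arg] += grads[target] * partial(*arg_vals)
    return grads


vals = evaluate(WENGERT, {"w": 2.0, "x": 3.0, "b": 1.0})
grads = differentiate(WENGERT, vals, "f")
print(vals["f"], grads["w"], grads["x"], grads["b"])
```

The reverse pass reuses the values stored by the forward pass, which is why both functions share the <code>vals</code> dictionary.<br />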
<br />
=== Readings ===<br />
<br />
* Automatic differentiation:<br />
** William's notes on [http://www.cs.cmu.edu/~wcohen/10-605/notes/autodiff.pdf automatic differentiation], and the Python code for a simple [http://www.cs.cmu.edu/~wcohen/10-605/code/xman.py Wengert list generator] and a [http://www.cs.cmu.edu/~wcohen/10-605/code/sample-use-of-xman.py sample use of one].<br />
** [https://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation/ Domke's blog post] - clear but not much detail - and [http://colah.github.io/posts/2015-08-Backprop/ another nice blog post].<br />
** The clearest paper I've found is [http://www.bcl.hamilton.ie/~barak/papers/toplas-reverse.pdf Reverse-Mode AD in a Functional Framework: Lambda the Ultimate Backpropagator]<br />
<br />
* More general neural networks: <br />
** [http://neuralnetworksanddeeplearning.com/index.html Neural Networks and Deep Learning] An online book by Michael Nielsen, pitched at an appropriate level for 10-601, which has a bunch of exercises and on-line sample programs in Python.<br />
** For much much more detail, look at [http://www.deeplearningbook.org/ the MIT Press book (in preparation) from Bengio] - it's very complete but also fairly technical.<br />
<br />
=== Things to remember ===<br />
* The underlying reasons deep networks are hard to train
** Exploding/vanishing gradients
** Saturation
* The importance of key recent advances in neural networks:
** Matrix operations and GPU training
** ReLU, cross-entropy, softmax
* How backprop can be generalized to a sequence of assignment operations (autodiff)<br />
** Wengert lists<br />
** How to evaluate and differentiate a Wengert list<br />
* Common architectures<br />
** Multi-layer perceptron<br />
** Recurrent NNs (RNNs) and long short-term memory networks (LSTMs)
** Convolutional Networks (CNNs)</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Deep_LearningClass meeting for 10-405 Deep Learning2018-03-19T21:43:56Z<p>Wcohen: /* Slides */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/deep-1.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-1.pdf PDF].<br />
<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/deep-2.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-2.pdf PDF] (draft)<br />
<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/deep-3.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-3.pdf PDF] (draft)<br />
<br />
=== Quizzes ===<br />
<br />
These are not updated yet --[[User:Wcohen|Wcohen]] ([[User talk:Wcohen|talk]]) 14:58, 19 March 2018 (EDT)<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/79 Quiz for lecture 1]<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/212 Quiz for lecture 3] - draft<br />
<br />
=== Sample code ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/code/xman.py Expression manager]<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/code/sample-use-of-xman.py Sample use of the expression manager]<br />
<br />
=== Readings ===<br />
<br />
* Automatic differentiation:<br />
** William's notes on [http://www.cs.cmu.edu/~wcohen/10-605/notes/autodiff.pdf automatic differentiation], and the Python code for a simple [http://www.cs.cmu.edu/~wcohen/10-605/code/xman.py Wengert list generator] and a [http://www.cs.cmu.edu/~wcohen/10-605/code/sample-use-of-xman.py sample use of one].<br />
** [https://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation/ Domke's blog post] - clear but not much detail - and [http://colah.github.io/posts/2015-08-Backprop/ another nice blog post].<br />
** The clearest paper I've found is [http://www.bcl.hamilton.ie/~barak/papers/toplas-reverse.pdf Reverse-Mode AD in a Functional Framework: Lambda the Ultimate Backpropagator]<br />
<br />
* More general neural networks: <br />
** [http://neuralnetworksanddeeplearning.com/index.html Neural Networks and Deep Learning] An online book by Michael Nielsen, pitched at an appropriate level for 10-601, which has a bunch of exercises and on-line sample programs in Python.<br />
** For much much more detail, look at [http://www.deeplearningbook.org/ the MIT Press book (in preparation) from Bengio] - it's very complete but also fairly technical.<br />
<br />
=== Things to remember ===<br />
* The underlying reasons deep networks are hard to train
** Exploding/vanishing gradients
** Saturation
* The importance of key recent advances in neural networks:
** Matrix operations and GPU training
** ReLU, cross-entropy, softmax
* How backprop can be generalized to a sequence of assignment operations (autodiff)<br />
** Wengert lists<br />
** How to evaluate and differentiate a Wengert list<br />
* Common architectures<br />
** Multi-layer perceptron<br />
** Recurrent NNs (RNNs) and long short-term memory networks (LSTMs)
** Convolutional Networks (CNNs)</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Deep_LearningClass meeting for 10-405 Deep Learning2018-03-19T21:43:25Z<p>Wcohen: /* Quizzes */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/deep-1.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-1.pdf PDF].<br />
<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/deep-2.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-2.pdf PDF].<br />
<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/deep-3.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-3.pdf PDF].<br />
<br />
=== Quizzes ===<br />
<br />
These are not updated yet --[[User:Wcohen|Wcohen]] ([[User talk:Wcohen|talk]]) 14:58, 19 March 2018 (EDT)<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/79 Quiz for lecture 1]<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/212 Quiz for lecture 3] - draft<br />
<br />
=== Sample code ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/code/xman.py Expression manager]<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/code/sample-use-of-xman.py Sample use of the expression manager]<br />
<br />
=== Readings ===<br />
<br />
* Automatic differentiation:<br />
** William's notes on [http://www.cs.cmu.edu/~wcohen/10-605/notes/autodiff.pdf automatic differentiation], and the Python code for a simple [http://www.cs.cmu.edu/~wcohen/10-605/code/xman.py Wengert list generator] and a [http://www.cs.cmu.edu/~wcohen/10-605/code/sample-use-of-xman.py sample use of one].<br />
** [https://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation/ Domke's blog post] - clear but not much detail - and [http://colah.github.io/posts/2015-08-Backprop/ another nice blog post].<br />
** The clearest paper I've found is [http://www.bcl.hamilton.ie/~barak/papers/toplas-reverse.pdf Reverse-Mode AD in a Functional Framework: Lambda the Ultimate Backpropagator]<br />
<br />
* More general neural networks: <br />
** [http://neuralnetworksanddeeplearning.com/index.html Neural Networks and Deep Learning] An online book by Michael Nielsen, pitched at an appropriate level for 10-601, which has a bunch of exercises and on-line sample programs in Python.<br />
** For much much more detail, look at [http://www.deeplearningbook.org/ the MIT Press book (in preparation) from Bengio] - it's very complete but also fairly technical.<br />
<br />
=== Things to remember ===<br />
* The underlying reasons deep networks are hard to train
** Exploding/vanishing gradients
** Saturation
* The importance of key recent advances in neural networks:
** Matrix operations and GPU training
** ReLU, cross-entropy, softmax
* How backprop can be generalized to a sequence of assignment operations (autodiff)<br />
** Wengert lists<br />
** How to evaluate and differentiate a Wengert list<br />
* Common architectures<br />
** Multi-layer perceptron<br />
** Recurrent NNs (RNNs) and long short-term memory networks (LSTMs)
** Convolutional Networks (CNNs)</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Syllabus_for_Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018Syllabus for Machine Learning with Large Datasets 10-405 in Spring 20182018-03-19T21:31:57Z<p>Wcohen: /* Schedule */</p>
<hr />
<div>This is the syllabus for [[Machine Learning with Large Datasets 10-405 in Spring 2018]]. <br />
<br />
== Ideas for open-ended extensions to the HW assignments ==<br />
<br />
This is not a complete list! You can use any of these as a starting point, but feel free to think up your own extensions.<br />
<br />
HW2 (NB in GuineaPig):<br />
<br />
* The assignment proposes one particular scheme for parallelizing the training/testing algorithm. Consider another parallelization algorithm.<br />
* Implement a similarly scalable Rocchio algorithm and compare it with NB.<br />
* Reimplement the same algorithm in Spark (or some other dataflow language) and compare.<br />
* One of the extensions to GuineaPig not discussed in class is an in-memory map-reduce system. Design an experiment that makes use of this constructively.<br />
<br />
HW3 (Logistic regression and SGD)<br />
* Evaluate the hash trick for Naive Bayes systematically on a series of datasets.<br />
* Implement a parameter-mixing version of logistic regression and evaluate it.<br />
* A [https://www.aclweb.org/anthology/P12-2018 recent paper] proposes (roughly) using SVM with NB-transformed features. Implement this and compare.<br />
* The personalization method described in class is based on [https://www.umiacs.umd.edu/~hal/docs/daume07easyadapt.pdf a transfer learning method] which works similarly. Many Wikipedia pages are available in multiple languages, and words in related languages tend to be lexically similar (e.g., "astrónomo" is Spanish for "astronomer"). Suppose features were character n-grams (e.g., "astr", "stro", "tron", ...) - does domain transfer work for the task of classifying Wikipedia pages? Construct a dataset and experiment to test this hypothesis.<br />
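<br />
To make the character n-gram idea and the hash trick concrete, here is a small sketch. It is only a starting point under stated assumptions: the function names <code>char_ngrams</code> and <code>hashed_features</code>, the choice of 4-grams, the bucket count, and the use of CRC32 as the hash function are all illustrative choices, not something prescribed by the assignments.<br />

```python
import zlib


def char_ngrams(text, n=4):
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hashed_features(text, num_buckets=2 ** 10, n=4):
    """Hash trick: count n-grams into a fixed-size vector of buckets."""
    vec = [0] * num_buckets
    for gram in char_ngrams(text, n):
        # CRC32 is stable across runs, unlike Python's built-in hash() on str.
        bucket = zlib.crc32(gram.encode("utf-8")) % num_buckets
        vec[bucket] += 1
    return vec


english = char_ngrams("astronomer")
spanish = char_ngrams("astrónomo")
shared = set(english) & set(spanish)  # n-grams common to both spellings
print(english[:3], sorted(shared))

features = hashed_features("astronomer")
print(sum(features))  # total number of n-grams counted
```

Distinct n-grams may collide in the same bucket; how much that hurts accuracy as <code>num_buckets</code> shrinks is exactly the kind of question a systematic evaluation would measure.<br />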
<br />
=== Notes ===<br />
<br />
* Homeworks, unless otherwise posted, will be due when the next HW comes out.<br />
* Lecture notes and/or slides will be (re)posted around the time of the lectures.<br />
<br />
=== Schedule ===<br />
<br />
* Wed Jan 17, 2018 [[Class meeting for 10-405 Overview|Overview]]. Grading policies etc., History of Big Data, Complexity theory and cost of important operations<br />
* Mon Jan 22, 2018 [[Class meeting for 10-405 Probability Review|Probability Review]]. Counting for big data and density estimation, streaming Naive Bayes, Rocchio and TFIDF<br />
** '''Start work on''' Assignment 1a: Streaming NB; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1a.pdf <br />
* Wed Jan 24, 2018 [[Class meeting for 10-405 Streaming Naive Bayes|Streaming Naive Bayes]]. Notes on scalable naive bayes, Alternatives to stream and sort, Local counting in stream and sort, Stream and sort examples<br />
* Mon Jan 29, 2018 [[Class meeting for 10-405 Hadoop Overview|Hadoop Overview]]. Intro to Hadoop, Hadoop Streaming, Debugging Hadoop, Combiners<br />
** '''Start work on''' Assignment 1b: Streaming NB on Hadoop; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1b.pdf <br />
* Wed Jan 31, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 1]]. Scalable classification, Abstracts for map-reduce algorithms, Joins in Hadoop<br />
* Mon Feb 5, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 2]]. Guinea Pig intro, Similarity joins, Similarity joins with TFIDF, Parallel simjoins<br />
** '''Start work on''' Assignment 2a: Naive bayes training in Guinea Pig; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2a.pdf <br />
* Wed Feb 7, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 3]]. PageRank, PageRank in Pig and Guinea Pig, K-means in Pig, Spark, Systems built on top of Hadoop<br />
* Mon Feb 12, 2018 [[Class meeting for 10-405 SGD and Hash Kernels|SGD and Hash Kernels]]. Learning as optimization, Logistic regression with SGD, Regularized SGD, Efficient regularized SGD, Hash kernels for logistic regression<br />
** '''Start work on''' Assignment 2b: Naive bayes testing in Guinea Pig. http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2b.pdf <br />
* Wed Feb 14, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 1]]. The "delta trick", Averaged perceptrons, Debugging ML algorithms<br />
* Mon Feb 19, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 2]]. Hash kernels, Ranking perceptrons, Structured perceptrons<br />
** '''Start work on''' Assignment 3: scalable SGD; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw3.pdf<br />
* Wed Feb 21, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 3]]. Iterative parameter mixing paper, Parallel SGD via Param Mixing<br />
* Mon Feb 26, 2018 [[Class meeting for 10-405 SGD for MF|SGD for MF]]. Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD<br />
* Wed Feb 28, 2018 [[Class meeting for 10-405 Guest lecture - tentative|Guest lecture]] - [http://www.cs.cmu.edu/~kijungs/ Kijung Shin]<br />
* Mon Mar 5, 2018 [[Class meeting for 10-405 Midterm review and catchup|Midterm review and catchup]]. Midterm review<br />
** '''Previous assignment due'''<br />
* Wed Mar 7, 2018 [[Class meeting for 10-405 Midterm|Midterm]]. <br />
* Mon Mar 19, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 1]]. BackProp following Nielsen, Deep learning and GPUs, Reverse-mode differentiation (autodiff)<br />
* Wed Mar 21, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 2]]. Reverse-mode differentiation (autodiff), Some systems using autodiff, Details on Wengert lists, Breakdown of xman.py, Inputs, parameters, updates<br />
** '''Start work on''' Assignment 4: Autodiff with IPM part 1/2; Draft at http://www.cs.cmu.edu/~wcohen/10-605/assignments/2017-fall/hw-4+5-autodiff/hw4/main.pdf<br />
* Mon Mar 26, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 3]]. Expressiveness of MLPs, Exploding and vanishing gradients, Modern deep learning models, Word2vec and GloVe, Recursive ANNs, Convolutional ANNs, Architectures using RNNs<br />
* Wed Mar 28, 2018 [[Class meeting for 10-405 Computing with GPUs|Computing with GPUs]]. Introduction to GPUs, CUDA<br />
* Mon Apr 2, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 1]]. Bloom filters, The countmin sketch, CM Sketches in Deep Learning<br />
* Wed Apr 4, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 2]]. Review of Bloom filters, Locality sensitive hashing, Online LSH<br />
** '''Start work on''' Assignment 5: Autodiff with IPM part 2/2<br />
* Mon Apr 9, 2018 [[Class meeting for 10-405 Graph Architectures for ML|Graph Architectures for ML]]. Graph-based ML architectures, Pregel, Signal-collect, GraphLab, PowerGraph, GraphChi, GraphX<br />
* Wed Apr 11, 2018 [[Class meeting for 10-405 SSL on Graphs|SSL on Graphs]]. Semi-supervised learning intro, Multirank-walk SSL method, Harmonic fields, Modified Adsorption SSL method, MAD with countmin sketches<br />
** '''Start work on''' Assignment 6: SSL on a graph in Spark maybe using NELL data?<br />
* Mon Apr 16, 2018 [[Class meeting for 10-405 LDA|LDA 1]]. DGMs for naive Bayes, Gibbs sampling for LDA<br />
* Wed Apr 18, 2018 [[Class meeting for 10-405 LDA|LDA 2]]. Parallelizing LDA, Fast sampling for LDA, DGMs for graphs<br />
* Mon Apr 23, 2018 [[Class meeting for 10-405 Parameter Servers|Parameter Servers]]. Parameter servers, PS vs Hadoop, Stale Synchronous Parallel (SSP) model, Managed Communication in PS, LDA Sampler with PS<br />
* Wed Apr 25, 2018 [[Class meeting for 10-405 Unsupervised Learning On Graphs|Unsupervised Learning On Graphs]]. Spectral clustering, Power iteration clustering, Label propagation for clustering non-graph data, Label propagation for SSL on non-graph data<br />
** '''Last assignment due'''<br />
* Mon Apr 30, 2018 [[Class meeting for 10-405 Review session for final|Review session for final]]. <br />
* Wed May 2, 2018 [[Class meeting for 10-405 Final Exam|Final Exam]].</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Deep_LearningClass meeting for 10-405 Deep Learning2018-03-19T18:58:56Z<p>Wcohen: /* Quizzes */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/deep-1.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-1.pdf PDF].<br />
<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/deep-2.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-2.pdf PDF].<br />
<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/deep-3.pptx Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/deep-3.pdf PDF].<br />
<br />
=== Quizzes ===<br />
<br />
These are not updated yet --[[User:Wcohen|Wcohen]] ([[User talk:Wcohen|talk]]) 14:58, 19 March 2018 (EDT)<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/75 Quiz for lecture 1]<br />
* [https://qna.cs.cmu.edu/#/pages/view/79 Quiz for lecture 2]<br />
* [https://qna.cs.cmu.edu/#/pages/view/212 Quiz for lecture 3]<br />
<br />
=== Sample code ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/code/xman.py Expression manager]<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/code/sample-use-of-xman.py Sample use of the expression manager]<br />
<br />
=== Readings ===<br />
<br />
* Automatic differentiation:<br />
** William's notes on [http://www.cs.cmu.edu/~wcohen/10-605/notes/autodiff.pdf automatic differentiation], and the Python code for a simple [http://www.cs.cmu.edu/~wcohen/10-605/code/xman.py Wengert list generator] and a [http://www.cs.cmu.edu/~wcohen/10-605/code/sample-use-of-xman.py sample use of one].<br />
** [https://justindomke.wordpress.com/2009/03/24/a-simple-explanation-of-reverse-mode-automatic-differentiation/ Domke's blog post] - clear but not much detail - and [http://colah.github.io/posts/2015-08-Backprop/ another nice blog post].<br />
** The clearest paper I've found is [http://www.bcl.hamilton.ie/~barak/papers/toplas-reverse.pdf Reverse-Mode AD in a Functional Framework: Lambda the Ultimate Backpropagator]<br />
<br />
* More general neural networks: <br />
** [http://neuralnetworksanddeeplearning.com/index.html Neural Networks and Deep Learning] An online book by Michael Nielsen, pitched at an appropriate level for 10-601, which has a bunch of exercises and on-line sample programs in Python.<br />
** For much much more detail, look at [http://www.deeplearningbook.org/ the MIT Press book (in preparation) from Bengio] - it's very complete but also fairly technical.<br />
<br />
=== Things to remember ===<br />
* The underlying reasons deep networks are hard to train
** Exploding/vanishing gradients
** Saturation
* The importance of key recent advances in neural networks:
** Matrix operations and GPU training
** ReLU, cross-entropy, softmax
* How backprop can be generalized to a sequence of assignment operations (autodiff)<br />
** Wengert lists<br />
** How to evaluate and differentiate a Wengert list<br />
* Common architectures<br />
** Multi-layer perceptron<br />
** Recurrent NNs (RNNs) and long short-term memory networks (LSTMs)
** Convolutional Networks (CNNs)</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Syllabus_for_Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018Syllabus for Machine Learning with Large Datasets 10-405 in Spring 20182018-03-12T14:44:07Z<p>Wcohen: /* Ideas for open-ended extensions to the HW assignments */</p>
<hr />
<div>This is the syllabus for [[Machine Learning with Large Datasets 10-405 in Spring 2018]]. <br />
<br />
== Ideas for open-ended extensions to the HW assignments ==<br />
<br />
This is not a complete list! You can use any of these as a starting point, but feel free to think up your own extensions.<br />
<br />
HW2 (NB in GuineaPig):<br />
<br />
* The assignment proposes one particular scheme for parallelizing the training/testing algorithm. Consider another parallelization algorithm.<br />
* Implement a similarly scalable Rocchio algorithm and compare it with NB.<br />
* Reimplement the same algorithm in Spark (or some other dataflow language) and compare.<br />
* One of the extensions to GuineaPig not discussed in class is an in-memory map-reduce system. Design an experiment that makes use of this constructively.<br />
<br />
HW3 (Logistic regression and SGD)<br />
* Evaluate the hash trick for Naive Bayes systematically on a series of datasets.<br />
* Implement a parameter-mixing version of logistic regression and evaluate it.<br />
* A [https://www.aclweb.org/anthology/P12-2018 recent paper] proposes (roughly) using SVM with NB-transformed features. Implement this and compare.<br />
* The personalization method described in class is based on [https://www.umiacs.umd.edu/~hal/docs/daume07easyadapt.pdf a transfer learning method] which works similarly. Many Wikipedia pages are available in multiple languages, and words in related languages tend to be lexically similar (e.g., "astrónomo" is Spanish for "astronomer"). Suppose features were character n-grams (e.g., "astr", "stro", "tron", ...) - does domain transfer work for the task of classifying Wikipedia pages? Construct a dataset and experiment to test this hypothesis.<br />
<br />
=== Notes ===<br />
<br />
* Homeworks, unless otherwise posted, will be due when the next HW comes out.<br />
* Lecture notes and/or slides will be (re)posted around the time of the lectures.<br />
<br />
=== Schedule ===<br />
<br />
* Wed Jan 17, 2018 [[Class meeting for 10-405 Overview|Overview]]. Grading policies etc., History of Big Data, Complexity theory and cost of important operations<br />
* Mon Jan 22, 2018 [[Class meeting for 10-405 Probability Review|Probability Review]]. Counting for big data and density estimation, streaming Naive Bayes, Rocchio and TFIDF<br />
** '''Start work on''' Assignment 1a: Streaming NB; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1a.pdf <br />
* Wed Jan 24, 2018 [[Class meeting for 10-405 Streaming Naive Bayes|Streaming Naive Bayes]]. Notes on scalable naive bayes, Alternatives to stream and sort, Local counting in stream and sort, Stream and sort examples<br />
* Mon Jan 29, 2018 [[Class meeting for 10-405 Hadoop Overview|Hadoop Overview]]. Intro to Hadoop, Hadoop Streaming, Debugging Hadoop, Combiners<br />
** '''Start work on''' Assignment 1b: Streaming NB on Hadoop; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1b.pdf <br />
* Wed Jan 31, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 1]]. Scalable classification, Abstracts for map-reduce algorithms, Joins in Hadoop<br />
* Mon Feb 5, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 2]]. Guinea Pig intro, Similarity joins, Similarity joins with TFIDF, Parallel simjoins<br />
** '''Start work on''' Assignment 2a: Naive bayes training in Guinea Pig; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2a.pdf <br />
* Wed Feb 7, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 3]]. PageRank, PageRank in Pig and Guinea Pig, K-means in Pig, Spark, Systems built on top of Hadoop<br />
* Mon Feb 12, 2018 [[Class meeting for 10-405 SGD and Hash Kernels|SGD and Hash Kernels]]. Learning as optimization, Logistic regression with SGD, Regularized SGD, Efficient regularized SGD, Hash kernels for logistic regression<br />
** '''Start work on''' Assignment 2b: Naive bayes testing in Guinea Pig. http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2b.pdf <br />
* Wed Feb 14, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 1]]. The "delta trick", Averaged perceptrons, Debugging ML algorithms<br />
* Mon Feb 19, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 2]]. Hash kernels, Ranking perceptrons, Structured perceptrons<br />
** '''Start work on''' Assignment 3: scalable SGD; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw3.pdf<br />
* Wed Feb 21, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 3]]. Iterative parameter mixing paper, Parallel SGD via Param Mixing<br />
* Mon Feb 26, 2018 [[Class meeting for 10-405 SGD for MF|SGD for MF]]. Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD<br />
* Wed Feb 28, 2018 [[Class meeting for 10-405 Guest lecture - tentative|Guest lecture]] - [http://www.cs.cmu.edu/~kijungs/ Kijung Shin]<br />
* Mon Mar 5, 2018 [[Class meeting for 10-405 Midterm review and catchup|Midterm review and catchup]]. Midterm review<br />
** '''Previous assignment due'''<br />
* Wed Mar 7, 2018 [[Class meeting for 10-405 Midterm|Midterm]]. <br />
* Mon Mar 19, 2018 [[Class meeting for 10-405 Computing with GPUs|Computing with GPUs]]. Introduction to GPUs, CUDA, Vectorization<br />
** '''Start work on''' Assignment 4: Autodiff with IPM part 1/2; Draft at http://www.cs.cmu.edu/~wcohen/10-405/assignments/2016-fall/hw-5-autodiff/main.pdf<br />
* Wed Mar 21, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 1]]. Deep learning intro, BackProp following Nielsen, Expressiveness of MLPs, Deep learning and GPUs, Exploding and vanishing gradients, Modern deep learning models<br />
* Mon Mar 26, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 2]]. Reverse-mode differentiation (autodiff), Some systems using autodiff, Details on Wengert lists, Breakdown of xman.py<br />
* Wed Mar 28, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 3]]. Inputs, parameters, updates, Word2vec and GloVe, Recursive ANNs, Convolutional ANNs, Architectures using RNNs<br />
* Mon Apr 2, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 1]]. Bloom filters, The countmin sketch, CM Sketches in Deep Learning<br />
** '''Start work on''' Assignment 5: Autodiff with IPM part 2/2<br />
* Wed Apr 4, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 2]]. Review of Bloom filters, Locality sensitive hashing, Online LSH<br />
* Mon Apr 9, 2018 [[Class meeting for 10-405 Graph Architectures for ML|Graph Architectures for ML]]. Graph-based ML architectures, Pregel, Signal-collect, GraphLab, PowerGraph, GraphChi, GraphX<br />
** '''Start work on''' Assignment 6: SSL in Spark<br />
* Wed Apr 11, 2018 [[Class meeting for 10-405 SSL on Graphs|SSL on Graphs]]. Semi-supervised learning intro, Multirank-walk SSL method, Harmonic fields, Modified Adsorption SSL method, MAD with countmin sketches<br />
* Mon Apr 16, 2018 [[Class meeting for 10-405 LDA|LDA 1]]. DGMs for naive Bayes, Gibbs sampling for LDA<br />
* Wed Apr 18, 2018 [[Class meeting for 10-405 LDA|LDA 2]]. Parallelizing LDA, Fast sampling for LDA, DGMs for graphs<br />
* Mon Apr 23, 2018 [[Class meeting for 10-405 Parameter Servers|Parameter Servers]]. Parameter servers, PS vs Hadoop, Stale Synchronous Parallel (SSP) model, Managed Communication in PS, LDA Sampler with PS<br />
** '''Previous assignment due'''<br />
* Wed Apr 25, 2018 [[Class meeting for 10-405 Unsupervised Learning On Graphs|Unsupervised Learning On Graphs]]. Spectral clustering, Power iteration clustering, Label propagation for clustering non-graph data, Label propagation for SSL on non-graph data<br />
* Mon Apr 30, 2018 [[Class meeting for 10-405 Review session for final|Review session for final]]. <br />
* Wed May 2, 2018 [[Class meeting for 10-405 Final Exam|Final Exam]].</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Syllabus_for_Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018Syllabus for Machine Learning with Large Datasets 10-405 in Spring 20182018-03-12T14:41:38Z<p>Wcohen: /* Ideas for extensions to the HW assignments */</p>
<hr />
<div>This is the syllabus for [[Machine Learning with Large Datasets 10-405 in Spring 2018]]. <br />
<br />
== Ideas for open-ended extensions to the HW assignments ==<br />
<br />
This is not a complete list! You can use any of these as a starting point, but feel free to think up your own extensions.<br />
<br />
HW2 (NB in GuineaPig):<br />
<br />
* The assignment proposes one particular scheme for parallelizing the training/testing algorithm. Consider another parallelization algorithm.<br />
* Implement a similarly scalable Rocchio algorithm and compare it with NB.<br />
* Reimplement the same algorithm in Spark (or some other dataflow language) and compare.<br />
<br />
HW3 (Logistic regression and SGD)<br />
* Evaluate the hash trick for Naive Bayes systematically on a series of datasets.<br />
* Implement a parameter-mixing version of logistic regression and evaluate it.<br />
* A [https://www.aclweb.org/anthology/P12-2018 recent paper] proposes (roughly) using SVM with NB-transformed features. Implement this and compare.<br />
* The personalization method described in class is based on [https://www.umiacs.umd.edu/~hal/docs/daume07easyadapt.pdf a transfer learning method] which works similarly. Many Wikipedia pages are available in multiple languages, and words in related languages tend to be lexically similar (e.g., "astrónomo" is Spanish for "astronomer"). Suppose features were character n-grams (e.g., "astr", "stro", "tron", ...) - does domain transfer work for the task of classifying Wikipedia pages? Construct a dataset and experiment to test this hypothesis.<br />
<br />
=== Notes ===<br />
<br />
* Homeworks, unless otherwise posted, will be due when the next HW comes out.<br />
* Lecture notes and/or slides will be (re)posted around the time of the lectures.<br />
<br />
=== Schedule ===<br />
<br />
* Wed Jan 17, 2018 [[Class meeting for 10-405 Overview|Overview]]. Grading policies etc., History of Big Data, Complexity theory and cost of important operations<br />
* Mon Jan 22, 2018 [[Class meeting for 10-405 Probability Review|Probability Review]]. Counting for big data and density estimation, streaming Naive Bayes, Rocchio and TFIDF<br />
** '''Start work on''' Assignment 1a: Streaming NB; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1a.pdf <br />
* Wed Jan 24, 2018 [[Class meeting for 10-405 Streaming Naive Bayes|Streaming Naive Bayes]]. Notes on scalable naive bayes, Alternatives to stream and sort, Local counting in stream and sort, Stream and sort examples<br />
* Mon Jan 29, 2018 [[Class meeting for 10-405 Hadoop Overview|Hadoop Overview]]. Intro to Hadoop, Hadoop Streaming, Debugging Hadoop, Combiners<br />
** '''Start work on''' Assignment 1b: Streaming NB on Hadoop; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1b.pdf <br />
* Wed Jan 31, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 1]]. Scalable classification, Abstracts for map-reduce algorithms, Joins in Hadoop<br />
* Mon Feb 5, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 2]]. Guinea Pig intro, Similarity joins, Similarity joins with TFIDF, Parallel simjoins<br />
** '''Start work on''' Assignment 2a: Naive bayes training in Guinea Pig; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2a.pdf <br />
* Wed Feb 7, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 3]]. PageRank, PageRank in Pig and Guinea Pig, K-means in Pig, Spark, Systems built on top of Hadoop<br />
* Mon Feb 12, 2018 [[Class meeting for 10-405 SGD and Hash Kernels|SGD and Hash Kernels]]. Learning as optimization, Logistic regression with SGD, Regularized SGD, Efficient regularized SGD, Hash kernels for logistic regression<br />
** '''Start work on''' Assignment 2b: Naive bayes testing in Guinea Pig. http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2b.pdf <br />
* Wed Feb 14, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 1]]. The "delta trick", Averaged perceptrons, Debugging ML algorithms<br />
* Mon Feb 19, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 2]]. Hash kernels, Ranking perceptrons, Structured perceptrons<br />
** '''Start work on''' Assignment 3: scalable SGD; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw3.pdf<br />
* Wed Feb 21, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 3]]. Iterative parameter mixing paper, Parallel SGD via Param Mixing<br />
* Mon Feb 26, 2018 [[Class meeting for 10-405 SGD for MF|SGD for MF]]. Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD<br />
* Wed Feb 28, 2018 [[Class meeting for 10-405 Guest lecture - tentative|Guest lecture]] - [http://www.cs.cmu.edu/~kijungs/ Kijung Shin]<br />
* Mon Mar 5, 2018 [[Class meeting for 10-405 Midterm review and catchup|Midterm review and catchup]]. Midterm review<br />
** '''Previous assignment due'''<br />
* Wed Mar 7, 2018 [[Class meeting for 10-405 Midterm|Midterm]]. <br />
* Mon Mar 19, 2018 [[Class meeting for 10-405 Computing with GPUs|Computing with GPUs]]. Introduction to GPUs, CUDA, Vectorization<br />
** '''Start work on''' Assignment 4: Autodiff with IPM part 1/2; Draft at http://www.cs.cmu.edu/~wcohen/10-405/assignments/2016-fall/hw-5-autodiff/main.pdf<br />
* Wed Mar 21, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 1]]. Deep learning intro, BackProp following Nielsen, Expressiveness of MLPs, Deep learning and GPUs, Exploding and vanishing gradients, Modern deep learning models<br />
* Mon Mar 26, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 2]]. Reverse-mode differentiation (autodiff), Some systems using autodiff, Details on Wengert lists, Breakdown of xman.py<br />
* Wed Mar 28, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 3]]. Inputs, parameters, updates, Word2vec and GloVe, Recursive ANNs, Convolutional ANNs, Architectures using RNNs<br />
* Mon Apr 2, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 1]]. Bloom filters, The countmin sketch, CM Sketches in Deep Learning<br />
** '''Start work on''' Assignment 5: Autodiff with IPM part 2/2<br />
* Wed Apr 4, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 2]]. Review of Bloom filters, Locality sensitive hashing, Online LSH<br />
* Mon Apr 9, 2018 [[Class meeting for 10-405 Graph Architectures for ML|Graph Architectures for ML]]. Graph-based ML architectures, Pregel, Signal-collect, GraphLab, PowerGraph, GraphChi, GraphX<br />
** '''Start work on''' Assignment 6: SSL in Spark<br />
* Wed Apr 11, 2018 [[Class meeting for 10-405 SSL on Graphs|SSL on Graphs]]. Semi-supervised learning intro, Multirank-walk SSL method, Harmonic fields, Modified Adsorption SSL method, MAD with countmin sketches<br />
* Mon Apr 16, 2018 [[Class meeting for 10-405 LDA|LDA 1]]. DGMs for naive Bayes, Gibbs sampling for LDA<br />
* Wed Apr 18, 2018 [[Class meeting for 10-405 LDA|LDA 2]]. Parallelizing LDA, Fast sampling for LDA, DGMs for graphs<br />
* Mon Apr 23, 2018 [[Class meeting for 10-405 Parameter Servers|Parameter Servers]]. Parameter servers, PS vs Hadoop, Stale Synchronous Parallel (SSP) model, Managed Communication in PS, LDA Sampler with PS<br />
** '''Previous assignment due'''<br />
* Wed Apr 25, 2018 [[Class meeting for 10-405 Unsupervised Learning On Graphs|Unsupervised Learning On Graphs]]. Spectral clustering, Power iteration clustering, Label propagation for clustering non-graph data, Label propagation for SSL on non-graph data<br />
* Mon Apr 30, 2018 [[Class meeting for 10-405 Review session for final|Review session for final]]. <br />
* Wed May 2, 2018 [[Class meeting for 10-405 Final Exam|Final Exam]].</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018Machine Learning with Large Datasets 10-405 in Spring 20182018-03-12T14:41:20Z<p>Wcohen: /* Grading Policies */</p>
<hr />
<div>== Instructor and Venue ==<br />
<br />
* Instructor: [http://www.cs.cmu.edu/~wcohen William Cohen], Machine Learning Dept and LTI<br />
* When/where: GHC 4307, 3-4:30pm, Mondays and Wednesdays<br />
* Course Number: ML 10-405<br />
* Prerequisites: <br />
** a CMU intro machine learning course (e.g., 10-701, 10-715, 10-601, 10-401). <br />
*** You may take this 401/601/701 concurrently with 405. <br />
** Good programming skills, e.g., 15-210, or 15-214 or equivalent.<br />
* Course staff: <br />
** William Cohen<br />
** TAs:<br />
*** Vivek Shankar vshanka1@andrew.cmu.edu<br />
*** Nitish Kumar Kulkarni nitishkk@andrew.cmu.edu<br />
*** Vidhan Agarwal vidhana@andrew.cmu.edu<br />
*** Sarthak Garg sarthakg@andrew.cmu.edu<br />
<br />
** Dorothy Holland-Minkley (dfh@cs.cmu.edu, GHC xxxx) - ''course admin''<br />
* Syllabus: [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018]]<br />
<br />
== Office hours ==<br />
<br />
* William: Monday 11am-12, 8217 Gates Hillman<br />
* Vidhan: Tuesday 1:00pm-2:00pm, GHC 6th floor commons (opposite 6105)<br />
* Sarthak: Wednesday 4:30pm-5:30pm, GHC 6th floor commons (opposite 6105)<br />
* Vivek: Thursday 4:30pm-5:30pm, GHC 5th floor commons<br />
* Nitish: Friday 3:00pm-4:00pm, GHC 6th floor commons (opposite 6105)<br />
<br />
== Important virtual places/account information ==<br />
<br />
''Mostly these are TBA''<br />
<br />
* [https://autolab.andrew.cmu.edu/courses/10405-s18/assessments Autolab page]: for assignments<br />
* [http://piazza.com/cmu/spring2018/10405/home Piazza page for class]: home for questions and discussion. You should [http://piazza.com/cmu/spring2018/10405 sign up] for the class with your andrew email.<br />
** For mailing the instructors questions NOT of general interest (eg, "My girlfriend is in town, can I have an extension?") use the emailing list, 10405-Instructors@cs.cmu.edu<br />
* Recorded lectures: Lectures are not recorded for 10-405. However, many of the lectures overlap with 10-605, and recordings of those lectures are still available.<br />
** Previous lectures from 10-605: [https://scs.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx#folderID=7ae48b11-88af-4f02-b179-28f65ea796a4 Fall 2017]<br />
* AFS data repository ''/afs/cs.cmu.edu/project/bigML''<br />
* [https://wiki.pdl.cmu.edu/Stoat Hadoop cluster information] - students will receive an account on the OpenCloud cluster.<br />
** The location of the streaming JAR for this cluster is: /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar<br />
* [[Amazon Elastic MapReduce information]]<br />
* [[Guide for Happy Hadoop Hacking]] - Malcolm's advice on effectively writing Hadoop programs.<br />
* For TAs/instructors ''only'':<br />
** Configuring Hadoop jobs on Autolab [http://curtis.ml.cmu.edu/w/courses/index.php/Guide_for_Configuring_Autolab]<br />
** Autolab AFS dir ''/afs/cs.cmu.edu/academic/class/10405-s18/'' - students cannot read this. TAs, you may need to use 'aklog cs.cmu.edu' before accessing it.<br />
** [https://docs.google.com/document/d/13uGcV-Alqe0Y_WWqRrdGs7VTn8CeZSvxrG9GK4nx5BU/edit#heading=h.my65mtwmsx4h Planning gDoc]: - students cannot read this<br />
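<br />
As a rough illustration of the kind of program that is run through the streaming JAR listed above, here is a minimal word-count sketch in Python. It is not taken from any assignment: the function names are hypothetical, and the map, sort, and reduce phases are emulated in a single process. On the cluster, the mapper and reducer would instead be separate scripts that read stdin and write stdout, passed to Hadoop streaming via its <code>-mapper</code> and <code>-reducer</code> options.<br />

```python
import itertools


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" record per token."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; assumes records arrive sorted by key."""
    keyed = (record.split("\t") for record in sorted_records)
    for word, group in itertools.groupby(keyed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Emulate the map -> sort -> reduce pipeline in one process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_locally(["a b a", "B c"]):
        print(record)
```

Testing the pipeline locally like this (or with a Unix pipe through <code>sort</code>) before submitting a cluster job usually makes debugging much cheaper.<br />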
<br />
== Description ==<br />
<br />
Large datasets are difficult to work with for several reasons. They are difficult to visualize, and it is difficult to understand what sort of errors and biases are present in them. They are computationally expensive to process, and often the cost of learning is hard to predict - for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.<br />
<br />
This course is intended to provide a student practical knowledge of, and experience with, the issues involving large datasets. Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements for learning methods, such as feature hashing and Bloom filters; and techniques for analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity.<br />
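<br />
To give a flavor of one of the memory-reduction techniques mentioned above, here is a minimal sketch of feature hashing. It is an illustration only, not course code: the function name, the default bucket count, and the choice of CRC32 as the hash function are all assumptions made for this example.<br />

```python
import zlib


def hash_features(tokens, num_buckets=2 ** 18):
    """Map a bag of tokens to a sparse {bucket_index: count} dictionary.

    Memory is bounded by num_buckets no matter how large the vocabulary
    grows; the price is that distinct tokens may collide in one bucket.
    """
    vec = {}
    for tok in tokens:
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] = vec.get(idx, 0) + 1
    return vec


if __name__ == "__main__":
    print(hash_features("the cat sat on the mat".split(), num_buckets=16))
```

A learner then keeps one weight per bucket rather than one per distinct feature, so no feature dictionary has to be held in memory.<br />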
<br />
=== Course Outcomes ===<br />
<br />
In summary, students who successfully complete the course should be able to:<br />
<br />
* Implement non-trivial algorithms that make effective use of map-reduce infrastructure, like Hadoop, to work with text corpora and large graphs. <br />
* Analyze the time and communication complexity of map-reduce algorithms.<br />
* Discuss coherently the differences between different dataflow languages.<br />
* Implement algorithms using dataflow languages.<br />
* Implement streaming learning algorithms that make use of sparsity to efficiently process examples from a very large feature set.<br />
* Understand and explain the algorithms used for automatic differentiation in deep-learning platforms such as Theano, Tensorflow, Torch, and PyTorch.<br />
* Implement learning algorithms that make use of parameter servers.<br />
* Explain the scalability issues in probabilistic models estimated by sampling methods, and discuss approaches to fast sampling.<br />
* Understand the potential applications of large graphs to semi-supervised and unsupervised learning problems.<br />
* Implement a general framework for developing gradient-descent optimizers for machine learning applications.<br />
* Explain the differences between, and recognize potential applications of, randomized data structures such as locality sensitive hashing, Bloom filters, random projections, and count-min sketches.<br />
<br />
== Syllabus ==<br />
<br />
* [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018]]<br />
<br />
While 10-405 is new, it covers similar material to 10-605. Here are some previous syllabi.<br />
<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2017]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2016]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2015]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2013]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012]]<br />
<br />
== Recitations ==<br />
<br />
TBA<br />
<br />
== Prerequisites ==<br />
<br />
An introductory course in machine learning (one of 10-401, 10-601, 10-701, or 10-715) is a prerequisite or a co-requisite. Except in very rare cases I will NOT approve substitutions---it's just not feasible for me to check equivalence of a different intro ML class, or set of classes, for a large number of students.<br />
<br />
The course will include several substantial programming assignments, so an additional prerequisite is 15-210, or 15-214, or familiarity with Python, Unix, and good programming skills.<br />
<br />
== 10-605 vs 10-405 == <br />
<br />
10-405 will be similar in scope to 10-605 as offered in previous years. Students are graded based mainly on programming assignments, a midterm, and a final. <br />
<br />
=== Grading Policies ===<br />
<br />
* 60% assignments. There will be six assignments.<br />
** For most assignments there will be a "checkpoint" deliverable partway thru the assignment:<br />
*** The goal of the checkpoint is to ensure you have started the real assignment. <br />
*** It will usually be worth 10 points out of 100.<br />
*** It will not be autolab-graded, so it doesn't need to execute. But if you turn in nothing, or something that clearly doesn't come close to the checkpoint goals, you will be penalized.<br />
* 15% for an in-class midterm exam.<br />
* 20% for an in-class final exam.<br />
* 5% class participation and in-class quizzes.<br />
<br />
* You have the option of replacing one assignment with an "open-ended extension"<br />
** Roughly, this would be an extension to an existing assignment that would increase the programming effort by at least 50%.<br />
** Your deliverable is a handout for the extension, like a TA would produce, and a solution key, including code.<br />
** You should send the instructors email list a rough draft as early as you can, where you sketch out the technical approach you will follow, and mention any research or exploration that you've done to reduce technical risk. <br />
*** ''For example,'' if you decide to implement IPM, will you use All-Reduce or not? If you use it, will you implement it or use an existing implementation, and if it's an existing implementation, have you verified that it works properly on your target cluster? What's the strategy that you will use to ensure mappers don't reload the data from the network?<br />
*** We will try and give you feedback promptly, but you should allocate a few days for us to look over these and approve them.<br />
** Example: for HW1B, implement the Rocchio algorithm as well as naive Bayes and compare them.<br />
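<br />
The component weights listed above combine into a course score as a simple weighted sum. The sketch below is only an illustration: the function and dictionary names are hypothetical, each component is assumed to be expressed as a percentage from 0 to 100, and how the six assignments are averaged into a single assignment percentage is not specified here.<br />

```python
# Weights from the grading policy: 60% assignments, 15% midterm,
# 20% final, 5% participation and quizzes.
WEIGHTS = {
    "assignments": 0.60,
    "midterm": 0.15,
    "final": 0.20,
    "participation": 0.05,
}


def course_score(scores):
    """Return the weighted course percentage for a dict of component scores."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {"assignments": 90, "midterm": 80, "final": 70, "participation": 100}
    # 0.60*90 + 0.15*80 + 0.20*70 + 0.05*100 = 54 + 12 + 14 + 5
    print(round(course_score(example), 2))
```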
<br />
== Policies and FAQ ==<br />
<br />
* If you're taking 10-405 and 10-401/601/701/715 concurrently, you don't need to ask for permission - we can now check co-reqs automatically. Just sign up.<br />
* '''Is there a textbook?''' No: but I do have written notes for some topics (e.g., [http://www.cs.cmu.edu/~wcohen/10-605/notes/scalable-nb-notes.pdf naive bayes and streaming]) which are linked to from the wiki.<br />
* '''I forgot to take a quiz - can I make it up?''' There are no makeups for the quizzes. Their purpose is to ensure that you follow along with the class and review the lectures promptly, and makeups would defeat that goal.<br />
* '''Can I get an extension on ....?''' Generally no, but you can get 50% credit for up to 48 hrs after the assignment is due, and there are a fixed number of grace days. If you have a documented medical issue or something similar email William.<br />
* '''I'm completely stressed out about all my classes, and/or having a terrible time because... What should I do?''' Here's some common-sense advice (in this context "common-sense" means "nobody is surprised to hear this but nobody actually acts on it either"):<br />
** ''Take care of yourself.'' Eat well, make exercise a priority, get enough sleep, and take some time to relax. <br />
** ''Stay organized.'' Spend some time (say 60min) every week reviewing your long and short term goals, and then spend the rest of the time not obsessing about what's due, but working efficiently on your priorities. I personally recommend the [https://en.wikipedia.org/wiki/Getting_Things_Done gtd] approach.<br />
** ''Get help when you need it.'' All of us benefit from support sometimes. You are not alone. As a CMU student you have access to many resources, and asking for support sooner rather than later is often helpful. If you're experiencing any academic stress, difficult life events, or feelings like anxiety or depression, Counseling and Psychological Services (CaPS) is available: call 412-268-2922 or visit their website at http://www.cmu.edu/counseling/. Or consider reaching out to a friend, faculty or family member you trust.<br />
** ''Look out for each other.'' If your friends or classmates are stressed out or depressed, talk to them and remind them that support is available. Importantly, if you or someone you know is feeling suicidal or in danger of self-harm, call someone immediately, day or night. You can call CaPS (412-268-2922), the Re:solve Crisis Network (888-796-8226) or, if the situation is life threatening, the police (on campus: CMU Police: 412-268-2323, and off campus: 911).<br />
<br />
=== Policy on Collaboration among Students ===<br />
<br />
The purpose of student collaboration is to facilitate learning, not to circumvent it. Studying the material in groups is strongly encouraged. It is also allowed to seek help from other students in understanding the material needed to solve a particular homework problem, provided '''no written notes''' are shared, or are taken at that time, and provided learning is facilitated, not circumvented. The actual solution must be done by each student alone, and the student should be ready to reproduce their solution upon request.<br />
<br />
'''The presence or absence of any form of help or collaboration, whether given or received, must be explicitly stated and disclosed in full by all involved''', on the first page of their assignment. Specifically, each assignment solution must start by answering the following questions:<br />
<br />
(1) Did you receive any help whatsoever from anyone in solving this assignment? Yes / No.<br />
If you answered 'yes', give full details: _______________ (e.g. "Jane explained to me what is asked in Question 3.4")<br />
(2) Did you give any help whatsoever to anyone in solving this assignment? Yes / No.<br />
If you answered 'yes', give full details: _______________ (e.g. "I pointed Joe to section 2.3 to help him with Question 2")<br />
<br />
Additionally, if you share any material or collaborate in any way between the time the assignment is due and the last time when the assignment can be handed in for partial credit, you must notify the instructor of this help in writing (eg via email).<br />
<br />
Collaboration without full disclosure will be handled severely, in compliance with CMU's Policy on Cheating and Plagiarism. Except in unusual extenuating circumstances, my policy is to '''fail the student(s) for the entire course'''.<br />
<br />
As a related point, many of the homework assignments used in this class may have been used in prior versions of this class, or in classes at other institutions. Avoiding the use of heavily tested assignments will detract from the main purpose of these assignments, which is to reinforce the material and stimulate thinking. Because some of these assignments may have been used before, solutions to them may be (or may have been) available online, or from other people. It is explicitly forbidden to use any such sources, or to consult people who have solved these problems before. '''You must solve the homework assignments completely on your own'''. I will mostly rely on your wisdom and honor to follow this rule, but if a violation is detected it will be dealt with harshly. Collaboration with other students who have previously taken the class is allowed, but only under the conditions stated below.<br />
<br />
These policies are the same as were used in [http://www.cs.cmu.edu/~roni/10601/ Dr. Rosenfeld's previous version of 601], and my version of [[Machine_Learning_10-601_in_Fall_2014#Policy_on_Collaboration_among_Students|10-601 in fall 2014]].</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018Machine Learning with Large Datasets 10-405 in Spring 20182018-03-12T14:41:01Z<p>Wcohen: /* Grading Policies */</p>
<hr />
<div>== Instructor and Venue ==<br />
<br />
* Instructor: [http://www.cs.cmu.edu/~wcohen William Cohen], Machine Learning Dept and LTI<br />
* When/where: GHC 4307, 3-4:30pm, Mondays and Wednesdays<br />
* Course Number: ML 10-405<br />
* Prerequisites: <br />
** a CMU intro machine learning course (e.g., 10-701, 10-715, 10-601, 10-401). <br />
*** You may take this 401/601/701 concurrently with 405. <br />
** Good programming skills, e.g., 15-210, or 15-214 or equivalent.<br />
* Course staff: <br />
** William Cohen<br />
** TAs:<br />
*** Vivek Shankar vshanka1@andrew.cmu.edu<br />
*** Nitish Kumar Kulkarni nitishkk@andrew.cmu.edu<br />
*** Vidhan Agarwal vidhana@andrew.cmu.edu<br />
*** Sarthak Garg sarthakg@andrew.cmu.edu<br />
<br />
** Dorothy Holland-Minkley (dfh@cs.cmu.edu, GHC xxxx) - ''course admin''<br />
* Syllabus: [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018]]<br />
<br />
== Office hours ==<br />
<br />
* William: Monday 11am-12, 8217 Gates Hillman<br />
* Vidhan: Tuesday 1:00pm-2:00pm, GHC 6th floor commons (opposite 6105)<br />
* Sarthak: Wednesday 4:30pm-5:30pm, GHC 6th floor commons (opposite 6105)<br />
* Vivek: Thursday 4:30pm-5:30pm, GHC 5th floor commons<br />
* Nitish: Friday 3:00pm-4:00pm, GHC 6th floor commons (opposite 6105)<br />
<br />
== Important virtual places/account information ==<br />
<br />
''Mostly these are TBA''<br />
<br />
* [https://autolab.andrew.cmu.edu/courses/10405-s18/assessments Autolab page]: for assignments<br />
* [http://piazza.com/cmu/spring2018/10405/home Piazza page for class]: home for questions and discussion. You should [http://piazza.com/cmu/spring2018/10405 sign up] for the class with your andrew email.<br />
** For mailing the instructors questions NOT of general interest (eg, "My girlfriend is in town, can I have an extension?") use the emailing list, 10405-Instructors@cs.cmu.edu<br />
* Recorded lectures: Lectures are not recorded for 10-405. However, many of the lectures overlap with 10-605, and recordings of those lectures are still available.<br />
** Previous lectures from 10-605: [https://scs.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx#folderID=7ae48b11-88af-4f02-b179-28f65ea796a4 Fall 2017]<br />
* AFS data repository ''/afs/cs.cmu.edu/project/bigML''<br />
* [https://wiki.pdl.cmu.edu/Stoat Hadoop cluster information] - students will receive an account on the OpenCloud cluster.<br />
** The location of the streaming JAR for this cluster is: /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar<br />
* [[Amazon Elastic MapReduce information]]<br />
* [[Guide for Happy Hadoop Hacking]] - Malcolm's advice on effectively writing Hadoop programs.<br />
* For TAs/instructors ''only'':<br />
** Configuring Hadoop jobs on Autolab [http://curtis.ml.cmu.edu/w/courses/index.php/Guide_for_Configuring_Autolab]<br />
** Autolab AFS dir ''/afs/cs.cmu.edu/academic/class/10405-s18/'' - students cannot read this. TAs, you may need to use 'aklog cs.cmu.edu' before accessing it.<br />
** [https://docs.google.com/document/d/13uGcV-Alqe0Y_WWqRrdGs7VTn8CeZSvxrG9GK4nx5BU/edit#heading=h.my65mtwmsx4h Planning gDoc]: - students cannot read this<br />
<br />
== Description ==<br />
<br />
Large datasets are difficult to work with for several reasons. They are difficult to visualize, and it is difficult to understand what sort of errors and biases are present in them. They are computationally expensive to process, and often the cost of learning is hard to predict - for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.<br />
<br />
This course is intended to provide a student practical knowledge of, and experience with, the issues involving large datasets. Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements for learning methods, such as feature hashing and Bloom filters; and techniques for analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity.<br />
<br />
=== Course Outcomes ===<br />
<br />
In summary, students who successfully complete the course should be able to:<br />
<br />
* Implement non-trivial algorithms that make effective use of map-reduce infrastructure, like Hadoop, to work with text corpora and large graphs. <br />
* Analyze the time and communication complexity of map-reduce algorithms.<br />
* Discuss coherently the differences between different dataflow languages.<br />
* Implement algorithms using dataflow languages.<br />
* Implement streaming learning algorithms that make use of sparsity to efficiently process examples from a very large feature set.<br />
* Understand and explain the algorithms used for automatic differentiation in deep-learning platforms such as Theano, Tensorflow, Torch, and PyTorch.<br />
* Implement learning algorithms that make use of parameter servers.<br />
* Explain the scalability issues in probabilistic models estimated by sampling methods, and discuss approaches to fast sampling.<br />
* Understand the potential applications of large graphs to semi-supervised and unsupervised learning problems.<br />
* Implement a general framework for developing gradient-descent optimizers for machine learning applications.<br />
* Explain the differences between, and recognize potential applications of, randomized data structures such as locality sensitive hashing, Bloom filters, random projections, and count-min sketches.<br />
<br />
== Syllabus ==<br />
<br />
* [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018]]<br />
<br />
While 10-405 is new, it covers similar material to 10-605. Here are some previous syllabi.<br />
<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2017]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2016]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2015]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2013]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012]]<br />
<br />
== Recitations ==<br />
<br />
TBA<br />
<br />
== Prerequisites ==<br />
<br />
An introductory course in machine learning (one of 10-401, 10-601, 10-701, or 10-715) is a prerequisite or a co-requisite. Except in very rare cases I will NOT approve substitutions---it's just not feasible for me to check equivalence of a different intro ML class, or set of classes, for a large number of students.<br />
<br />
The course will include several substantial programming assignments, so an additional prerequisite is 15-210, or 15-214, or familiarity with Python, Unix, and good programming skills.<br />
<br />
== 10-605 vs 10-405 == <br />
<br />
10-405 will be similar in scope to 10-605 as offered in previous years. Students are graded based mainly on programming assignments, a midterm, and a final. <br />
<br />
=== Grading Policies ===<br />
<br />
* 60% assignments. There will be six assignments.<br />
** For most assignments there will be a "checkpoint" deliverable partway thru the assignment:<br />
*** The goal of the checkpoint is to ensure you have started the real assignment. <br />
*** It will usually be worth 10 points out of 100.<br />
*** It will not be autolab-graded, so it doesn't need to execute. But if you turn in nothing, or something that clearly doesn't come close to the checkpoint goals, you will be penalized.<br />
* 15% for an in-class midterm exam.<br />
* 20% for an in-class final exam.<br />
* 5% class participation and in-class quizzes.<br />
<br />
* You have the option of replacing one assignment with an "open-ended extension"<br />
** Roughly, this would be an extension to an existing assignment that would increase the programming effort by at least 50%.<br />
** Your deliverable is a handout for the extension, like a TA would produce, and a solution key, including code.<br />
** You should send the instructors' email list a rough draft as early as you can, where you sketch out the technical approach you will follow, and mention any research or exploration that you've done to reduce technical risk. <br />
*** For example, if you decide to implement iterative parameter mixing (IPM), will you use All-Reduce or not? If you use it, will you implement it or use an existing implementation, and if it's an existing implementation, have you verified that it works properly on your target cluster? What strategy will you use to ensure mappers don't reload the data from the network?<br />
*** We will try to give you feedback promptly, but you should allocate a few days for us to look over these and approve them.<br />
** Example: for HW1B, implement the Rocchio algorithm as well as naive Bayes and compare them.<br />
<br />
== Policies and FAQ ==<br />
<br />
* If you're taking 10-405 and 10-401/601/701/715 concurrently, you don't need to ask for permission - we can now check co-reqs automatically. Just sign up.<br />
* '''Is there a textbook?''' No: but I do have written notes for some topics (e.g., [http://www.cs.cmu.edu/~wcohen/10-605/notes/scalable-nb-notes.pdf naive bayes and streaming]) which are linked to from the wiki.<br />
* '''I forgot to take a quiz - can I make it up?''' There are no makeups for the quizzes. Their purpose is to ensure that you follow along with the class and review the lectures promptly, and makeups would defeat that goal.<br />
* '''Can I get an extension on ....?''' Generally no, but you can get 50% credit for up to 48 hours after the assignment is due, and there is a fixed number of grace days. If you have a documented medical issue or something similar, email William.<br />
* '''I'm completely stressed out about all my classes, and/or having a terrible time because... What should I do?''' Here's some common-sense advice (in this context "common-sense" means "nobody is surprised to hear this but nobody actually acts on it either"):<br />
** ''Take care of yourself.'' Eat well, make exercise a priority, get enough sleep, and take some time to relax. <br />
** ''Stay organized.'' Spend some time (say 60min) every week reviewing your long- and short-term goals, and then spend the rest of the time not obsessing about what's due, but working efficiently on your priorities. I personally recommend the [https://en.wikipedia.org/wiki/Getting_Things_Done GTD] approach.<br />
** ''Get help when you need it.'' All of us benefit from support sometimes. You are not alone. As a CMU student you have access to many resources, and asking for support sooner rather than later is often helpful. If you're experiencing any academic stress, difficult life events, or feelings like anxiety or depression, Counseling and Psychological Services (CaPS) is available: call 412-268-2922 or visit their website at http://www.cmu.edu/counseling/. Or consider reaching out to a friend, faculty or family member you trust.<br />
** ''Look out for each other.'' If your friends or classmates are stressed out or depressed, talk to them and remind them that support is available. Importantly, if you or someone you know is feeling suicidal or in danger of self-harm, call someone immediately, day or night. You can call CaPS (412-268-2922), the Re:solve Crisis Network (888-796-8226) or, if the situation is life threatening, the police (on campus: CMU Police: 412-268-2323, and off campus: 911).<br />
<br />
=== Policy on Collaboration among Students ===<br />
<br />
The purpose of student collaboration is to facilitate learning, not to circumvent it. Studying the material in groups is strongly encouraged. It is also allowed to seek help from other students in understanding the material needed to solve a particular homework problem, provided '''no written notes''' are shared, or are taken at that time, and provided learning is facilitated, not circumvented. The actual solution must be done by each student alone, and the student should be ready to reproduce their solution upon request.<br />
<br />
'''The presence or absence of any form of help or collaboration, whether given or received, must be explicitly stated and disclosed in full by all involved''', on the first page of their assignment. Specifically, each assignment solution must start by answering the following questions:<br />
<br />
(1) Did you receive any help whatsoever from anyone in solving this assignment? Yes / No.<br />
If you answered 'yes', give full details: _______________ (e.g. "Jane explained to me what is asked in Question 3.4")<br />
(2) Did you give any help whatsoever to anyone in solving this assignment? Yes / No.<br />
If you answered 'yes', give full details: _______________ (e.g. "I pointed Joe to section 2.3 to help him with Question 2").<br />
<br />
Additionally, if you share any material or collaborate in any way between the time the assignment is due and the last time when the assignment can be handed in for partial credit, you must notify the instructor of this help in writing (e.g., via email).<br />
<br />
Collaboration without full disclosure will be handled severely, in compliance with CMU's Policy on Cheating and Plagiarism. Except in unusual extenuating circumstances, my policy is to '''fail the student(s) for the entire course'''.<br />
<br />
As a related point, many of the homework assignments used in this class may have been used in prior versions of this class, or in classes at other institutions. Avoiding the use of heavily tested assignments would detract from the main purpose of these assignments, which is to reinforce the material and stimulate thinking. Because some of these assignments may have been used before, solutions to them may be (or may have been) available online, or from other people. It is explicitly forbidden to use any such sources, or to consult people who have solved these problems before. '''You must solve the homework assignments completely on your own'''. I will mostly rely on your wisdom and honor to follow this rule, but if a violation is detected it will be dealt with harshly. Collaboration with other students who have previously taken the class is allowed, but only under the conditions stated above.<br />
<br />
These policies are the same as were used in [http://www.cs.cmu.edu/~roni/10601/ Dr. Rosenfeld's previous version of 601], and my version of [[Machine_Learning_10-601_in_Fall_2014#Policy_on_Collaboration_among_Students|10-601 in fall 2014]].</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018Machine Learning with Large Datasets 10-405 in Spring 20182018-03-12T14:39:18Z<p>Wcohen: /* Grading Policies */</p>
<hr />
<div>== Instructor and Venue ==<br />
<br />
* Instructor: [http://www.cs.cmu.edu/~wcohen William Cohen], Machine Learning Dept and LTI<br />
* When/where: GHC 4307, 3-4:30pm, Mondays and Wednesdays<br />
* Course Number: ML 10-405<br />
* Prerequisites: <br />
** a CMU intro machine learning course (e.g., 10-701, 10-715, 10-601, 10-401). <br />
*** You may take 401/601/701 concurrently with 405. <br />
** Good programming skills, e.g., 15-210, or 15-214 or equivalent.<br />
* Course staff: <br />
** William Cohen<br />
** TAs:<br />
*** Vivek Shankar vshanka1@andrew.cmu.edu<br />
*** Nitish Kumar Kulkarni nitishkk@andrew.cmu.edu<br />
*** Vidhan Agarwal vidhana@andrew.cmu.edu<br />
*** Sarthak Garg sarthakg@andrew.cmu.edu<br />
<br />
** Dorothy Holland-Minkley (dfh@cs.cmu.edu, GHC xxxx) - ''course admin''<br />
* Syllabus: [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018]]<br />
<br />
== Office hours ==<br />
<br />
* William: Monday 11am-12, 8217 Gates Hillman<br />
* Vidhan: Tuesday 1:00pm-2:00pm, GHC 6th floor commons (opposite 6105)<br />
* Sarthak: Wednesday 4:30pm-5:30pm, GHC 6th floor commons (opposite 6105)<br />
* Vivek: Thursday 4:30pm-5:30pm, GHC 5th floor commons<br />
* Nitish: Friday 3:00pm-4:00pm, GHC 6th floor commons (opposite 6105)<br />
<br />
== Important virtual places/account information ==<br />
<br />
''Mostly these are TBA''<br />
<br />
* [https://autolab.andrew.cmu.edu/courses/10405-s18/assessments Autolab page]: for assignments<br />
* [http://piazza.com/cmu/spring2018/10405/home Piazza page for class]: home for questions and discussion. You should [http://piazza.com/cmu/spring2018/10405 sign up] for the class with your andrew email.<br />
** For mailing the instructors questions NOT of general interest (e.g., "My girlfriend is in town, can I have an extension?") use the mailing list, 10405-Instructors@cs.cmu.edu<br />
* Recorded lectures: Lectures are not recorded for 10-405. However, many of the lectures overlap with 10-605, and recordings of those lectures are still available.<br />
** Previous lectures from 10-605: [https://scs.hosted.panopto.com/Panopto/Pages/Sessions/List.aspx#folderID=7ae48b11-88af-4f02-b179-28f65ea796a4 Fall 2017]<br />
* AFS data repository ''/afs/cs.cmu.edu/project/bigML''<br />
* [https://wiki.pdl.cmu.edu/Stoat Hadoop cluster information] - students will receive an account on the OpenCloud cluster.<br />
** The location of the streaming JAR for this cluster is: /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar<br />
* [[Amazon Elastic MapReduce information]]<br />
* [[Guide for Happy Hadoop Hacking]] - Malcolm's advice on effectively writing Hadoop programs.<br />
* For TAs/instructors ''only'':<br />
** Configuring Hadoop jobs on Autolab [http://curtis.ml.cmu.edu/w/courses/index.php/Guide_for_Configuring_Autolab]<br />
** Autolab AFS dir ''/afs/cs.cmu.edu/academic/class/10405-s18/'' - students cannot read this. TAs, you may need to use 'aklog cs.cmu.edu' before accessing it.<br />
** [https://docs.google.com/document/d/13uGcV-Alqe0Y_WWqRrdGs7VTn8CeZSvxrG9GK4nx5BU/edit#heading=h.my65mtwmsx4h Planning gDoc]: - students cannot read this<br />
<br />
== Description ==<br />
<br />
Large datasets are difficult to work with for several reasons. They are difficult to visualize, and it is difficult to understand what sort of errors and biases are present in them. They are computationally expensive to process, and often the cost of learning is hard to predict - for instance, an algorithm that runs quickly on a dataset that fits in memory may be exorbitantly expensive when the dataset is too large for memory. Large datasets may also display qualitatively different behavior in terms of which learning methods produce the most accurate predictions.<br />
<br />
This course is intended to provide a student practical knowledge of, and experience with, the issues involving large datasets. Among the issues considered are: scalable learning techniques, such as streaming machine learning techniques; parallel infrastructures such as map-reduce; practical techniques for reducing the memory requirements for learning methods, such as feature hashing and Bloom filters; and techniques for analysis of programs in terms of memory, disk usage, and (for parallel methods) communication complexity.<br />
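As a small illustration of one technique named above, feature hashing can be sketched in a few lines. This is only a minimal sketch, not course-provided code: the function name <code>hash_features</code>, the parameter <code>num_buckets</code>, and the choice of CRC32 as the hash function are all assumptions made for this example.<br />

```python
import zlib


def hash_features(tokens, num_buckets=16):
    """Map a bag of tokens to a fixed-length count vector (the "hash trick").

    Memory use depends only on num_buckets, not on the vocabulary size.
    Distinct tokens may collide in the same bucket; that is the price
    paid for the bounded memory footprint.
    """
    vec = [0] * num_buckets
    for tok in tokens:
        # Assumption: CRC32 is used because it is deterministic across runs;
        # Python's built-in hash() for strings is salted per process.
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] += 1
    return vec


if __name__ == "__main__":
    print(hash_features("the cat sat on the mat".split(), num_buckets=8))
```

The vector length is fixed in advance, so a streaming learner never needs to build or store a dictionary from feature names to indices.<br />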
<br />
=== Course Outcomes ===<br />
<br />
In summary, students who successfully complete the course should be able to:<br />
<br />
* Implement non-trivial algorithms that make effective use of map-reduce infrastructure, like Hadoop, to work with text corpora and large graphs. <br />
* Analyze the time and communication complexity of map-reduce algorithms.<br />
* Discuss coherently the differences between different dataflow languages.<br />
* Implement algorithms using dataflow languages.<br />
* Implement streaming learning algorithms that make use of sparsity to efficiently process examples from a very large feature set.<br />
* Understand and explain the algorithms used for automatic differentiation in deep-learning platforms such as Theano, Tensorflow, Torch, and PyTorch.<br />
* Implement learning algorithms that make use of parameter servers.<br />
* Explain the scalability issues in probabilistic models estimated by sampling methods, and discuss approaches to fast sampling.<br />
* Understand the potential applications of large graphs to semisupervised and unsupervised learning problems.<br />
* Implement a general framework for developing gradient-descent optimizers for machine learning applications.<br />
* Explain the differences between, and recognize potential applications of, randomized data structures such as locality sensitive hashing, Bloom filters, random projections, and count-min sketches.<br />
<br />
== Syllabus ==<br />
<br />
* [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018]]<br />
<br />
While 10-405 is new, it covers similar material to 10-605. Here are some previous syllabi.<br />
<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2017]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2016]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2015]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2015]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2014]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2013]]<br />
* [[Syllabus for Machine Learning with Large Datasets 10-605 in Spring 2012]]<br />
<br />
== Recitations ==<br />
<br />
TBA<br />
<br />
== Prerequisites ==<br />
<br />
An introductory course in machine learning (one of 10-401, 10-601, 10-701, or 10-715) is a prerequisite or a co-requisite. Except in very rare cases I will NOT approve substitutions---it's just not feasible for me to check equivalence of a different intro ML class, or set of classes, for a large number of students.<br />
<br />
The course will include several substantial programming assignments, so an additional prerequisite is 15-210, or 15-214, or familiarity with Python, Unix, and good programming skills.<br />
<br />
== 10-605 vs 10-405 == <br />
<br />
10-405 will be similar in scope to 10-605 as offered in previous years. Students are graded based mainly on programming assignments, a midterm, and a final. <br />
<br />
=== Grading Policies ===<br />
<br />
* 60% assignments. There will be six assignments.<br />
** For most assignments there will be a "checkpoint" deliverable partway through the assignment:<br />
*** The goal of the checkpoint is to ensure you have started the real assignment. <br />
*** It will usually be worth 10 points out of 100.<br />
*** It will not be autolab-graded, so it doesn't need to execute. But if you turn in nothing, or something that clearly doesn't come close to the checkpoint goals, you will be penalized.<br />
* 15% for an in-class midterm exam.<br />
* 20% for an in-class final exam.<br />
* 5% class participation and in-class quizzes.<br />
<br />
* You have the option of replacing one assignment with an "open-ended extension"<br />
** Roughly, this would be an extension to an existing assignment that would increase the programming effort by at least 50%.<br />
** Your deliverable is a handout for the extension, like a TA would produce, and a solution key, including code.<br />
** You should send William a rough draft, where you sketch out the technical approach you will follow, and mention any research or exploration that you've done. For example, if you decide to implement iterative parameter mixing (IPM), will you use All-Reduce or not? If you use it, will you implement it or use an existing implementation, and if it's an existing implementation, have you verified that it works properly on your target cluster? What strategy will you use to ensure mappers don't reload the data from the network?<br />
** Example: for HW1B, implement the Rocchio algorithm as well as naive Bayes and compare them.<br />
<br />
== Policies and FAQ ==<br />
<br />
* If you're taking 10-405 and 10-401/601/701/715 concurrently, you don't need to ask for permission - we can now check co-reqs automatically. Just sign up.<br />
* '''Is there a textbook?''' No: but I do have written notes for some topics (e.g., [http://www.cs.cmu.edu/~wcohen/10-605/notes/scalable-nb-notes.pdf naive bayes and streaming]) which are linked to from the wiki.<br />
* '''I forgot to take a quiz - can I make it up?''' There are no makeups for the quizzes. Their purpose is to ensure that you follow along with the class and review the lectures promptly, and makeups would defeat that goal.<br />
* '''Can I get an extension on ....?''' Generally no, but you can get 50% credit for up to 48 hours after the assignment is due, and there is a fixed number of grace days. If you have a documented medical issue or something similar, email William.<br />
* '''I'm completely stressed out about all my classes, and/or having a terrible time because... What should I do?''' Here's some common-sense advice (in this context "common-sense" means "nobody is surprised to hear this but nobody actually acts on it either"):<br />
** ''Take care of yourself.'' Eat well, make exercise a priority, get enough sleep, and take some time to relax. <br />
** ''Stay organized.'' Spend some time (say 60min) every week reviewing your long- and short-term goals, and then spend the rest of the time not obsessing about what's due, but working efficiently on your priorities. I personally recommend the [https://en.wikipedia.org/wiki/Getting_Things_Done GTD] approach.<br />
** ''Get help when you need it.'' All of us benefit from support sometimes. You are not alone. As a CMU student you have access to many resources, and asking for support sooner rather than later is often helpful. If you're experiencing any academic stress, difficult life events, or feelings like anxiety or depression, Counseling and Psychological Services (CaPS) is available: call 412-268-2922 or visit their website at http://www.cmu.edu/counseling/. Or consider reaching out to a friend, faculty or family member you trust.<br />
** ''Look out for each other.'' If your friends or classmates are stressed out or depressed, talk to them and remind them that support is available. Importantly, if you or someone you know is feeling suicidal or in danger of self-harm, call someone immediately, day or night. You can call CaPS (412-268-2922), the Re:solve Crisis Network (888-796-8226) or, if the situation is life threatening, the police (on campus: CMU Police: 412-268-2323, and off campus: 911).<br />
<br />
=== Policy on Collaboration among Students ===<br />
<br />
The purpose of student collaboration is to facilitate learning, not to circumvent it. Studying the material in groups is strongly encouraged. It is also allowed to seek help from other students in understanding the material needed to solve a particular homework problem, provided '''no written notes''' are shared, or are taken at that time, and provided learning is facilitated, not circumvented. The actual solution must be done by each student alone, and the student should be ready to reproduce their solution upon request.<br />
<br />
'''The presence or absence of any form of help or collaboration, whether given or received, must be explicitly stated and disclosed in full by all involved''', on the first page of their assignment. Specifically, each assignment solution must start by answering the following questions:<br />
<br />
(1) Did you receive any help whatsoever from anyone in solving this assignment? Yes / No.<br />
If you answered 'yes', give full details: _______________ (e.g. "Jane explained to me what is asked in Question 3.4")<br />
(2) Did you give any help whatsoever to anyone in solving this assignment? Yes / No.<br />
If you answered 'yes', give full details: _______________ (e.g. "I pointed Joe to section 2.3 to help him with Question 2").<br />
<br />
Additionally, if you share any material or collaborate in any way between the time the assignment is due and the last time when the assignment can be handed in for partial credit, you must notify the instructor of this help in writing (e.g., via email).<br />
<br />
Collaboration without full disclosure will be handled severely, in compliance with CMU's Policy on Cheating and Plagiarism. Except in unusual extenuating circumstances, my policy is to '''fail the student(s) for the entire course'''.<br />
<br />
As a related point, many of the homework assignments used in this class may have been used in prior versions of this class, or in classes at other institutions. Avoiding the use of heavily tested assignments would detract from the main purpose of these assignments, which is to reinforce the material and stimulate thinking. Because some of these assignments may have been used before, solutions to them may be (or may have been) available online, or from other people. It is explicitly forbidden to use any such sources, or to consult people who have solved these problems before. '''You must solve the homework assignments completely on your own'''. I will mostly rely on your wisdom and honor to follow this rule, but if a violation is detected it will be dealt with harshly. Collaboration with other students who have previously taken the class is allowed, but only under the conditions stated above.<br />
<br />
These policies are the same as were used in [http://www.cs.cmu.edu/~roni/10601/ Dr. Rosenfeld's previous version of 601], and my version of [[Machine_Learning_10-601_in_Fall_2014#Policy_on_Collaboration_among_Students|10-601 in fall 2014]].</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Parallel_PerceptronsClass meeting for 10-405 Parallel Perceptrons2018-03-06T15:28:53Z<p>Wcohen: /* Readings */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pdf in PDF].<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pdf in PDF].<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pdf in PDF].<br />
<br />
=== Quiz ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/55 Lecture 1 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/62 Lecture 2 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/245 Lecture 3 quiz]<br />
<br />
=== Readings ===<br />
* [http://www.cs.cmu.edu/~wcohen/10-601/vp-notes/vp.pdf Notes on voted perceptron.] Note: these were updated --[[User:Wcohen|Wcohen]] ([[User talk:Wcohen|talk]]) 10:28, 6 March 2018 (EST)<br />
<br />
=== Optional Readings ===<br />
* [https://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf Large Margin Classification Using the Perceptron Algorithm], Freund and Schapire, MLJ 1999<br />
* [http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models], Collins EMNLP 2002.<br />
* [http://www.ryanmcd.com/papers/parallel_perceptronNAACL2010.pdf Distributed Training Strategies for the Structured Perceptron], R. McDonald, K. Hall and G. Mann, North American Association for Computational Linguistics (NAACL), 2010.<br />
<br />
=== Things to Remember ===<br />
<br />
* Definition of mistake bound<br />
* Definition of perceptron algorithm<br />
** Mistake bound analysis for perceptrons, in terms of margin and example radius<br />
* Converting perceptrons to batch: voted perceptron, averaged perceptron<br />
* Definition of the ranking perceptron and kernel perceptron<br />
* Relationship of hash trick to kernels<br />
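The perceptron items above can be made concrete with a short sketch. This is a minimal illustration rather than the course's reference implementation: the names <code>dot</code> and <code>train_perceptron</code>, the sparse-dictionary representation of examples, and the naive way the average is accumulated are all assumptions chosen for readability. For reference, the classical mistake bound for data separable with margin γ and lying within radius R is (R/γ)².<br />

```python
def dot(w, x):
    """Inner product of a sparse weight dict and a sparse example dict."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())


def train_perceptron(examples, epochs=1):
    """Train a perceptron and an averaged perceptron.

    examples: list of (x, y) pairs, x a dict feature -> value, y in {-1, +1}.
    Returns (w, w_avg, mistakes).
    """
    w = {}         # current weight vector
    total = {}     # running sum of the weight vector after each example
    mistakes = 0
    seen = 0
    for _ in range(epochs):
        for x, y in examples:
            if y * dot(w, x) <= 0:          # mistake (or zero margin)
                mistakes += 1
                for f, v in x.items():      # additive update: w += y * x
                    w[f] = w.get(f, 0.0) + y * v
            # Naive averaging: O(|w|) work per example. Practical versions
            # update the sum lazily, but this keeps the sketch short.
            for f, v in w.items():
                total[f] = total.get(f, 0.0) + v
            seen += 1
    w_avg = {f: v / seen for f, v in total.items()} if seen else {}
    return w, w_avg, mistakes


if __name__ == "__main__":
    data = [({"a": 1.0}, +1), ({"b": 1.0}, -1)]
    print(train_perceptron(data, epochs=5))
```

The averaged vector <code>w_avg</code> is one way to convert the online learner to a batch classifier; the voted perceptron instead keeps each intermediate weight vector together with its survival count.<br />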
<br />
* Parallelizing streaming ML algorithms<br />
** Parameter mixing, and the effect it has on the mistake bounds for perceptrons<br />
** Iterative parameter mixing, and the effect it has on the mistake bounds for perceptrons<br />
* The ALLREDUCE algorithm and its complexity</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Parallel_PerceptronsClass meeting for 10-405 Parallel Perceptrons2018-03-06T15:28:31Z<p>Wcohen: /* Readings */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pdf in PDF].<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pdf in PDF].<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pdf in PDF].<br />
<br />
=== Quiz ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/55 Lecture 1 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/62 Lecture 2 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/245 Lecture 3 quiz]<br />
<br />
=== Readings ===<br />
* [http://www.cs.cmu.edu/~wcohen/10-601/vp-notes/vp.pdf Notes on voted perceptron.] (Updated --[[User:Wcohen|Wcohen]] ([[User talk:Wcohen|talk]]) 10:28, 6 March 2018 (EST))<br />
<br />
=== Optional Readings ===<br />
* [https://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf Large Margin Classification Using the Perceptron Algorithm], Freund and Schapire, MLJ 1999<br />
* [http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models], Collins EMNLP 2002.<br />
* [http://www.ryanmcd.com/papers/parallel_perceptronNAACL2010.pdf Distributed Training Strategies for the Structured Perceptron], R. McDonald, K. Hall and G. Mann, North American Association for Computational Linguistics (NAACL), 2010.<br />
<br />
=== Things to Remember ===<br />
<br />
* Definition of mistake bound<br />
* Definition of perceptron algorithm<br />
** Mistake bound analysis for perceptrons, in terms of margin and example radius<br />
* Converting perceptrons to batch: voted perceptron, averaged perceptron<br />
* Definition of the ranking perceptron and kernel perceptron<br />
* Relationship of hash trick to kernels<br />
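For the parameter-mixing items on this page, the following sketch simulates iterative parameter mixing on a single machine. It is a hedged illustration only: the shards are processed in an ordinary loop rather than on separate workers, the mixture is a uniform average (other mixing weights are possible), and the names <code>perceptron_epoch</code>, <code>mix</code>, and <code>iterative_parameter_mixing</code> are hypothetical.<br />

```python
def dot(w, x):
    """Inner product of a sparse weight dict and a sparse example dict."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())


def perceptron_epoch(w_init, shard):
    """One perceptron pass over a single shard, starting from w_init."""
    w = dict(w_init)                      # copy so shards do not interfere
    for x, y in shard:
        if y * dot(w, x) <= 0:
            for f, v in x.items():
                w[f] = w.get(f, 0.0) + y * v
    return w


def mix(weight_vectors):
    """Uniformly average a list of sparse weight vectors."""
    k = len(weight_vectors)
    mixed = {}
    for w in weight_vectors:
        for f, v in w.items():
            mixed[f] = mixed.get(f, 0.0) + v / k
    return mixed


def iterative_parameter_mixing(shards, epochs=5):
    """Each epoch: train on every shard from the shared weights, then mix."""
    w = {}
    for _ in range(epochs):
        # On a cluster these per-shard passes would run in parallel.
        local = [perceptron_epoch(w, shard) for shard in shards]
        # The averaging step is where an all-reduce could be used.
        w = mix(local)
    return w


if __name__ == "__main__":
    shards = [[({"a": 1.0}, +1)], [({"b": 1.0}, -1)]]
    print(iterative_parameter_mixing(shards, epochs=2))
```

Plain (non-iterative) parameter mixing corresponds to training each shard independently to completion and mixing once at the end, instead of re-synchronizing after every epoch.<br />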
<br />
* Parallelizing streaming ML algorithms<br />
** Parameter mixing, and the effect it has on the mistake bounds for perceptrons<br />
** Iterative parameter mixing, and the effect it has on the mistake bounds for perceptrons<br />
* The ALLREDUCE algorithm and its complexity</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Parallel_PerceptronsClass meeting for 10-405 Parallel Perceptrons2018-03-06T15:28:13Z<p>Wcohen: /* Readings */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pdf in PDF].<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pdf in PDF].<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pdf in PDF].<br />
<br />
=== Quiz ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/55 Lecture 1 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/62 Lecture 2 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/245 Lecture 3 quiz]<br />
<br />
=== Readings ===<br />
* [http://www.cs.cmu.edu/~wcohen/10-601/vp-notes/vp.pdf Notes on voted perceptron.]<br />
<br />
=== Optional Readings ===<br />
* [https://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf Large Margin Classification Using the Perceptron Algorithm], Freund and Schapire, MLJ 1999<br />
* [http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models], Collins EMNLP 2002.<br />
* [http://www.ryanmcd.com/papers/parallel_perceptronNAACL2010.pdf Distributed Training Strategies for the Structured Perceptron], R. McDonald, K. Hall and G. Mann, North American Association for Computational Linguistics (NAACL), 2010.<br />
<br />
=== Things to Remember ===<br />
<br />
* Definition of mistake bound<br />
* Definition of perceptron algorithm<br />
** Mistake bound analysis for perceptrons, in terms of margin and example radius<br />
* Converting perceptrons to batch: voted perceptron, averaged perceptron<br />
* Definition of the ranking perceptron and kernel perceptron<br />
* Relationship of hash trick to kernels<br />
<br />
* Parallelizing streaming ML algorithms<br />
** Parameter mixing, and the effect it has on the mistake bounds for perceptrons<br />
** Iterative parameter mixing, and the effect it has on the mistake bounds for perceptrons<br />
* The ALLREDUCE algorithm and its complexity</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Parallel_PerceptronsClass meeting for 10-405 Parallel Perceptrons2018-03-06T15:28:05Z<p>Wcohen: /* Readings */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pdf in PDF].<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pdf in PDF].<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pdf in PDF].<br />
<br />
=== Quiz ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/55 Lecture 1 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/62 Lecture 2 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/245 Lecture 3 quiz]<br />
<br />
=== Readings ===<br />
* [http://www.cs.cmu.edu/~wcohen/10-601/vp-notes/vp.pdf Notes on voted perceptron.]<br />
<br />
=== Optional Readings ===<br />
* [https://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf Large Margin Classification Using the Perceptron Algorithm], Freund and Schapire, MLJ 1999<br />
* [http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models], Collins EMNLP 2002.<br />
* [http://www.ryanmcd.com/papers/parallel_perceptronNAACL2010.pdf Distributed Training Strategies for the Structured Perceptron], R. McDonald, K. Hall and G. Mann, North American Association for Computational Linguistics (NAACL), 2010.<br />
<br />
=== Things to Remember ===<br />
<br />
* Definition of mistake bound<br />
* Definition of perceptron algorithm<br />
** Mistake bound analysis for perceptrons, in terms of margin and example radius<br />
* Converting perceptrons to batch: voted perceptron, averaged perceptron<br />
* Definition of the ranking perceptron and kernel perceptron<br />
* Relationship of hash trick to kernels<br />
<br />
* Parallelizing streaming ML algorithms<br />
** Parameter mixing, and the effect it has on the mistake bounds for perceptrons<br />
** Iterative parameter mixing, and the effect it has on the mistake bounds for perceptrons<br />
* The ALLREDUCE algorithm and its complexity</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Midterm_review_and_catchupClass meeting for 10-405 Midterm review and catchup2018-03-05T18:36:37Z<p>Wcohen: </p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
Slides:<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/midterm-review.pptx Slides - Powerpoint] <br />
* [http://www.cs.cmu.edu/~wcohen/10-605/midterm-review.pdf Slides - PDF]<br />
<br />
Practice questions from 10-605 in previous years (some questions removed, so these don't exactly reflect the length of the exam):<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2017-midterm.pdf practice questions for midterm from 2017 + answers].<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2016-midterm.pdf practice questions for midterm from 2016].<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/f2015-midterm.pdf practice questions for midterm from 2015]. <br />
<br />
The final is cumulative so some of the questions below are also relevant:<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/s2014-final.pdf practice questions for final, 2014].<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/practice-questions/s2015-final.pdf practice questions for final, 2015].<br />
<br />
It's also good to review the quizzes, and the review points on the web pages for the lectures.<br />
<br />
There's no quiz this week!</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_SGD_for_MFClass meeting for 10-405 SGD for MF2018-03-05T16:39:50Z<p>Wcohen: /* Things to Remember */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018|Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-405/sgd-for-mf.pptx Matrix Factorization via SGD- Powerpoint]<br />
* [http://www.cs.cmu.edu/~wcohen/10-405/sgd-for-mf.pdf Matrix Factorization via SGD - PDF]<br />
<br />
=== Quiz ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/67 Quiz for today]<br />
<br />
=== Papers Discussed ===<br />
<br />
* [http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent], Gemulla et al, KDD 2011.<br />
<br />
=== Things to Remember ===<br />
<br />
* Definition of matrix factorization<br />
* Common applications of matrix factorization, and how they map into the MF problem<br />
* Loss functions for matrix factorization that are appropriate for collaborative filtering<br />
* Algorithm and updates for SGD implementation of matrix factorization<br />
* DSGD algorithm - what is done in parallel and what is done sequentially<br />
* Definitions: stratum (aka "diagonal"), interchangeable steps</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Parallel_PerceptronsClass meeting for 10-405 Parallel Perceptrons2018-03-05T16:39:10Z<p>Wcohen: /* Things to Remember */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pdf in PDF].<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pdf in PDF].<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pdf in PDF].<br />
<br />
=== Quiz ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/55 Lecture 1 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/62 Lecture 2 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/245 Lecture 3 quiz]<br />
<br />
=== Readings ===<br />
* [http://www.cs.cmu.edu/~wcohen/10-707/vp-notes/vp.pdf Notes on voted perceptron.]<br />
* [https://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf Large Margin Classification Using the Perceptron Algorithm], Freund and Schapire, MLJ 1999<br />
* [http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models], Collins EMNLP 2002.<br />
*[http://www.ryanmcd.com/papers/parallel_perceptronNAACL2010.pdf Distributed Training Strategies for the Structured Perceptron], R. McDonald, K. Hall and G. Mann, North American Association for Computational Linguistics (NAACL), 2010.<br />
<br />
=== Things to Remember ===<br />
<br />
* Definition of mistake bound<br />
* Definition of perceptron algorithm<br />
** Mistake bound analysis for perceptrons, in terms of margin and example radius<br />
* Converting perceptrons to batch: voted perceptron, averaged perceptron<br />
* Definition of the ranking perceptron and kernel perceptron<br />
* Relationship of hash trick to kernels<br />
<br />
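The relationship between the hash trick and kernels can be sketched as follows: hashing maps a sparse feature dictionary into a fixed number of buckets, and the inner product of two hashed vectors approximates the inner product of the originals. This is a minimal illustration under stated assumptions: the function names are hypothetical, <code>zlib.crc32</code> stands in for whatever hash function a real system would use, and taking the sign from one bit of the same hash is a simplification of the signed-hash scheme in the Weinberger et al. paper.

```python
import zlib


def hash_features(x, num_buckets):
    """Map a sparse feature dict to a dense vector with num_buckets entries.

    Assumption: crc32 is used as the hash, and its top bit picks the sign.
    """
    h = [0.0] * num_buckets
    for name, value in x.items():
        code = zlib.crc32(name.encode("utf-8"))
        bucket = code % num_buckets
        sign = 1.0 if (code >> 31) & 1 == 0 else -1.0
        h[bucket] += sign * value
    return h


def dense_dot(u, v):
    """Inner product of two dense vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    x1 = {"astronomer": 1.0, "telescope": 2.0}
    x2 = {"astronomer": 1.0, "galaxy": 1.0}
    # Approximates the true inner product (1.0), up to hash collisions.
    print(dense_dot(hash_features(x1, 64), hash_features(x2, 64)))
```

Because the mapping is linear, a linear learner trained on hashed vectors behaves like a learner using the induced kernel; collisions add noise that shrinks as the number of buckets grows.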
* Parallelizing streaming ML algorithms<br />
** Parameter mixing, and the effect it has on the mistake bounds for perceptrons<br />
** Iterative parameter mixing, and the effect it has on the mistake bounds for perceptrons<br />
* The ALLREDUCE algorithm and its complexity</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Parallel_PerceptronsClass meeting for 10-405 Parallel Perceptrons2018-03-05T16:37:01Z<p>Wcohen: /* Things to Remember */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pdf in PDF].<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pdf in PDF].<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pdf in PDF].<br />
<br />
=== Quiz ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/55 Lecture 1 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/62 Lecture 2 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/245 Lecture 3 quiz]<br />
<br />
=== Readings ===<br />
* [http://www.cs.cmu.edu/~wcohen/10-707/vp-notes/vp.pdf Notes on voted perceptron.]<br />
* [https://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf Large Margin Classification Using the Perceptron Algorithm], Freund and Schapire, MLJ 1999<br />
* [http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models], Collins EMNLP 2002.<br />
*[http://www.ryanmcd.com/papers/parallel_perceptronNAACL2010.pdf Distributed Training Strategies for the Structured Perceptron], R. McDonald, K. Hall and G. Mann, North American Association for Computational Linguistics (NAACL), 2010.<br />
<br />
=== Things to Remember ===<br />
<br />
* Definition of mistake bound<br />
* Definition of perceptron algorithm<br />
** Mistake bound analysis for perceptrons, in terms of margin and example radius<br />
* Converting perceptrons to batch: voted perceptron, averaged perceptron<br />
* Definition of the ranking perceptron and kernel perceptron<br />
* Relationship of hash trick to kernels</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Parallel_PerceptronsClass meeting for 10-405 Parallel Perceptrons2018-03-05T16:36:04Z<p>Wcohen: /* Things to Remember */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pdf in PDF].<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pdf in PDF].<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pdf in PDF].<br />
<br />
=== Quiz ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/55 Lecture 1 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/62 Lecture 2 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/245 Lecture 3 quiz]<br />
<br />
=== Readings ===<br />
* [http://www.cs.cmu.edu/~wcohen/10-707/vp-notes/vp.pdf Notes on voted perceptron.]<br />
* [https://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf Large Margin Classification Using the Perceptron Algorithm], Freund and Schapire, MLJ 1999<br />
* [http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models], Collins EMNLP 2002.<br />
*[http://www.ryanmcd.com/papers/parallel_perceptronNAACL2010.pdf Distributed Training Strategies for the Structured Perceptron], R. McDonald, K. Hall and G. Mann, North American Association for Computational Linguistics (NAACL), 2010.<br />
<br />
=== Things to Remember ===<br />
<br />
* Definition of mistake bound<br />
* Definition of perceptron algorithm<br />
** Mistake bound analysis for perceptrons, in terms of margin and example radius<br />
* Converting perceptrons to batch: voted perceptron, averaged perceptron<br />
* Definition of the ranking perceptron and kernel perceptron</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Parallel_PerceptronsClass meeting for 10-405 Parallel Perceptrons2018-03-05T16:32:50Z<p>Wcohen: /* Readings */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* Lecture 1: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-1.pdf in PDF].<br />
* Lecture 2: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-2.pdf in PDF].<br />
* Lecture 3: [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/perceptrons-3.pdf in PDF].<br />
<br />
=== Quiz ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/55 Lecture 1 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/62 Lecture 2 quiz]<br />
* [https://qna.cs.cmu.edu/#/pages/view/245 Lecture 3 quiz]<br />
<br />
=== Readings ===<br />
* [http://www.cs.cmu.edu/~wcohen/10-707/vp-notes/vp.pdf Notes on voted perceptron.]<br />
* [https://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf Large Margin Classification Using the Perceptron Algorithm], Freund and Schapire, MLJ 1999<br />
* [http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf Discriminative Training Methods for Hidden Markov Models], Collins EMNLP 2002.<br />
*[http://www.ryanmcd.com/papers/parallel_perceptronNAACL2010.pdf Distributed Training Strategies for the Structured Perceptron], R. McDonald, K. Hall and G. Mann, North American Association for Computational Linguistics (NAACL), 2010.<br />
<br />
=== Things to Remember ===</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_SGD_and_Hash_KernelsClass meeting for 10-405 SGD and Hash Kernels2018-03-05T16:32:05Z<p>Wcohen: /* Things to Remember */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
Stochastic gradient descent:<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-405/sgd.pptx Slides in Powerpoint]<br />
* [http://www.cs.cmu.edu/~wcohen/10-405/sgd.pdf Slides in PDF]<br />
<br />
=== Quiz ===<br />
<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/50 Today's quiz]<br />
<br />
=== Readings for the Class ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-605/notes/sgd-notes.pdf William's notes on SGD]<br />
<br />
=== Optional readings ===<br />
<br />
* For logistic regression, and the sparse updates for it: [http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression], Carpenter, Bob. 2008. See also [http://alias-i.com/lingpipe/demos/tutorial/logistic-regression/read-me.html his blog post] on logistic regression. I also recommend [http://www.cs.cmu.edu/~wcohen/10-605/notes/elkan-logreg.pdf Charles Elkan's notes on logistic regression] (local saved copy).<br />
* For hash kernels: [http://arxiv.org/pdf/0902.2206.pdf Feature Hashing for Large Scale Multitask Learning], Weinberger et al, ICML 2009.<br />
<br />
=== Things to Remember ===<br />
<br />
<br />
* Approach of learning by optimization<br />
* Optimization goal for logistic regression<br />
* Key terms: logistic function, sigmoid function, log conditional likelihood, loss function, stochastic gradient descent<br />
* Updates for logistic regression, with and without regularization<br />
* The formal properties of sparse logistic regression<br />
** Whether it is exact or approximate<br />
** How it changes memory and time usage<br />
* Formalization of logistic regression as matching expectations between data and model<br />
* Regularization and how it interacts with overfitting<br />
* How "sparsifying" regularization affects run-time and memory<br />
* What the "hash trick" is and why it should work</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Workflows_For_HadoopClass meeting for 10-405 Workflows For Hadoop2018-03-05T15:12:58Z<p>Wcohen: /* Things to Remember */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* First lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-405/workflows-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/workflows-1.pdf in PDF].<br />
* Second lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-405/workflows-2.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/workflows-2.pdf in PDF].<br />
* Third lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-405/workflows-3.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/workflows-3.pdf in PDF].<br />
<br />
=== Quizzes ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/170 Quiz for first lecture]<br />
* [https://qna.cs.cmu.edu/#/pages/view/175 Quiz for second lecture]<br />
* [https://qna.cs.cmu.edu/#/pages/view/178 Quiz for third lecture]<br />
<br />
=== Readings ===<br />
<br />
* Pig: none required. A nice on-line resource for PIG is the on-line version of the O'Reilly Book [http://chimera.labs.oreilly.com/books/1234000001811/index.html Programming Pig].<br />
* Optional: [http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html Introduction to Information Retrieval], by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly [http://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html self-contained chapter on the vector space model], including Rocchio's method.<br />
<br />
=== Also discussed ===<br />
<br />
* Joachims, Thorsten, [http://www.cs.cornell.edu/People/tj/publications/joachims_97a.pdf A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization]. Proceedings of International Conference on Machine Learning (ICML), 1997.<br />
* Relevance Feedback in Information Retrieval, SMART Retrieval System Experiments in Automatic Document Processing, 1971, Prentice Hall Inc. <br />
* Schapire et al, [http://dl.acm.org/citation.cfm?id=290996 Boosting and Rocchio applied to text filtering], SIGIR 98.<br />
<br />
=== Things to Remember ===<br />
<br />
* Combiners and how/when they improve efficiency<br />
* What dataflow languages are, what sort of abstract operations they use, and what the complexity of these operations is.<br />
* How joins are implemented in dataflow<br />
** The difference between map-side and reduce-side joins and how they are implemented<br />
** When to use map-side vs reduce-side joins<br />
* Definition of a similarity join/soft join.<br />
<br />
* Complexity of operations like similarity join, TFIDF computation, etc.<br />
<br />
* What the PageRank algorithm is<br />
* Common ways of representing graphs in a map-reduce system<br />
** A list of edges<br />
** A list of nodes with outlinks<br />
* Why iteration is often expensive in pure dataflow algorithms.<br />
* How Spark differs from and/or is similar to other dataflow algorithms<br />
** Actions/transformations<br />
** RDDs<br />
** Caching<br />
<br />
* How to implement k-means in a map-reduce setting with dataflow<br />
** Not discussed in class, but in the slide deck!</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Workflows_For_HadoopClass meeting for 10-405 Workflows For Hadoop2018-03-05T15:11:06Z<p>Wcohen: /* Things to Remember */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* First lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-405/workflows-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/workflows-1.pdf in PDF].<br />
* Second lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-405/workflows-2.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/workflows-2.pdf in PDF].<br />
* Third lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-405/workflows-3.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/workflows-3.pdf in PDF].<br />
<br />
=== Quizzes ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/170 Quiz for first lecture]<br />
* [https://qna.cs.cmu.edu/#/pages/view/175 Quiz for second lecture]<br />
* [https://qna.cs.cmu.edu/#/pages/view/178 Quiz for third lecture]<br />
<br />
=== Readings ===<br />
<br />
* Pig: none required. A nice on-line resource for PIG is the on-line version of the O'Reilly Book [http://chimera.labs.oreilly.com/books/1234000001811/index.html Programming Pig].<br />
* Optional: [http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html Introduction to Information Retrieval], by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly [http://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html self-contained chapter on the vector space model], including Rocchio's method.<br />
<br />
=== Also discussed ===<br />
<br />
* Joachims, Thorsten, [http://www.cs.cornell.edu/People/tj/publications/joachims_97a.pdf A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization]. Proceedings of International Conference on Machine Learning (ICML), 1997.<br />
* Relevance Feedback in Information Retrieval, SMART Retrieval System Experiments in Automatic Document Processing, 1971, Prentice Hall Inc. <br />
* Schapire et al, [http://dl.acm.org/citation.cfm?id=290996 Boosting and Rocchio applied to text filtering], SIGIR 98.<br />
<br />
=== Things to Remember ===<br />
<br />
* Combiners and how/when they improve efficiency<br />
* What dataflow languages are, what sort of abstract operations they use, and what the complexity of these operations is.<br />
* How joins are implemented in dataflow<br />
** The difference between map-side and reduce-side joins and how they are implemented<br />
** When to use map-side vs reduce-side joins<br />
* What the PageRank algorithm is<br />
* Common ways of representing graphs in a map-reduce system<br />
** A list of edges<br />
** A list of nodes with outlinks<br />
* Why iteration is often expensive in pure dataflow algorithms.<br />
* How Spark differs from and/or is similar to other dataflow algorithms<br />
** Actions/transformations<br />
** RDDs<br />
** Caching<br />
* Definition of a similarity join/soft join.<br />
<br />
* How to implement k-means in a map-reduce setting with dataflow<br />
** Not discussed in class, but in the slide deck!</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Workflows_For_HadoopClass meeting for 10-405 Workflows For Hadoop2018-03-05T15:10:21Z<p>Wcohen: /* Things to Remember */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* First lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-405/workflows-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/workflows-1.pdf in PDF].<br />
* Second lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-405/workflows-2.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/workflows-2.pdf in PDF].<br />
* Third lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-405/workflows-3.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-405/workflows-3.pdf in PDF].<br />
<br />
=== Quizzes ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/170 Quiz for first lecture]<br />
* [https://qna.cs.cmu.edu/#/pages/view/175 Quiz for second lecture]<br />
* [https://qna.cs.cmu.edu/#/pages/view/178 Quiz for third lecture]<br />
<br />
=== Readings ===<br />
<br />
* Pig: none required. A nice on-line resource for PIG is the on-line version of the O'Reilly Book [http://chimera.labs.oreilly.com/books/1234000001811/index.html Programming Pig].<br />
* Optional: [http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html Introduction to Information Retrieval], by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly [http://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html self-contained chapter on the vector space model], including Rocchio's method.<br />
<br />
=== Also discussed ===<br />
<br />
* Joachims, Thorsten, [http://www.cs.cornell.edu/People/tj/publications/joachims_97a.pdf A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization]. Proceedings of International Conference on Machine Learning (ICML), 1997.<br />
* Relevance Feedback in Information Retrieval, SMART Retrieval System Experiments in Automatic Document Processing, 1971, Prentice Hall Inc. <br />
* Schapire et al, [http://dl.acm.org/citation.cfm?id=290996 Boosting and Rocchio applied to text filtering], SIGIR 98.<br />
<br />
=== Things to Remember ===<br />
<br />
* Combiners and how/when they improve efficiency<br />
* What dataflow languages are, what sort of abstract operations they use, and what the complexity of these operations is.<br />
* How joins are implemented in dataflow (and the difference between map-side and reduce-side joins)<br />
* What the PageRank algorithm is<br />
* Common ways of representing graphs in a map-reduce system<br />
** A list of edges<br />
** A list of nodes with outlinks<br />
* Why iteration is often expensive in pure dataflow algorithms.<br />
* How Spark differs from and/or is similar to other dataflow algorithms<br />
** Actions/transformations<br />
** RDDs<br />
** Caching<br />
* Definition of a similarity join/soft join.<br />
<br />
* How to implement k-means in a map-reduce setting with dataflow<br />
** Not discussed in class, but in the slide deck!</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Hadoop_OverviewClass meeting for 10-405 Hadoop Overview2018-03-05T15:09:13Z<p>Wcohen: /* Things to Remember */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
Map-reduce overview:<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-405/map-reduce.pptx Map-Reduce overview - ppt]<br />
* [http://www.cs.cmu.edu/~wcohen/10-405/map-reduce.pdf Map-Reduce overview - pdf]<br />
<br />
=== Quiz ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/244 Today's quiz]<br />
<br />
=== Readings for the Class ===<br />
<br />
* There are lots of on-line tutorials for Hadoop. The [http://shop.oreilly.com/product/0636920010388.do O'Reilly Book] is also quite good. You might also look at this [http://www.cs.cmu.edu/~wcohen/10-605/annotated-hadoop-log.txt annotated log of me interacting with streaming Hadoop].<br />
<br />
=== Things to Remember ===<br />
<br />
* Hadoop terminology: HDFS, shards, job tracker, combiner, mapper, reducer, ...<br />
* The primary phases of a map-reduce computation, and what happens in each<br />
** Map<br />
** Shuffle/sort<br />
** Reduce<br />
* Where data might be transmitted across the network<br />
* How data is stored in Hadoop<br />
** Consequences of large block size for streaming and storage efficiency</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Streaming_Naive_BayesClass meeting for 10-405 Streaming Naive Bayes2018-03-05T15:07:08Z<p>Wcohen: /* Things to Remember */</p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
* [http://www.cs.cmu.edu/~wcohen/10-405/stream-and-sort.pptx Slides in Powerpoint] - the stream-and-sort pattern, and large-vocabulary Naive Bayes<br />
* [http://www.cs.cmu.edu/~wcohen/10-405/stream-and-sort.pdf Slides in PDF] <br />
<br />
=== Quiz ===<br />
<br />
* [https://qna.cs.cmu.edu/#/pages/view/161 Today's quiz].<br />
<br />
=== Readings for the Class ===<br />
<br />
* Required: [http://www.cs.cmu.edu/~wcohen/10-605/notes/scalable-nb-notes.pdf my notes on streaming and Naive Bayes]<br />
* Optional: If you're interested in reading more about smoothing for naive Bayes, I recommend this paper: Peng, Fuchun, Dale Schuurmans, and Shaojun Wang. "Augmenting naive Bayes classifiers with statistical language models." Information Retrieval 7.3 (2004): 317-345.<br />
<br />
=== Things to Remember ===<br />
<br />
* What TFIDF weighting is and how to compute it<br />
** Computing DFs requires an extra pass over the training set<br />
* How it's used in Rocchio<br />
<br />
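The TFIDF items above can be sketched in a few lines. This is a hedged illustration, not the course's implementation: it assumes documents are already tokenized into lists of words, uses one common weighting variant (log(1 + tf) times log(N / df), then unit-length normalization), and the function names are hypothetical. It makes the extra pass explicit: document frequencies are collected over the whole collection before any document can be weighted.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """First pass: count, for each word, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def tfidf(doc, df, n_docs):
    """Second pass: unit-normalized TFIDF vector for one tokenized document.

    Assumed weighting: log(1 + tf) * log(n_docs / df). Words with no
    recorded document frequency are skipped.
    """
    tf = Counter(doc)
    vec = {
        w: math.log(1 + c) * math.log(n_docs / df[w])
        for w, c in tf.items()
        if df[w] > 0
    }
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm > 0 else vec


if __name__ == "__main__":
    docs = [["a", "b"], ["a", "c"]]
    df = document_frequencies(docs)
    print(tfidf(docs[0], df, len(docs)))
```

A word that occurs in every document gets weight zero under this variant, which is why common words contribute little to a Rocchio prototype built by averaging such vectors per class.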
* Zipf's law and the prevalence of rare features/words<br />
<br />
* Communication complexity<br />
* Stream and sort<br />
** Complexity of merge sort<br />
** How pipes implement parallel processing<br />
** How buffering output before a sort can improve performance<br />
** How stream-and-sort can implement event-counting for naive Bayes</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Guest_lecture_-_tentativeClass meeting for 10-405 Guest lecture - tentative2018-02-28T18:48:01Z<p>Wcohen: </p>
<hr />
<div>This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018|Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
[http://www.cs.cmu.edu/~kijungs/etc/10-405.pdf Large Scale Matrix/Tensor Factorization - PDF]</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Guest_lecture_-_tentativeClass meeting for 10-405 Guest lecture - tentative2018-02-28T18:47:45Z<p>Wcohen: /* Slides */</p>
<hr />
<div><br />
<br />
<br />
<br />
This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-405 in Spring 2018|schedule]] for the course [[Machine Learning with Large Datasets 10-405 in Spring 2018|Machine Learning with Large Datasets 10-405 in Spring 2018]].<br />
<br />
=== Slides ===<br />
<br />
[http://www.cs.cmu.edu/~kijungs/etc/10-405.pdf Large Scale Matrix/Tensor Factorization - PDF]</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Class_meeting_for_10-405_Guest_lecture_-_tentativeClass meeting for 10-405 Guest lecture - tentative2018-02-28T18:46:44Z<p>Wcohen: Created page with " === Slides === [http://www.cs.cmu.edu/~kijungs/etc/10-405.pdf PDF]"</p>
<hr />
<div><br />
<br />
=== Slides ===<br />
<br />
[http://www.cs.cmu.edu/~kijungs/etc/10-405.pdf PDF]</div>Wcohenhttp://curtis.ml.cmu.edu/w/courses/index.php/Syllabus_for_Machine_Learning_with_Large_Datasets_10-405_in_Spring_2018Syllabus for Machine Learning with Large Datasets 10-405 in Spring 20182018-02-28T15:05:18Z<p>Wcohen: /* Schedule */</p>
<hr />
<div>This is the syllabus for [[Machine Learning with Large Datasets 10-405 in Spring 2018]]. <br />
<br />
== Ideas for extensions to the HW assignments ==<br />
<br />
This is not a complete list! You can use any of these as a starting point, but feel free to think up your own extensions.<br />
<br />
HW2 (NB in GuineaPig):<br />
<br />
* The assignment proposes one particular scheme for parallelizing the training/testing algorithm. Consider another parallelization algorithm.<br />
* Implement a similarly scalable Rocchio algorithm and compare it with NB.<br />
* Reimplement the same algorithm in Spark (or some other dataflow language) and compare.<br />
<br />
HW3 (Logistic regression and SGD)<br />
* Evaluate the hash trick for Naive Bayes systematically on a series of datasets.<br />
* Implement a parameter-mixing version of logistic regression and evaluate it.<br />
* A [https://www.aclweb.org/anthology/P12-2018 recent paper] proposes (roughly) using SVM with NB-transformed features. Implement this and compare.<br />
* The personalization method described in class is based on [https://www.umiacs.umd.edu/~hal/docs/daume07easyadapt.pdf a transfer learning method] which works similarly. Many Wikipedia pages are available in multiple languages, and words in related languages tend to be lexically similar (e.g., "astrónomo" is Spanish for "astronomer"). Suppose features were character n-grams (e.g., "astr", "stro", "tron", ...) - does domain transfer work for the task of classifying Wikipedia pages? Construct a dataset and experiment to test this hypothesis.<br />
<br />
=== Notes ===<br />
<br />
* Homeworks, unless otherwise posted, will be due when the next HW comes out.<br />
* Lecture notes and/or slides will be (re)posted around the time of the lectures.<br />
<br />
=== Schedule ===<br />
<br />
* Wed Jan 17, 2018 [[Class meeting for 10-405 Overview|Overview]]. Grading policies, etc., History of Big Data, Complexity theory and cost of important operations<br />
* Mon Jan 22, 2018 [[Class meeting for 10-405 Probability Review|Probability Review]]. Counting for big data and density estimation, streaming Naive Bayes, Rocchio and TFIDF<br />
** '''Start work on''' Assignment 1a: Streaming NB; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1a.pdf <br />
* Wed Jan 24, 2018 [[Class meeting for 10-405 Streaming Naive Bayes|Streaming Naive Bayes]]. Notes on scalable naive Bayes, Alternatives to stream and sort, Local counting in stream and sort, Stream and sort examples<br />
* Mon Jan 29, 2018 [[Class meeting for 10-405 Hadoop Overview|Hadoop Overview]]. Intro to Hadoop, Hadoop Streaming, Debugging Hadoop, Combiners<br />
** '''Start work on''' Assignment 1b: Streaming NB on Hadoop; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw1b.pdf <br />
* Wed Jan 31, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 1]]. Scalable classification, Abstracts for map-reduce algorithms, Joins in Hadoop<br />
* Mon Feb 5, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 2]]. Guinea Pig intro, Similarity joins, Similarity joins with TFIDF, Parallel simjoins<br />
** '''Start work on''' Assignment 2a: Naive Bayes training in Guinea Pig; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2a.pdf <br />
* Wed Feb 7, 2018 [[Class meeting for 10-405 Workflows For Hadoop|Workflows For Hadoop 3]]. PageRank, PageRank in Pig and Guinea Pig, K-means in Pig, Spark, Systems built on top of Hadoop<br />
* Mon Feb 12, 2018 [[Class meeting for 10-405 SGD and Hash Kernels|SGD and Hash Kernels]]. Learning as optimization, Logistic regression with SGD, Regularized SGD, Efficient regularized SGD, Hash kernels for logistic regression<br />
** '''Start work on''' Assignment 2b: Naive Bayes testing in Guinea Pig; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw2b.pdf <br />
* Wed Feb 14, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 1]]. The "delta trick", Averaged perceptrons, Debugging ML algorithms<br />
* Mon Feb 19, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 2]]. Hash kernels, Ranking perceptrons, Structured perceptrons<br />
** '''Start work on''' Assignment 3: scalable SGD; http://www.cs.cmu.edu/~wcohen/10-405/assignments/hw3.pdf<br />
* Wed Feb 21, 2018 [[Class meeting for 10-405 Parallel Perceptrons|Parallel Perceptrons 3]]. Iterative parameter mixing paper, Parallel SGD via Param Mixing<br />
* Mon Feb 26, 2018 [[Class meeting for 10-405 SGD for MF|SGD for MF]]. Matrix factorization, Matrix factorization with SGD, distributed matrix factorization with SGD<br />
* Wed Feb 28, 2018 [[Class meeting for 10-405 Guest lecture - tentative|Guest lecture]] - [http://www.cs.cmu.edu/~kijungs/ Kijung Shin]<br />
* Mon Mar 5, 2018 [[Class meeting for 10-405 Midterm review and catchup|Midterm review and catchup]]. Midterm review<br />
** '''Previous assignment due'''<br />
* Wed Mar 7, 2018 [[Class meeting for 10-405 Midterm|Midterm]]. <br />
* Mon Mar 19, 2018 [[Class meeting for 10-405 Computing with GPUs|Computing with GPUs]]. Introduction to GPUs, CUDA, Vectorization<br />
** '''Start work on''' Assignment 4: Autodiff with IPM part 1/2; Draft at http://www.cs.cmu.edu/~wcohen/10-405/assignments/2016-fall/hw-5-autodiff/main.pdf<br />
* Wed Mar 21, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 1]]. Deep learning intro, BackProp following Nielsen, Expressiveness of MLPs, Deep learning and GPUs, Exploding and vanishing gradients, Modern deep learning models<br />
* Mon Mar 26, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 2]]. Reverse-mode differentiation (autodiff), Some systems using autodiff, Details on Wengert lists, Breakdown of xman.py<br />
* Wed Mar 28, 2018 [[Class meeting for 10-405 Deep Learning|Deep Learning 3]]. Inputs, parameters, updates, Word2vec and GloVe, Recursive ANNs, Convolutional ANNs, Architectures using RNNs<br />
* Mon Apr 2, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 1]]. Bloom filters, The countmin sketch, CM Sketches in Deep Learning<br />
** '''Start work on''' Assignment 5: Autodiff with IPM part 2/2<br />
* Wed Apr 4, 2018 [[Class meeting for 10-405 Randomized Algorithms|Randomized Algorithms 2]]. Review of Bloom filters, Locality sensitive hashing, Online LSH<br />
* Mon Apr 9, 2018 [[Class meeting for 10-405 Graph Architectures for ML|Graph Architectures for ML]]. Graph-based ML architectures, Pregel, Signal-collect, GraphLab, PowerGraph, GraphChi, GraphX<br />
** '''Start work on''' Assignment 6: SSL in Spark<br />
* Wed Apr 11, 2018 [[Class meeting for 10-405 SSL on Graphs|SSL on Graphs]]. Semi-supervised learning intro, Multirank-walk SSL method, Harmonic fields, Modified Adsorption SSL method, MAD with countmin sketches<br />
* Mon Apr 16, 2018 [[Class meeting for 10-405 LDA|LDA 1]]. DGMs for naive Bayes, Gibbs sampling for LDA<br />
* Wed Apr 18, 2018 [[Class meeting for 10-405 LDA|LDA 2]]. Parallelizing LDA, Fast sampling for LDA, DGMs for graphs<br />
* Mon Apr 23, 2018 [[Class meeting for 10-405 Parameter Servers|Parameter Servers]]. Parameter servers, PS vs Hadoop, Stale Synchronous Parallel (SSP) model, Managed Communication in PS, LDA Sampler with PS<br />
** '''Previous assignment due'''<br />
* Wed Apr 25, 2018 [[Class meeting for 10-405 Unsupervised Learning On Graphs|Unsupervised Learning On Graphs]]. Spectral clustering, Power iteration clustering, Label propagation for clustering non-graph data, Label propagation for SSL on non-graph data<br />
* Mon Apr 30, 2018 [[Class meeting for 10-405 Review session for final|Review session for final]]. <br />
* Wed May 2, 2018 [[Class meeting for 10-405 Final Exam|Final Exam]].</div>Wcohen