Difference between revisions of "Class meeting for 10-605 Workflows For Hadoop"

From Cohen Courses
 
(13 intermediate revisions by the same user not shown)
Line 1: Line 1:
This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2016|schedule]] for the course [[Machine Learning with Large Datasets 10-605 in Fall_2016]].
+
This is one of the class meetings on the [[Syllabus for Machine Learning with Large Datasets 10-605 in Fall 2017|schedule]] for the course [[Machine Learning with Large Datasets 10-605 in Fall_2017]].
  
 
=== Slides ===
 
=== Slides ===
  
* First lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-605/workflow-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-605/workflow-1.pdf in PDF].
+
* First lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-605/workflows-1.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-605/workflows-1.pdf in PDF].
* Second lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-605/workflow-2.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-605/workflow-2.pdf in PDF].
+
* Second lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-605/workflows-2.pptx in Powerpoint], [http://www.cs.cmu.edu/~wcohen/10-605/workflows-2.pdf in PDF].
 +
* Third lecture: Slides [http://www.cs.cmu.edu/~wcohen/10-605/workflows-3.pptx in Powerpoint],  [http://www.cs.cmu.edu/~wcohen/10-605/workflows-3.pdf in PDF].
  
=== Quiz ===
+
* Catchup on simjoins: Slides [http://www.cs.cmu.edu/~wcohen/10-605/simjoins-catchup.pptx in Powerpoint],  [http://www.cs.cmu.edu/~wcohen/10-605/simjoins-catchup.pdf in PDF].
  
* [https://qna-app.appspot.com/edit_new.html#/pages/view/aglzfnFuYS1hcHByGQsSDFF1ZXN0aW9uTGlzdBiAgICwmqv_Cww Quiz] for first lecture.
+
=== Quizzes ===
 +
 
 +
* [https://qna.cs.cmu.edu/#/pages/view/170 Quiz for first lecture]
 +
* [https://qna.cs.cmu.edu/#/pages/view/175 Quiz for second lecture]
 +
* [https://qna.cs.cmu.edu/#/pages/view/178 Quiz for third lecture]
  
 
=== Readings ===
 
=== Readings ===
  
 
* Pig: none required.  A nice on-line resource for PIG is the on-line version of the O'Reilly Book [http://chimera.labs.oreilly.com/books/1234000001811/index.html Programming Pig].
 
* Pig: none required.  A nice on-line resource for PIG is the on-line version of the O'Reilly Book [http://chimera.labs.oreilly.com/books/1234000001811/index.html Programming Pig].
 
+
*Optional: [http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html Introduction to Information Retrieval], by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly [http://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html self-contained chapter on the vector space model], including Rocchio's method.
=== Readings for the Class ===
 
*[http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html Introduction to Information Retrieval], by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly [http://nlp.stanford.edu/IR-book/html/htmledition/scoring-term-weighting-and-the-vector-space-model-1.html self-contained chapter on the vector space model], including Rocchio's method.
 
  
 
=== Also discussed ===
 
=== Also discussed ===
Line 26: Line 29:
  
 
* The TFIDF representation for documents.
 
* The TFIDF representation for documents.
* The Rocchio algorithm.
+
* What dataflow languages are, what sort of abstract operations they use, and what the complexity of these operations is.
* Why Rocchio is easy to parallelize.
+
* How joins are implemented in dataflow (and the difference between map-side and reduce-side joins)
 
+
* What the PageRank algorithm is
 +
* Common ways of representing graphs in a map-reduce system
 +
** A list of edges
 +
** A list of nodes with outlinks
 +
* Why iteration is often expensive in pure dataflow algorithms.
 +
* How Spark differs from and/or is similar to other dataflow systems
 +
** Actions/transformations
 +
** RDDs
 +
** Caching
 
* Definition of a similarity join/soft join.
 
* Definition of a similarity join/soft join.
 
* Why inverted indices make TFIDF representations useful for similarity joins
 
* Why inverted indices make TFIDF representations useful for similarity joins
 
** e.g., whether high-IDF words have shorter or longer indices, and more or less impact in a similarity measure
 
** e.g., whether high-IDF words have shorter or longer indices, and more or less impact in a similarity measure

Latest revision as of 12:37, 19 September 2017

This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-605 in Fall_2017.

Slides

Quizzes

Readings

Also discussed

Things to Remember

  • The TFIDF representation for documents.
  • What dataflow languages are, what sort of abstract operations they use, and what the complexity of these operations is.
  • How joins are implemented in dataflow (and the difference between map-side and reduce-side joins)
  • What the PageRank algorithm is
  • Common ways of representing graphs in a map-reduce system
    • A list of edges
    • A list of nodes with outlinks
  • Why iteration is often expensive in pure dataflow algorithms.
  • How Spark differs from and/or is similar to other dataflow systems
    • Actions/transformations
    • RDDs
    • Caching
  • Definition of a similarity join/soft join.
  • Why inverted indices make TFIDF representations useful for similarity joins
    • e.g., whether high-IDF words have shorter or longer indices, and more or less impact in a similarity measure
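
The TFIDF and similarity-join items above can be illustrated with a small single-machine sketch. It is not taken from the course slides. The weighting log(1 + tf) * log(N / df), the whitespace tokenizer, the function names, and the toy records are all assumptions chosen for illustration. It shows why high-IDF words help: a rare word has a short posting list in the inverted index, so it is cheap to look up, and it carries a large weight in the cosine similarity.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Map each doc id to a unit-length TFIDF vector (dict: term -> weight).

    Assumed weighting: log(1 + tf) * log(N / df), one common variant.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        vec = {t: math.log(1 + c) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Drop zero-weight terms (terms that occur in every document).
        vectors[doc_id] = (
            {t: w / norm for t, w in vec.items() if w > 0} if norm > 0 else {}
        )
    return vectors


def build_index(vectors):
    """Inverted index: term -> list of (doc id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vec in vectors.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index


def similarity_join(left, right, threshold):
    """Return (left key, right key, cosine) for pairs scoring >= threshold."""
    # IDF is computed over both relations together (an assumption).
    docs = {("L", k): v.lower().split() for k, v in left.items()}
    docs.update({("R", k): v.lower().split() for k, v in right.items()})
    vectors = tfidf_vectors(docs)
    right_index = build_index(
        {k: vec for (side, k), vec in vectors.items() if side == "R"}
    )
    results = []
    for (side, key), vec in vectors.items():
        if side != "L":
            continue
        scores = defaultdict(float)
        # Only right-side records sharing at least one term are ever touched.
        for term, w in vec.items():
            for rkey, rw in right_index.get(term, []):
                scores[rkey] += w * rw
        results.extend((key, rkey, s) for rkey, s in scores.items() if s >= threshold)
    return sorted(results, key=lambda r: -r[2])


# Hypothetical toy relations, invented for this example.
left = {1: "carnegie mellon university", 2: "university of pittsburgh"}
right = {
    "a": "carnegie mellon univ",
    "b": "pittsburgh university",
    "c": "stanford university",
}
for l_key, r_key, score in similarity_join(left, right, threshold=0.4):
    print(l_key, r_key, round(score, 3))
```

In this toy data the word "university" occurs in four of the five records, so it has a low IDF and a long posting list, and it contributes little to any score. Matches are driven by rarer words such as "carnegie" and "pittsburgh". A map-reduce or Pig version would express the same idea as a join of the two relations on the term field, followed by a group-and-sum over record pairs.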