Class meeting for 10-405 Workflows For Hadoop

Latest revision as of 11:12, 5 March 2018

This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-405 in Spring 2018.

Slides

Quizzes

Readings

Also discussed

Things to Remember

  • Combiners and how/when they improve efficiency
  • What dataflow languages are, what sort of abstract operations they use, and what the complexity of these operations is.
  • How joins are implemented in dataflow
    • The difference between map-side and reduce-side joins and how they are implemented
    • When to use map-side vs reduce-side joins
  • Definition of a similarity join/soft join.
  • Complexity of operations like similarity join, TF-IDF computation, etc.
  • What the PageRank algorithm is
  • Common ways of representing graphs in a map-reduce system
    • A list of edges
    • A list of nodes with outlinks
  • Why iteration is often expensive in pure dataflow algorithms.
  • How Spark differs from, and is similar to, other dataflow systems
    • Actions/transformations
    • RDDs
    • Caching
  • How to implement k-means in a map-reduce setting with dataflow
    • Not discussed in class, but in the slide deck!
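
As a study aid for the join items above, here is a minimal in-memory sketch of the two join strategies. It is not Hadoop or Spark code: the function names `reduce_side_join` and `map_side_join` and the sample data are hypothetical, and relations are assumed to be plain lists of (key, value) pairs. The reduce-side version tags every record with its source and groups by key, which stands in for the shuffle. The map-side version assumes one relation is small enough to hold in memory on every mapper, so no shuffle is needed.

```python
from collections import defaultdict


def reduce_side_join(left, right):
    """Join two lists of (key, value) pairs by simulating map, shuffle, and reduce."""
    # Map phase: tag each record with the relation it came from.
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]

    # Shuffle phase (simulated): bring all records with the same key together.
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)

    # Reduce phase: for each key, pair every left value with every right value.
    joined = []
    for key in sorted(groups):
        lefts = [v for tag, v in groups[key] if tag == "L"]
        rights = [v for tag, v in groups[key] if tag == "R"]
        joined.extend((key, lv, rv) for lv in lefts for rv in rights)
    return joined


def map_side_join(big, small):
    """Join by loading the small relation into an in-memory lookup table.

    Assumption: `small` fits in memory on every mapper, so each record of
    `big` can be joined where it is read, with no shuffle.
    """
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)

    # Map phase only: stream over the big relation and probe the table.
    return [(k, bv, sv) for k, bv in big for sv in lookup.get(k, [])]


if __name__ == "__main__":
    docs = [("d1", "alpha"), ("d2", "beta"), ("d1", "gamma")]  # hypothetical data
    labels = [("d1", "pos"), ("d3", "neg")]                    # hypothetical data
    print(reduce_side_join(docs, labels))
    print(map_side_join(docs, labels))
```

Both functions return the same inner-join result. The trade-off is that the reduce-side join works for inputs of any size but pays for a shuffle of both relations, while the map-side join avoids the shuffle and only applies when one side is small.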