Difference between revisions of "Class meeting for 10-405 Workflows For Hadoop"

From Cohen Courses
=== Quizzes ===

* [https://qna.cs.cmu.edu/#/pages/view/170 Quiz for first lecture]
+ * [https://qna.cs.cmu.edu/#/pages/view/244 Quiz for first lecture]
* [https://qna.cs.cmu.edu/#/pages/view/175 Quiz for second lecture]
* [https://qna.cs.cmu.edu/#/pages/view/178 Quiz for third lecture]

=== Readings ===

Revision as of 13:14, 28 January 2018

This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-405 in Spring 2018.

Slides

Quizzes

Readings

Also discussed

Things to Remember

  • The TFIDF representation for documents.
  • What dataflow languages are, what sort of abstract operations they use, and what the complexity of these operations is.
  • How joins are implemented in dataflow (and the difference between map-side and reduce-side joins).
  • What the PageRank algorithm is.
  • Common ways of representing graphs in a map-reduce system:
    • A list of edges
    • A list of nodes with outlinks
  • Why iteration is often expensive in pure dataflow algorithms.
  • How Spark differs from and/or is similar to other dataflow systems
    • Actions/transformations
    • RDDs
    • Caching
  • Definition of a similarity join/soft join.
  • Why inverted indices make TFIDF representations useful for similarity joins
    • e.g., whether high-IDF words have shorter or longer indices, and more or less impact in a similarity measure
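The last two items can be made concrete with a small sketch. The code below is an illustration only, not the method presented in class: the weighting `log(tf + 1) * log(N / df)`, the function names, and the toy data are all assumptions chosen for brevity. It builds unit-length TFIDF vectors, indexes one relation by word, and probes that index with the other relation, so only pairs that share at least one word are ever scored.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Map doc_id -> unit-length TFIDF vector (a dict word -> weight).

    Assumed weighting: log(tf + 1) * log(N / df). Other variants exist.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each word once per document
    vectors = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        v = {w: math.log(c + 1) * math.log(n / df[w]) for w, c in tf.items()}
        norm = math.sqrt(sum(x * x for x in v.values()))
        # Words that occur in every document get weight 0 and are dropped.
        vectors[doc_id] = (
            {w: x / norm for w, x in v.items() if x > 0} if norm > 0 else {}
        )
    return vectors


def build_index(vectors):
    """Inverted index: word -> list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, v in vectors.items():
        for w, x in v.items():
            index[w].append((doc_id, x))
    return index


def similarity_join(left, right, threshold):
    """Return sorted (left_id, right_id, cosine) with cosine >= threshold."""
    # Document frequencies are computed over both relations together.
    combined = {("L", k): t for k, t in left.items()}
    combined.update({("R", k): t for k, t in right.items()})
    vecs = tfidf_vectors(combined)
    index = build_index({k[1]: v for k, v in vecs.items() if k[0] == "R"})
    results = []
    for (side, left_id), v in vecs.items():
        if side != "L":
            continue
        scores = defaultdict(float)
        for w, x in v.items():
            # Only documents sharing word w are touched.
            for right_id, y in index.get(w, ()):
                scores[right_id] += x * y
        results.extend(
            (left_id, right_id, s) for right_id, s in scores.items() if s >= threshold
        )
    return sorted(results)


# Toy data (hypothetical): two small relations of tokenized names.
left = {"a": ["carnegie", "mellon", "university"],
        "b": ["university", "of", "pittsburgh"]}
right = {"x": ["carnegie", "mellon", "univ"],
         "y": ["pittsburgh", "university"],
         "z": ["stanford", "university"]}

for left_id, right_id, score in similarity_join(left, right, threshold=0.3):
    print(left_id, right_id, round(score, 3))
```

In this toy data the common word "university" has a low IDF and a long posting list, so it generates many candidate pairs but adds little to any score. Rarer words such as "carnegie" have short posting lists and dominate the cosine.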