Class meeting for 10-405 Workflows For Hadoop
From Cohen Courses
Latest revision as of 11:12, 5 March 2018
This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-405 in Spring 2018.
Slides
- First lecture: Slides in PowerPoint, in PDF.
- Second lecture: Slides in PowerPoint, in PDF.
- Third lecture: Slides in PowerPoint, in PDF.
Quizzes
Readings
- Pig: none required. A nice online resource for Pig is the online version of the O'Reilly book Programming Pig.
- Optional: Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly self-contained chapter on the vector space model, including Rocchio's method.
Also discussed
- Joachims, Thorsten, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of the International Conference on Machine Learning (ICML), 1997.
- Rocchio, J. J., Relevance Feedback in Information Retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall Inc., 1971.
- Schapire et al., Boosting and Rocchio Applied to Text Filtering, SIGIR 1998.
Things to Remember
- Combiners and how/when they improve efficiency
- What dataflow languages are, what sort of abstract operations they use, and what the complexity of these operations is.
- How joins are implemented in dataflow
  - The difference between map-side and reduce-side joins and how they are implemented
  - When to use map-side vs reduce-side joins
- Definition of a similarity join/soft join.
- Complexity of operations like similarity join, TFIDF computation, etc.
- What the PageRank algorithm is
- Common ways of representing graphs in a map-reduce system
  - A list of edges
  - A list of nodes with outlinks
- Why iteration is often expensive in pure dataflow algorithms.
- How Spark differs from and/or is similar to other dataflow systems
  - Actions/transformations
  - RDDs
  - Caching
- How to implement k-means in a map-reduce setting with dataflow
  - Not discussed in class, but in the slide deck!
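
The combiner and k-means items above can be illustrated together. What follows is a minimal single-process sketch in plain Python, not the version from the slide deck: it simulates one k-means iteration as a map, combine, and reduce pass. All function names (`nearest`, `map_phase`, `combine`, `reduce_phase`, `kmeans_step`) are hypothetical, and the "shards" are ordinary lists standing in for input splits. The combiner works here because summing vectors and counts is associative and commutative, so each mapper can ship one partial sum per centroid instead of one record per point.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(shard, centroids):
    """Map: emit (centroid_index, (point, 1)) for every point in one shard."""
    return [(nearest(p, centroids), (p, 1)) for p in shard]


def combine(pairs):
    """Combiner: add up vectors and counts per key locally, before the shuffle."""
    acc = {}
    for key, (vec, n) in pairs:
        if key in acc:
            s, c = acc[key]
            acc[key] = ([a + b for a, b in zip(s, vec)], c + n)
        else:
            acc[key] = (list(vec), n)
    return list(acc.items())


def reduce_phase(pairs, centroids):
    """Reduce: merge partial sums per centroid, then divide by the total count."""
    merged = dict(combine(pairs))  # same associative merge the combiner uses
    new_centroids = []
    for i, old in enumerate(centroids):
        if i in merged:
            s, c = merged[i]
            new_centroids.append([x / c for x in s])
        else:
            new_centroids.append(list(old))  # empty cluster: leave it in place
    return new_centroids


def kmeans_step(shards, centroids):
    """One iteration: map and combine per shard, then a single reduce."""
    shuffled = []
    for shard in shards:
        shuffled.extend(combine(map_phase(shard, centroids)))
    return reduce_phase(shuffled, centroids)


if __name__ == "__main__":
    shards = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]
    centroids = [[0.0, 0.0], [10.0, 10.0]]
    for _ in range(3):  # each pass would be a separate job in pure map-reduce
        centroids = kmeans_step(shards, centroids)
    print(centroids)
```

Running k-means to convergence means calling `kmeans_step` repeatedly. In a pure map-reduce setting each call would typically be a separate job that rereads the points from storage, which is one way to see why iteration is expensive there and why caching the input between iterations helps.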