Class meeting for 10-405 Workflows For Hadoop
From Cohen Courses
Latest revision as of 11:12, 5 March 2018
This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-405 in Spring 2018.
Slides
- First lecture: Slides in Powerpoint, in PDF.
- Second lecture: Slides in Powerpoint, in PDF.
- Third lecture: Slides in Powerpoint, in PDF.
Quizzes
Readings
- Pig: none required. A nice online resource for Pig is the online version of the O'Reilly book Programming Pig.
- Optional: Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze, has a fairly self-contained chapter on the vector space model, including Rocchio's method.
Also discussed
- Joachims, Thorsten, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of International Conference on Machine Learning (ICML), 1997.
- Rocchio, J. J., Relevance Feedback in Information Retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice Hall Inc., 1971.
- Schapire et al., Boosting and Rocchio Applied to Text Filtering, SIGIR 1998.
Things to Remember
- Combiners and how/when they improve efficiency
- What dataflow languages are, what sort of abstract operations they use, and what the complexity of these operations is.
- How joins are implemented in dataflow
  - The difference between map-side and reduce-side joins and how they are implemented
  - When to use map-side vs reduce-side joins
- Definition of a similarity join/soft join.
- Complexity of operations like similarity join, TFIDF computation, etc.
- What the PageRank algorithm is
- Common ways of representing graphs in a map-reduce system
  - A list of edges
  - A list of nodes with outlinks
- Why iteration is often expensive in pure dataflow algorithms.
- How Spark differs from and/or is similar to other dataflow algorithms
  - Actions/transformations
  - RDDs
  - Caching
- How to implement k-means in a map-reduce setting with dataflow
  - Not discussed in class, but in the slide deck!
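
As a rough illustration of the k-means item above, here is a minimal sketch of how one iteration might be phrased as a map step followed by a reduce step. It is written in plain Python rather than Hadoop, Pig, or Spark, and it is not taken from the slide deck. The function names (`nearest`, `map_phase`, `reduce_phase`, `kmeans`) are hypothetical, and squared Euclidean distance with a fixed iteration count is an assumption made for the example.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit (cluster_id, (point, 1)) for every input point."""
    for point in points:
        yield nearest(point, centroids), (point, 1)


def reduce_phase(pairs):
    """Reduce: sum the vectors and counts per cluster id, then average."""
    sums, counts = {}, defaultdict(int)
    for cid, (vec, n) in pairs:
        if cid in sums:
            sums[cid] = [a + b for a, b in zip(sums[cid], vec)]
        else:
            sums[cid] = list(vec)
        counts[cid] += n
    return {cid: [s / counts[cid] for s in sums[cid]] for cid in sums}


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds (assumed stopping rule)."""
    for _ in range(iterations):
        new = reduce_phase(map_phase(points, centroids))
        # A centroid that attracted no points keeps its previous position.
        centroids = [new.get(i, c) for i, c in enumerate(centroids)]
    return centroids


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
print(kmeans(points, [(0, 0), (10, 10)]))
```

Each pass over the data is one map-reduce job, so the loop also illustrates why iteration tends to be expensive in pure dataflow: the points are re-read on every round unless the system can cache them. Because the map step emits (vector, count) pairs, the partial sums could in principle be merged early by a combiner before the final averaging.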