Class meeting for 10-405 Hadoop Overview
From Cohen Courses
This is one of the class meetings on the schedule for the course Machine Learning with Large Datasets 10-405 in Spring 2018.
Slides
Map-reduce overview:
Quiz
Readings for the Class
- There are many online tutorials for Hadoop. The O'Reilly book is also quite good. You might also look at this annotated log of me interacting with streaming Hadoop.
Things to Remember
- Hadoop terminology: HDFS, shards, job tracker, combiner, mapper, reducer, ...
- The primary phases of a map-reduce computation, and what happens in each
  - Map
  - Shuffle/sort
  - Reduce
- Where data might be transmitted across the network
- How data is stored in Hadoop
- Consequences of large block size for streaming and storage efficiency
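
To make the phases listed above concrete, here is a minimal single-process sketch in Python of a word-count job. It is not Hadoop code and does not use any Hadoop API: the function names (mapper, combiner, shuffle_sort, reducer, word_count) and the toy shards are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Combine: pre-sum counts for one mapper's output before the shuffle."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def shuffle_sort(mapper_outputs):
    """Shuffle/sort: merge all mapper outputs and sort by key,
    so that pairs with the same key end up adjacent."""
    merged = [pair for output in mapper_outputs for pair in output]
    return sorted(merged, key=itemgetter(0))


def reducer(sorted_pairs):
    """Reduce: sum the counts within each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(shards):
    """Run one mapper (plus combiner) per shard, then shuffle and reduce."""
    mapper_outputs = []
    for shard in shards:
        pairs = [pair for line in shard for pair in mapper(line)]
        mapper_outputs.append(combiner(pairs))
    return dict(reducer(shuffle_sort(mapper_outputs)))


# Hypothetical toy input: two shards, each a list of text lines.
shards = [["the cat sat", "the cat"], ["the dog sat"]]
counts = word_count(shards)
print(counts)
```

In this sketch everything runs in one process. On a real cluster, the mappers and reducers generally run on different machines, so the shuffle/sort step is where intermediate pairs are sent across the network. The combiner runs on the mapper side, which is why it can reduce the amount of data transmitted.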