Guide for Happy Hadoop Hacking

From Cohen Courses

Writing programs to process large amounts of data is notoriously difficult. The difference between successful algorithms on large and small data is night and day. You could wait weeks to sort terabytes of information on your server, or you could sort 1.42 TB every minute with an industrial-sized Hadoop cluster (check out http://sortbenchmark.org/).

Similarly, there is a stark contrast between effective and ineffective development strategies when working with large datasets. In the rest of this document, we describe a general workflow that will help you rapidly develop correct big data crunching programs.


Development Workflow

How long does it take to start up a 10-node cluster on Amazon AWS? A lot longer than it takes to run a Hadoop unit test on your Map and Reduce functions. How expensive is a NullPointerException? Either real money or a second, depending on whether it happens on AWS or in a unit test.

Executing your freshly written Map and Reduce functions on the full data with a 100-node Elastic MapReduce (EMR) cluster is a terrible idea. Not only will it take 10 minutes or more for AWS to allocate the computing resources, but chances are that your program will fail and you will be out real dollars. When developing programs for Hadoop, it’s of the utmost importance to work from small to big. You will save lots of time by keeping a tight development loop in which you test your code on ever-increasing amounts of data to ensure correctness, proper error handling, and scalability.
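The small-to-big loop can be sketched without a cluster at all. The example below is a minimal illustration in plain Python, not Hadoop API code: map_words, reduce_counts, and run_local are hypothetical names, and the word-count job and the in-memory "shuffle" are assumptions chosen only to show the same logic running on 1, 10, 100, and 1000 input lines.

```python
from collections import defaultdict
from itertools import islice


def map_words(line):
    """Map step (hypothetical word-count job): emit (word, 1) per token."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: sum all counts collected for one word."""
    return word, sum(counts)


def run_local(lines, limit):
    """Run map, an in-memory shuffle, and reduce on the first `limit` lines."""
    groups = defaultdict(list)
    for line in islice(lines, limit):
        for key, value in map_words(line):
            groups[key].append(value)  # group values by key, like the shuffle
    return dict(reduce_counts(key, values) for key, values in groups.items())


if __name__ == "__main__":
    # Stand-in data; in practice this would be a sample of the real input.
    data = ["the quick brown fox", "the lazy dog", "the end"] * 1000
    for limit in (1, 10, 100, 1000):
        result = run_local(data, limit)
        print(limit, "lines ->", len(result), "distinct words")
```

Only once the job behaves correctly at each of these sizes on your own machine does it make sense to pay for a small cluster, and later a large one.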

Here’s a list of ideas and techniques that you should keep in mind during your Hadoop development:

  • Hadoop unit tests
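To make the unit-testing idea concrete, here is a minimal sketch using Python's standard unittest module. It does not use any Hadoop testing library; map_temperature and reduce_max are hypothetical functions for an assumed "station,temperature" input format, written as plain functions so they can be checked in well under a second. The point is that the malformed-record cases (the analogue of the NullPointerException above) are caught locally rather than on a paid cluster.

```python
import unittest


def map_temperature(line):
    """Map step (hypothetical): parse 'station,temperature' into (station, temp).

    Malformed records are skipped instead of raising, so a single bad line
    cannot kill a long-running job.
    """
    if line is None:
        return []
    parts = line.strip().split(",")
    if len(parts) != 2:
        return []
    station, raw = parts
    try:
        return [(station, float(raw))]
    except ValueError:
        return []


def reduce_max(station, temps):
    """Reduce step (hypothetical): keep the highest temperature per station."""
    return station, max(temps)


class MapReduceTest(unittest.TestCase):
    def test_map_valid_record(self):
        self.assertEqual(map_temperature("sfo,21.5\n"), [("sfo", 21.5)])

    def test_map_skips_malformed_records(self):
        for bad in (None, "", "garbage", "sfo,not-a-number", "a,b,c"):
            self.assertEqual(map_temperature(bad), [])

    def test_reduce_takes_maximum(self):
        self.assertEqual(reduce_max("sfo", [18.0, 21.5, 19.2]), ("sfo", 21.5))
```

Saved under a file name of your choice, the tests can be run with `python -m unittest <module name>`.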