Guide for Happy Hadoop Hacking
Writing programs to process large amounts of data is notoriously difficult. The difference between successful algorithms on large and small data is night and day. You could wait weeks to sort terabytes of information on your server. Or you could sort 1.42 TB every minute with an industrial-sized Hadoop cluster (check out http://sortbenchmark.org/).
Similarly, there is a stark contrast between effective and ineffective development strategies when working with large datasets. In the rest of this document, we will describe a general workflow that will help you rapidly develop correct big data crunching programs.
Development Workflow
How long does it take to start up a 10 node cluster on Amazon AWS? A lot longer than it does to execute a Hadoop unit test on your Map and Reduce functions. How expensive is a NullPointerException? Either real money or a second, depending on whether it happens on AWS or in a unit test.
Executing your freshly written Map and Reduce functions on the full data with a 100 node Elastic MapReduce (EMR) cluster is a terrible idea. Not only will it take 10 minutes or more for AWS to allocate the computing resources, but chances are that your program will fail and you will be out real dollars. In developing programs for Hadoop, it’s of the utmost importance to work from small to big. You will save lots of time by having a tight development loop where you test your code on ever-increasing amounts of data to ensure correctness, proper error handling, and scalability.
Here’s a list of ideas and techniques that you should keep in mind during your Hadoop development:
Hadoop unit tests
- Use MRUnit (https://cwiki.apache.org/confluence/display/MRUNIT/MRUnit+Tutorial)
- Tests should be as fast as possible, ideally fast enough to run each time you make a version control check-in (always use VCS!)
- Use them to verify the correctness of your Map and Reduce functions on sensible data. Also use them to verify that you handle bad or improperly formatted data correctly.
- Keep the tiny amount of data a unit test needs either in a file at a fixed relative path within the project or hard-coded in the program (e.g. a static string).
- Requires hadoop-core-1.2.1.jar and mrunit-1.0.0-hadoop1.jar to be on the classpath
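MRUnit itself is a Java library. As a language-neutral illustration of the same habits (fast tests, hard-coded data, explicit checks for bad input), here is a minimal sketch using Python's built-in unittest module. The mapper map_line and its tab-separated "user, count" record format are hypothetical and only serve the example; this is not MRUnit's API.

```python
import unittest


def map_line(line):
    """Hypothetical mapper: turn 'user<TAB>count' into a (user, int) pair.

    Returns None for malformed records instead of raising.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        return None
    user, count = parts
    try:
        return (user, int(count))
    except ValueError:
        return None


# Tiny test data lives right in the test file, as static strings.
GOOD = "alice\t3\n"
BAD_LINES = ["", "alice", "alice\tthree", "a\tb\tc"]


class MapLineTest(unittest.TestCase):
    def test_good_record(self):
        self.assertEqual(map_line(GOOD), ("alice", 3))

    def test_bad_records_are_skipped(self):
        for line in BAD_LINES:
            self.assertIsNone(map_line(line))
```

Saved to a file, a test like this can be run with python -m unittest and should finish in well under a second, which is what makes it cheap enough to run on every check-in.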
Local development
- Run your map and reduce functions on a small (~1-5 MB) random sample of data (e.g. cat sample | ./map | sort | ./reduce -- Unix pipes effectively simulate Hadoop streaming, with sort standing in for the shuffle between map and reduce).
- The compile, run, and see-results loop should take no more than 15 seconds (decrease the data size to keep this dev loop fast!).
- Use this test to see how your Map & Reduce functions work on real data. You’ll do most of your debugging in this phase.
- Requires hadoop-core-1.2.1.jar to be on classpath
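The pipe-based loop above can also be sketched inside a single process. The following is a minimal illustration, assuming a word-count job; the names mapper, reducer, and run_local are hypothetical. It mimics what streaming does: the mapper emits tab-separated key/value records, the records are sorted by key, and the reducer sees all records for one key together.

```python
import itertools


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, like a streaming mapper."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum counts per key; input must already be sorted by key."""
    pairs = (record.split("\t") for record in sorted_records)
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate map | sort | reduce in one process."""
    return list(reducer(sorted(mapper(lines))))
```

For an actual streaming job, each function would instead live in its own script that reads sys.stdin and prints to standard output; keeping the logic in plain functions like these makes it easy to exercise both ways.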
Gates Cluster (a general Hadoop cluster with SSH access)
- Run your map and reduce functions in Hadoop streaming mode on a larger (~10% of total) random sample of data
- Should take anywhere from 5 to 15 minutes to complete.
- Checks whether or not your code is working properly and scales well.
- Requires a Hadoop cluster
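One simple way to draw the random samples mentioned above (a few MB for local runs, roughly 10% for the cluster run) is to keep each input line independently with a fixed probability. This is a minimal sketch; the function name sample_lines and the fixed default seed are assumptions made for the example, and the resulting sample size is only approximately the requested fraction.

```python
import random


def sample_lines(lines, fraction, seed=0):
    """Yield each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    for line in lines:
        if rng.random() < fraction:
            yield line
```

Because it is a generator, it can be applied to an open file object line by line without loading the whole dataset into memory.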
Amazon AWS (computing with real money!)
- Do not run your program on AWS first. Not only does AWS cost money, but it also takes a long time to start up an EMR cluster. AWS charges by the hour and rounds usage up to the next full hour.
- If you spin up a cluster only to encounter a NullPointerException in the first minute of execution, you’ll be charged for the entire hour.
- Make your first run on a small sample of data on a trivially sized cluster (1 master and 2 workers; use the micro instances, as they are free). You want to make sure that everything is OK with the AWS configuration.
- Once everything checks out, and you are reasonably sure that things will scale well (thanks to your Gates cluster test), run on the full data.
- Make sure to terminate your cluster when you’re done with it! (In the 10-605 lore, one student racked up a bill in excess of $3,000, all of which went to the student’s personal credit card!)
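To make the billing rule above concrete, here is a small sketch of the arithmetic, assuming every started hour is billed in full for every node. The function name billed_cost and the $0.25 hourly rate used below are hypothetical; check current AWS pricing for real numbers.

```python
import math


def billed_cost(minutes_used, nodes, hourly_rate):
    """Cost when every started hour is billed in full, per node."""
    billed_hours = math.ceil(minutes_used / 60)
    return billed_hours * nodes * hourly_rate


# A crash after 1 minute on 100 nodes costs as much as a full hour.
crash_cost = billed_cost(1, 100, 0.25)
full_hour_cost = billed_cost(60, 100, 0.25)
```

Under these assumptions crash_cost and full_hour_cost are identical, which is why a NullPointerException in the first minute is so much more expensive on a large cluster than in a unit test.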