Amazon Elastic MapReduce information

From Cohen Courses

EMR (Elastic MapReduce) is a popular cloud processing service from Amazon that includes Hadoop. Running Guinea Pig on EMR is not hard, but there are a lot of steps. This is a walkthrough.

Initial setup steps

Setting up your AWS account

1) First, you need an AWS account. If you already have an Amazon account, you can use the same login and password to log into AWS at https://console.aws.amazon.com.

Installing and configuring the command-line tool on your local machine

2) Install the tools: You need to establish the credentials required to use EC2 (the "Elastic Compute Cloud" service, which EMR clusters run on), and then use them to launch new virtual clusters in EMR. I use a command-line program (aka a "CLI") to do this. So first, install that program, the AWS CLI. The details are here, but briefly, go to a convenient directory, say ~/code/aws-cli, and type

 % curl https://s3.amazonaws.com/aws-cli/awscli-bundle.zip > awscli-bundle.zip
 % unzip awscli-bundle.zip
 % ./awscli-bundle/install -i `pwd`/install
 % export PATH=$PATH:~/code/aws-cli/install/bin/

(Known issue: this requires Python 2.6 or newer.) To test this, type aws --version at the command prompt.
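
If aws --version reports "command not found", the usual cause is that the install's bin directory is not on your PATH in the current shell. The check can be wrapped in a small function; this is only a sketch, and check_aws_cli is a hypothetical name, not part of the AWS CLI.

```shell
# Hypothetical helper: report whether the AWS CLI is reachable on PATH.
check_aws_cli() {
    if command -v aws >/dev/null 2>&1; then
        # Merge stderr into stdout in case this release prints the version there.
        aws --version 2>&1
    else
        echo "aws not found; add the install's bin directory to PATH" >&2
        return 1
    fi
}
```

Note that the export PATH line above only affects the current shell; putting it in your shell startup file (for example ~/.bashrc, if you use bash) makes it permanent.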

3) Next, you need to get your access key. An "access key" is a pair of codes, one public (the access key ID) and one private (the secret access key), that the AWS CLI tool uses to authenticate your requests. Follow the directions here, and save the result somewhere safe and private. (Question: is this right? do you need to create a username also? --Wcohen (talk) 15:45, 31 August 2015 (EDT))


4) Then you need to tell the AWS CLI about your access codes. The command for this is 'aws configure': you'll be asked for your codes and some other info, and I used these:

 % aws configure
 AWS Access Key ID [None]: ...
 AWS Secret Access Key [None]:  ...
 Default region name [None]: us-east-1
 Default output format [None]: json

This info is stored by the AWS CLI tool in the .aws directory under your home directory, in files named credentials and config.
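
To confirm that the configuration step took effect, you can look for the files it writes. The sketch below assumes the CLI's default location (~/.aws, with files named credentials and config); show_aws_config is a hypothetical helper name, and the optional argument exists only so the check can be pointed at another directory.

```shell
# Hypothetical helper: confirm that 'aws configure' left its files behind.
# Assumes the default location ~/.aws; pass a directory to override it.
show_aws_config() {
    dir="${1:-$HOME/.aws}"
    for f in credentials config; do
        if [ -f "$dir/$f" ]; then
            echo "found $dir/$f"
        else
            echo "missing $dir/$f"
        fi
    done
}
```

Running aws configure list should also print the values the CLI is currently using, with the keys partly masked.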

Generate a key-pair file, security group, and service role to provide to your clusters

5) Create a key-pair. You'd think one set of codes would be enough, but you're not done yet; you need another set of public/private codes called a "keypair" to interact with the clusters you create. The details are here but the quick version is to use these commands (the second keeps the keys secret).

 % aws ec2 create-key-pair --key-name MyKeyPair --query 'KeyMaterial' --output text > MyKeyPair.pem
 % chmod 600 MyKeyPair.pem
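
One hazard with the redirect above: if the create-key-pair command fails (for instance because a key with that name already exists), the shell has already truncated MyKeyPair.pem, and an earlier copy of the private key is lost. A more defensive sketch writes to a temporary file first. The function name create_key_file and the .tmp suffix are my own choices, not anything the CLI provides.

```shell
# Hypothetical helper: create a key pair without clobbering an existing .pem.
create_key_file() {
    name="$1"
    pem="$name.pem"
    if [ -e "$pem" ]; then
        echo "$pem already exists; not overwriting" >&2
        return 1
    fi
    # umask 077 makes the new file private from the moment it is created,
    # so no separate chmod is needed.
    ( umask 077
      aws ec2 create-key-pair --key-name "$name" \
          --query 'KeyMaterial' --output text > "$pem.tmp" ) || {
        rm -f "$pem.tmp"
        return 1
    }
    mv "$pem.tmp" "$pem"
}
# Example use:
#   create_key_file MyKeyPair
```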

6) Create a security group. This one will let any IP address try to ssh into your cluster (but I believe they need the keypair you use at creation time to be successful). You can specify a narrower range of IPs here instead if you want.

 % aws ec2 create-security-group --group-name MySecurityGroup --description "My security group"
 % aws ec2 authorize-security-group-ingress --group-name MySecurityGroup --protocol tcp --port 22 --cidr 0.0.0.0/0
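
If you would rather not open port 22 to the whole internet, the same ingress command accepts a narrower CIDR block. A minimal sketch follows; allow_ssh_from is a hypothetical wrapper name, and the address range in the example is a documentation-only placeholder that you would replace with your own network.

```shell
# Hypothetical wrapper: open port 22 only to the given CIDR block.
allow_ssh_from() {
    cidr="$1"
    aws ec2 authorize-security-group-ingress \
        --group-name MySecurityGroup \
        --protocol tcp --port 22 --cidr "$cidr"
}
# Example use (203.0.113.0/24 is a placeholder range; substitute your own):
#   allow_ssh_from 203.0.113.0/24
```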

7) Create a service role. Full details are available but here's a short version. It will automatically create a reasonable role and populate your AWS CLI config file with the role & instance profile information.

 % aws emr create-default-roles

Creating and using a Cluster

All the remaining operations need only the emr subcommand of the CLI, which is documented here.

8) Create a cluster. You only need to do steps 1-7 once (although each machine you want to work from will need its own config file), and after that you can create a cluster with just one more command. This command is very customizable, but one version that works is

 % aws emr create-cluster --ami-version 3.8.0  --ec2-attributes KeyName=MyKeyPair \
   --use-default-roles \
   --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

The instance-groups stuff defines the cluster you want - this one is tiny, with three nodes. MyKeyPair, which should be the name of the keypair you created in step 5, is how the new cluster will know whether or not to let you in. This will output something like:

 {
     "ClusterId": "j-JEX5UT60ELD5"
 }

which is the id of the cluster. It will take some time (10min?) to start up, and then you can log into the master node using your keypair as follows, substituting the cluster id of your own cluster:

 % aws emr ssh --cluster-id j-JEX5UT60ELD5 --key-pair-file MyKeyPair.pem
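
To avoid copying the cluster id by hand, the create command can be asked to print only the id, using the CLI's global --query and --output options. This is a sketch that reuses the cluster settings from step 8; start_cluster and CLUSTER_ID are names I made up for the example.

```shell
# Hypothetical wrapper: launch the three-node cluster and print only its id.
start_cluster() {
    aws emr create-cluster --ami-version 3.8.0 \
        --ec2-attributes KeyName=MyKeyPair \
        --use-default-roles \
        --instance-groups \
            InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
            InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
        --query 'ClusterId' --output text
}
# Example use:
#   CLUSTER_ID=$(start_cluster)
#   aws emr ssh --cluster-id "$CLUSTER_ID" --key-pair-file MyKeyPair.pem
```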

You can monitor status by watching (and periodically re-loading) the EMR console home page. There's a good bit of useful information on this page as well, if you poke around.
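
As an alternative to reloading the console page, the cluster state can be polled from the command line with aws emr describe-cluster. The loop below is only a sketch: wait_for_cluster is a hypothetical name, the 30-second default delay is an arbitrary choice, and it treats WAITING or RUNNING as "ready to log in".

```shell
# Hypothetical helper: poll until the cluster leaves its startup states.
# Usage: wait_for_cluster CLUSTER_ID [SECONDS_BETWEEN_POLLS]
wait_for_cluster() {
    id="$1"
    delay="${2:-30}"
    while :; do
        state=$(aws emr describe-cluster --cluster-id "$id" \
                    --query 'Cluster.Status.State' --output text) || return 1
        echo "cluster $id: $state"
        case "$state" in
            WAITING|RUNNING) return 0 ;;
            TERMINATING|TERMINATED|TERMINATED_WITH_ERRORS) return 1 ;;
        esac
        sleep "$delay"
    done
}
# Example use:
#   wait_for_cluster j-JEX5UT60ELD5
```

I believe the CLI also has a built-in waiter, aws emr wait cluster-running --cluster-id ..., which blocks until the cluster is up without printing progress.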

9) Once the cluster starts up, Amazon begins charging your account for its existence. So use your cluster right away - and then - when you are all done - TERMINATE IT. The meter keeps running until you do! You can terminate from the command line or from the console web page.
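
For the command-line route, a minimal sketch is below. terminate_cluster is a hypothetical wrapper name; the two aws emr subcommands it calls (terminate-clusters and list-clusters) are the real ones. The listing afterwards is there so you can check that nothing is still running up charges.

```shell
# Hypothetical helper: terminate one cluster, then list what is still active.
terminate_cluster() {
    aws emr terminate-clusters --cluster-ids "$1" || return 1
    # A cluster that is shutting down may still be listed here,
    # in the TERMINATING state, for a short while.
    aws emr list-clusters --active \
        --query 'Clusters[].[Id,Name,Status.State]' --output text
}
# Example use:
#   terminate_cluster j-JEX5UT60ELD5
```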