Amazon Elastic MapReduce information

From Cohen Courses

EMR (Elastic MapReduce) is a popular cloud processing service from Amazon that includes Hadoop. Running Guinea Pig on EMR is easy enough, but there are a lot of steps. This is a walkthrough.

Setting up your AWS account

1) First you need to get an Amazon AWS account. If you already have an Amazon account, you can use the same login to sign in to AWS at https://console.aws.amazon.com.

Installing and configuring the command-line tool on your local machine

2) Install the tools. You need to establish the credentials required to use EC2, the "Elastic Compute Cloud" service that EMR clusters run on, and then use those credentials to launch new virtual clusters in EMR. I use a command-line program (aka a "CLI") to do this, so first install that program, the AWS CLI. The details are here, but briefly: go to a convenient directory, say ~/code/aws-cli, and type

 % curl https://s3.amazonaws.com/aws-cli/awscli-bundle.zip > awscli-bundle.zip
 % unzip awscli-bundle.zip
 % ./awscli-bundle/install -i `pwd`/install
 % export PATH=$PATH:~/code/aws-cli/install/bin/

To test this, type aws --version at the command prompt.
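If the version check fails, the usual culprit is the PATH line. As a minimal sketch (the function name check_aws_cli is my own, not part of the AWS CLI), this reports whether an aws executable is reachable from the current shell:

```shell
# Hypothetical helper: report whether an 'aws' executable is reachable on PATH.
check_aws_cli() {
    if command -v aws >/dev/null 2>&1; then
        # Print the full path so you can confirm it is the copy you just installed.
        echo "found: $(command -v aws)"
        return 0
    else
        echo "aws not found on PATH; re-check the export PATH line" >&2
        return 1
    fi
}
```

Paste the function into your shell and run check_aws_cli. Note that the export only lasts for the current shell session, so you may want to add it to your shell startup file.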

3) Next, you need to get your access key. An "access key" is a pair of codes, a public access key ID and a private secret access key, that the AWS CLI uses to authenticate you. Follow the directions at https://console.aws.amazon.com/iam/home?#security_credential, and save the result somewhere safe and private.

4) Then you need to tell the AWS CLI about your access key. The command for this is 'aws configure': you'll be asked for your codes and some other info. I used these:

 % aws configure
 AWS Access Key ID [None]: ...
 AWS Secret Access Key [None]:  ...
 Default region name [None]: us-east-1
 Default output format [None]: json

The AWS CLI stores this info under your home directory, in ~/.aws/credentials and ~/.aws/config.
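To confirm that the configure step actually wrote something, here is a small sketch that looks for the two files the AWS CLI normally creates. It assumes the default location ~/.aws, and the function name show_aws_config_files is hypothetical:

```shell
# Hypothetical helper: list the files 'aws configure' normally writes.
# Assumes the default directory ~/.aws; pass another directory as $1 to override.
show_aws_config_files() {
    dir="${1:-$HOME/.aws}"
    for f in credentials config; do
        if [ -f "$dir/$f" ]; then
            echo "present: $dir/$f"
        else
            echo "missing: $dir/$f"
        fi
    done
}
```

Running show_aws_config_files with no arguments checks your own home directory. The CLI's own 'aws configure list' command is another way to see which settings are in effect.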

Generate a key-pair file and security group to provide to your clusters

5) Create a key-pair. You'd think one set of codes would be enough, but you're not done yet: you need another set of public/private codes, called a "keypair", to interact with the clusters you create. The details are here, but the quick version is to use these commands (the second makes the key file readable only by you, which keeps the key secret).

 % aws ec2 create-key-pair --key-name MyKeyPair --query 'KeyMaterial' --output text > MyKeyPair.pem
 % chmod 600 MyKeyPair.pem
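ssh refuses to use a private key file that other users can read, so it is worth checking that the chmod took effect. A minimal sketch, with a hypothetical function name key_is_private, that inspects the permission string printed by ls:

```shell
# Hypothetical helper: succeed only if the key file is accessible by its owner alone.
key_is_private() {
    # First ten characters of 'ls -l' output are the file type and permission bits.
    perms=$(ls -l "$1" | cut -c1-10)
    # Accept owner read/write (chmod 600) or owner read-only (chmod 400).
    [ "$perms" = "-rw-------" ] || [ "$perms" = "-r--------" ]
}
```

For example:

 % key_is_private MyKeyPair.pem && echo "permissions look fine"

To check that AWS also recorded the key, 'aws ec2 describe-key-pairs --key-name MyKeyPair' should list it.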

6) Create a security group. This one will let any IP address try to ssh into your cluster (but I believe they need the keypair you use at creation time to be successful). You can specify a narrower range of IPs here instead if you want.

 % aws ec2 create-security-group --group-name MySecurityGroup --description "My security group"
 % aws ec2 authorize-security-group-ingress --group-name MySecurityGroup --protocol tcp --port 22 --cidr 0.0.0.0/0
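If you would rather not open port 22 to the whole internet, the --cidr argument can name a single machine by appending /32 to its address. Here is a sketch of a helper that does that after a basic sanity check. The function name single_host_cidr is my own invention, and the address in the usage line is a documentation placeholder, not a real host:

```shell
# Hypothetical helper: turn one IPv4 address into a single-host CIDR block (a.b.c.d/32).
single_host_cidr() {
    ip="$1"
    # Shape check: four groups of 1-3 digits separated by dots.
    if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        echo "not an IPv4 address: $ip" >&2
        return 1
    fi
    # Range check: every group must be at most 255.
    for octet in $(printf '%s\n' "$ip" | tr '.' ' '); do
        if [ "$octet" -gt 255 ]; then
            echo "octet out of range in: $ip" >&2
            return 1
        fi
    done
    echo "$ip/32"
}
```

You would then substitute the result for 0.0.0.0/0, using your own public IP address in place of the placeholder:

 % aws ec2 authorize-security-group-ingress --group-name MySecurityGroup --protocol tcp --port 22 --cidr "$(single_host_cidr 203.0.113.7)"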