Amazon Elastic MapReduce information

From Cohen Courses

Latest revision as of 11:01, 7 September 2016

EMR (Elastic MapReduce) is a popular cloud processing service from Amazon that includes Hadoop. Running Guinea Pig on EMR is easy enough, but there are a lot of steps. This is a walkthrough.

Initial setup steps

Setting up your AWS account

1) First you need to get an Amazon AWS account. If you have an Amazon account, you can just use that password to log into AWS at https://console.aws.amazon.com.

You should also sign up for Amazon Educate (https://aws.amazon.com/education/awseducate/). I believe this step is not strictly necessary, but it may help if you need support, e.g., if you overspend your funds.

Installing and configuring the command-line tool on your local machine

2) Install the tools: You need to establish the credentials required to use EC2, the "Elastic Compute Cloud" service that EMR runs on, and then use them to launch new virtual clusters in EMR. I use a command-line program (a "CLI") to do this. So first, install that program, the AWS CLI. Full details are in the AWS CLI installation instructions, but briefly, go to a convenient directory, say ~/code/aws-cli, and type

 % curl https://s3.amazonaws.com/aws-cli/awscli-bundle.zip > awscli-bundle.zip
 % unzip awscli-bundle.zip
 % ./awscli-bundle/install -i `pwd`/install
 % export PATH=$PATH:~/code/aws-cli/install/bin/

(Known issue: this requires Python 2.6 or newer.) To test this, type aws --version at the command prompt.
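Note that the export line above only affects the current shell session, so a new terminal will not find aws unless the same line is also added to your shell startup file (for example ~/.bashrc, assuming you use bash). A minimal sketch of a check you could run in a fresh shell follows; require_cmd is a hypothetical helper name, not part of the AWS CLI.

```shell
# Hypothetical helper (not part of the AWS CLI): succeed only if the
# named command can be found on the current PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found on PATH" >&2
        return 1
    fi
}

# Usage, in a fresh shell:
#   require_cmd aws && aws --version
```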

3) Next, you need to get your access key. An "access key" is a pair of codes, one public (the access key ID) and one private (the secret access key), that the AWS CLI tool uses to authenticate you to AWS. Follow the directions here (http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html), and save the result somewhere safe and private. (Question: is this right? do you need to create a username also? --Wcohen (talk) 15:45, 31 August 2015 (EDT))


4) Then you need to tell the AWS CLI about your access codes. The command for this is 'aws configure': you'll be asked for your codes and some other info, and I used these:

 % aws configure
 AWS Access Key ID [None]: ...
 AWS Secret Access Key [None]:  ...
 Default region name [None]: us-east-1
 Default output format [None]: json

The AWS CLI tool stores this info under the .aws directory in your home directory (by default, in ~/.aws/credentials and ~/.aws/config).
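To double-check what aws configure recorded without printing the secret key, you can read the non-secret settings back with the CLI's own "configure get" subcommand. This is only a sketch; show_aws_defaults is a hypothetical wrapper name chosen for illustration.

```shell
# Hypothetical wrapper: print the default region and output format
# that the AWS CLI will use, without touching the secret key.
show_aws_defaults() {
    echo "region: $(aws configure get region)"
    echo "output: $(aws configure get output)"
}

# Usage:
#   show_aws_defaults
```

With the answers shown above, this should report us-east-1 and json.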

Generate a key-pair file, security group, and service role to provide to your clusters

5) Create a key-pair. You'd think one set of codes would be enough, but you're not done yet; you need another set of public/private codes called a "keypair" to interact with the clusters you create. The details are here (http://docs.aws.amazon.com/cli/latest/userguide/cli-ec2-keypairs.html), but the quick version is to use these commands (the second makes the key file readable only by you, which keeps the key secret).

 % aws ec2 create-key-pair --key-name MyKeyPair --query 'KeyMaterial' --output text > MyKeyPair.pem
 % chmod 600 MyKeyPair.pem
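The chmod step matters because ssh will typically refuse to use a private key file that other users can read. A small sketch for checking the file mode later is given here; key_is_private is a hypothetical helper, and it assumes an ls whose long listing starts with the usual ten-character mode string.

```shell
# Hypothetical helper: succeed only if the file is a regular file that is
# readable and writable by its owner and by nobody else (mode 600).
key_is_private() {
    mode=$(ls -l "$1" | cut -c1-10)
    [ "$mode" = "-rw-------" ]
}

# Usage:
#   key_is_private MyKeyPair.pem || chmod 600 MyKeyPair.pem
```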

6) Create a security group. This one will let any IP address try to ssh into your cluster (but I believe they need the keypair you use at creation time to be successful). You can specify a narrower range of IPs here instead if you want.

 % aws ec2 create-security-group --group-name MySecurityGroup --description "My security group"
 % aws ec2 authorize-security-group-ingress --group-name MySecurityGroup --protocol tcp --port 22 --cidr 0.0.0.0/0
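If you would rather not open port 22 to the whole Internet, the same command accepts a narrower CIDR range. The sketch below wraps it in a hypothetical function, authorize_ssh_from; the address range in the usage line is a documentation placeholder, not a real network, so substitute your own.

```shell
# Hypothetical wrapper: allow inbound ssh (TCP port 22) to a security
# group from one CIDR range only.
#   $1 = security group name, $2 = CIDR range
authorize_ssh_from() {
    aws ec2 authorize-security-group-ingress \
        --group-name "$1" --protocol tcp --port 22 --cidr "$2"
}

# Usage (203.0.113.0/24 is a placeholder range):
#   authorize_ssh_from MySecurityGroup 203.0.113.0/24
```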

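7) Create a service role. Full details are available here (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles-creatingroles.html), but here's the short version. This command automatically creates a reasonable role and populates your AWS CLI config file with the role and instance profile information.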

 % aws emr create-default-roles

Creating and using a Cluster

All the other operations you do need only the emr subcommand of the CLI, which is documented here (http://docs.aws.amazon.com/cli/latest/reference/emr/index.html).

8) Create a cluster. You only need to do steps 1-7 once (although each machine you want to work from will need its own config file), and after that you can create a cluster with just one more command. This command is very customizable, but one that works is

 % aws emr create-cluster --ami-version 3.8.0  --ec2-attributes KeyName=MyKeyPair \
   --use-default-roles \
   --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge

The instance-groups stuff defines the cluster you want - this one is tiny, with three nodes. MyKeyPair, which should be the name of the keypair you created in step 5, is how the new cluster will know whether or not to let you in. This will output something like:

 {
     "ClusterId": "j-JEX5UT60ELD5"
 }

which is the ID of the cluster. It will take some time (10min?) to start up, and then you can log into the master node using your keypair as follows, substituting the cluster ID of your own cluster:

 % aws emr ssh --cluster-id j-JEX5UT60ELD5 --key-pair-file MyKeyPair.pem
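To avoid copying the cluster ID by hand, one option is to have the CLI print only the ID and capture it in a shell variable. The sketch below reuses the create-cluster command from step 8 and adds the CLI's global --query and --output options; create_small_cluster is a hypothetical function name, and the cluster settings are simply the ones used above.

```shell
# Hypothetical wrapper around the step-8 command: create the small
# three-node cluster and print only its cluster ID.
#   $1 = name of the key pair created in step 5
create_small_cluster() {
    aws emr create-cluster --ami-version 3.8.0 \
        --ec2-attributes KeyName="$1" \
        --use-default-roles \
        --instance-groups \
            InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
            InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
        --query ClusterId --output text
}

# Usage:
#   CLUSTER_ID=$(create_small_cluster MyKeyPair)
#   aws emr ssh --cluster-id "$CLUSTER_ID" --key-pair-file MyKeyPair.pem
```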

You can monitor status by watching (and periodically reloading) the EMR console home page (https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1#cluster-list:). There's a good bit of useful information on this page as well, if you poke around.
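The cluster state can also be checked from the command line with "aws emr describe-cluster". What follows is a rough polling sketch rather than a definitive tool: cluster_state and wait_for_cluster are hypothetical helper names, and the 30-second default interval is an arbitrary choice.

```shell
# Hypothetical helper: print the current state of a cluster
# (for example STARTING, BOOTSTRAPPING, RUNNING, WAITING, TERMINATED).
cluster_state() {
    aws emr describe-cluster --cluster-id "$1" \
        --query Cluster.Status.State --output text
}

# Hypothetical helper: poll until the cluster is usable or has ended.
#   $1 = cluster ID, $2 = seconds between polls (default 30)
wait_for_cluster() {
    while :; do
        state=$(cluster_state "$1")
        echo "cluster $1: $state"
        case "$state" in
            WAITING|RUNNING) return 0 ;;
            TERMINATING|TERMINATED|TERMINATED_WITH_ERRORS) return 1 ;;
            "") echo "could not read cluster state" >&2; return 2 ;;
        esac
        sleep "${2:-30}"
    done
}

# Usage:
#   wait_for_cluster j-JEX5UT60ELD5 && echo "ready to ssh"
```

Recent versions of the CLI also appear to offer a built-in waiter, "aws emr wait cluster-running", which may be simpler if your installed version has it.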

9) Once the cluster starts up, Amazon begins charging your account for its existence. So use your cluster right away - and then - when you are all done - TERMINATE IT. The meter keeps running until you do! You can terminate from the command line or from the console web page.
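For the command-line route, the emr subcommand provides terminate-clusters, and list-clusters with --active is a handy way to confirm that nothing is still running. A minimal sketch follows; the two function names are hypothetical wrappers, and the cluster ID in the usage lines is the example ID from above.

```shell
# Hypothetical wrapper: request termination of one cluster.
terminate_cluster() {
    aws emr terminate-clusters --cluster-ids "$1"
}

# Hypothetical wrapper: list the ID, name and state of every cluster
# that is still active (and therefore still being billed).
list_active_clusters() {
    aws emr list-clusters --active \
        --query 'Clusters[].[Id,Name,Status.State]' --output text
}

# Usage:
#   terminate_cluster j-JEX5UT60ELD5
#   list_active_clusters
```

Termination is not instantaneous, so a cluster may still show up as TERMINATING for a short while after the first command.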