Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
Important: This article applies to the Enterprise plan (with AWS Cloud) only.
In this step, you launch the Amazon EMR cluster by using either the Amazon EMR console or the AWS CLI.
Sign in to the AWS Management Console and open the Amazon EMR console.
Choose Create cluster.
Choose Go to advanced options.
On the Software and Steps step, accept the default values except for the following fields and choose Next. For more information, see Configure Cluster Software.
On the Hardware step, accept the default values except for the following fields and choose Next. For more information, see Configure Cluster Hardware and Networking.
On the General Cluster Settings step, add the bootstrap action to configure the Amazon EMR cluster to work with ZEPL and choose Next.
Note: If you don't want to add the bootstrap action, you can configure the Amazon EMR cluster manually. For more information, see Configure the Amazon EMR cluster.
On the Security step, accept the default values except for the following fields and choose Create cluster.
Proceed to the next step.
To launch a cluster that works with ZEPL, type the following command. Replace myKey with the name of your EC2 key pair and subnet-xxxxx with the VPC subnet in which to create the cluster. For more information, see AWS CLI Command Reference.
$ aws emr create-cluster --name "ZEPL cluster" --release-label emr-5.14.0 \
    --use-default-roles --ec2-attributes KeyName=myKey,SubnetId=subnet-xxxxx \
    --applications Name=Hadoop Name=Hive Name=Spark \
    --instance-count 3 --instance-type m3.xlarge \
    --bootstrap-actions Path="s3://zepl/emr/bootstrap.sh"
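A newly created cluster takes some time to start. The following is a minimal sketch of waiting for it from the shell. It assumes that the AWS CLI is installed and configured, and that you pass the cluster ID printed by the create-cluster command. The function name wait_for_cluster is a hypothetical helper, not part of ZEPL or the AWS CLI.

```shell
# Hypothetical helper: block until the cluster is up, then print its state.
# $1 = cluster ID (the j-... value returned by "aws emr create-cluster").
# Assumes the AWS CLI is installed and configured with credentials and a region.
wait_for_cluster() {
    cluster_id="$1"
    # The waiter polls the cluster until it is running, or fails if it terminates.
    aws emr wait cluster-running --cluster-id "$cluster_id" || return 1
    # Print the current cluster state as plain text.
    aws emr describe-cluster --cluster-id "$cluster_id" \
        --query 'Cluster.Status.State' --output text
}
```

For example, run wait_for_cluster with your own cluster ID as the only argument before you continue with the next step.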
Note: If you created the cluster with the bootstrap action that configures it to work with ZEPL, you can skip this step.
To configure the Amazon EMR cluster manually, connect to the master node using SSH.
In the SSH session on the master node, type the following commands to configure the cluster to work with ZEPL:
$ hadoop fs -get s3://zepl/emr/bootstrap.sh /usr/tmp/
$ chmod +x /usr/tmp/bootstrap.sh
$ /usr/tmp/bootstrap.sh
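The manual configuration above can also be run from your workstation in a single SSH call. The following is a sketch only. It assumes the default Amazon EMR login user hadoop, that SSH access to the master node is allowed, and that you supply the private key file of your EC2 key pair. The function name configure_master is hypothetical.

```shell
# Hypothetical helper: run the ZEPL bootstrap script on the master node over SSH.
# $1 = path to the private key file of the EC2 key pair
# $2 = public DNS name of the master node
# Assumes the default EMR login user "hadoop" and that port 22 is reachable.
configure_master() {
    key_file="$1"
    master_dns="$2"
    # Same three commands as in the manual procedure, chained so that a
    # failure in one step stops the remaining steps.
    ssh -i "$key_file" "hadoop@$master_dns" \
        'hadoop fs -get s3://zepl/emr/bootstrap.sh /usr/tmp/ && chmod +x /usr/tmp/bootstrap.sh && /usr/tmp/bootstrap.sh'
}
```

Call it with the key file path and the master public DNS name as the two arguments.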
To connect to the Amazon EMR cluster from ZEPL, create a dedicated resource for the cluster.
Select the image that exactly matches the release of your EMR cluster. For more information, see Docker Images Provided by ZEPL.
ZEPL manages the following Docker images that are available when creating the resources.
Note: If you do not find your image on this page, it most likely contains components that ZEPL does not support yet. Please contact ZEPL to get support for your configuration.
|EMR Release|Application Versions|
|---|---|
|5.12.1|Hadoop 2.8.3, Spark 2.2.1, Hive 2.3.2|
|5.14.0|Hadoop 2.8.3, Spark 2.3.0, Hive 2.3.2|
In ZEPL, create a new notebook and select the EMR resource that you created in the previous step. After the notebook is created, configure the Spark interpreter to connect to the Amazon EMR cluster.
To connect to the Amazon EMR cluster from ZEPL, you need the public DNS name of the master node. You can retrieve the master public DNS name using the Amazon EMR console or the AWS CLI.
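As an illustration of the AWS CLI route, the following sketch prints the master public DNS name for a given cluster ID. It assumes a configured AWS CLI. The function name get_master_dns is hypothetical.

```shell
# Hypothetical helper: print the public DNS name of the master node.
# $1 = cluster ID (for example, one listed by "aws emr list-clusters --active").
# Assumes the AWS CLI is installed and configured with credentials and a region.
get_master_dns() {
    aws emr describe-cluster --cluster-id "$1" \
        --query 'Cluster.MasterPublicDnsName' --output text
}
```

The printed value is the name to use when you configure the Spark interpreter.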
In the notebook, configure the Spark interpreter by adding the following paragraph:
Note: Replace ip-10-0-1-56.ap-northeast-1.compute.internal with the public DNS name of the master node.
To access data in Amazon S3, set the AWS access key ID and secret access key.
In the notebook, use the following commands to set the AWS credentials:
%spark
sc.hadoopConfiguration.set("fs.s3a.access.key", "<AWS-ACCESS-KEY-ID>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<AWS-SECRET-ACCESS-KEY>")
Use the s3:// prefix to access the data in Amazon S3. For more information, see Work with Storage and File Systems.
Amazon EMR uses YARN (Yet Another Resource Negotiator), which is a component introduced in Apache Hadoop 2.0 to centrally manage cluster resources for multiple data-processing frameworks.
For more information, see View Web Interfaces Hosted on Amazon EMR Clusters.
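One common way to reach the YARN ResourceManager web interface is an SSH tunnel with local port forwarding. The following is a sketch under stated assumptions: the ResourceManager listens on its default port 8088 on the master node, the login user is the EMR default hadoop, and the local port 8157 is free. The function name open_yarn_tunnel is hypothetical.

```shell
# Hypothetical helper: forward a local port to the YARN ResourceManager web UI.
# $1 = path to the private key file of the EC2 key pair
# $2 = public DNS name of the master node
# $3 = local port to listen on (optional, defaults to 8157)
# Assumes the ResourceManager uses its default port 8088 on the master node.
open_yarn_tunnel() {
    key_file="$1"
    master_dns="$2"
    local_port="${3:-8157}"
    # -N: do not run a remote command; -L: forward the local port to the master.
    ssh -i "$key_file" -N -L "${local_port}:${master_dns}:8088" "hadoop@$master_dns"
}
```

While the tunnel is open, the interface should be reachable in a browser at http://localhost:8157/ (or at the local port you chose). Press Ctrl+C to close the tunnel.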