EMR is an Amazon service that lets you run use cases on single-purpose, short-lived clusters that automatically scale to meet demand, or on long-running, highly available clusters using multi-master deployment mode. The ability to expand or shrink hardware processing hours based on your needs is useful for scheduling jobs that require resources that are limited or unavailable at busy times, and for ad hoc workloads with fluctuating resource requirements.

Setting up Datameer on EMR

INFO: As of version 7.5, Datameer supports Hive within EMR 5.24 (and newer). If you require more specific information about Hive integration, please contact your Datameer service team member.

INFO: To set up Datameer on Amazon EMR, you must be logged in with Datameer administrator privileges.

To set up Datameer:

  1. Open the Admin tab and select "Hadoop Cluster". The current configuration for your Hadoop cluster is displayed.
  2. Click "Edit". The configuration page opens.
  3. Select "EMR Hadoop Cluster" from the drop-down in the section "Cluster Mode".
  4. Enter your Amazon S3 bucket address and the path to the storage folder.
    INFO: Datameer uses S3 as storage for all files, both permanent and intermediate, for additional security.
  5. If needed, activate the check box "Use EC2 IAM Role?" to authenticate via IAM role.
  6. If needed, authenticate to your S3 bucket using your key/secret.
  7. Select the mode for connecting to the EMR Cluster from the drop-down and set the configuration.
    INFO: EMR Cluster Name and EMR Cluster ID are validated when saving the configuration.
    INFO: With EMR Cluster Name mode, enter the name of the cluster running EMR. Set the polling interval time in seconds for Datameer to check if there is a cluster with the name entered above.
    INFO: With EMR Cluster Id mode, enter the ID of the cluster running EMR.
    INFO: With YARN Resource Manager mode, you can provide the EMR Cluster master node hostname directly.
  8. Configure the default property values or enter additional Hadoop distribution specific properties as well as custom properties.
  9. Select the severity of messages to be logged and confirm with "Save". Configuring the EMR Cluster is finished.
    INFO: The logging customization field allows you to record exactly what is needed.
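
The polling behavior of the EMR Cluster Name mode can be pictured with a small Python sketch. It is not Datameer's implementation: the function `wait_for_cluster`, its parameters, and the injected `list_cluster_names` lookup are hypothetical names chosen for illustration, and the attempt limit is an assumption that keeps the example finite.

```python
import time
from typing import Callable, Iterable, Optional


def wait_for_cluster(
    cluster_name: str,
    list_cluster_names: Callable[[], Iterable[str]],
    polling_interval_seconds: float,
    max_attempts: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Poll until a cluster with the given name is listed.

    Returns the 1-based attempt on which the name was found,
    or None if it never appeared within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        if cluster_name in set(list_cluster_names()):
            return attempt
        if attempt < max_attempts:
            # Wait one polling interval before checking again.
            sleep(polling_interval_seconds)
    return None


# Usage with a fake lookup: the cluster shows up on the third check.
# The sleep function is replaced by a recorder so the example runs instantly.
responses = iter([[], ["other-cluster"], ["other-cluster", "analytics-emr"]])
waits = []
found_on = wait_for_cluster(
    "analytics-emr", lambda: next(responses), 30, 5, sleep=waits.append
)
```

In this sketch a shorter interval finds a newly started cluster sooner at the cost of more lookups, which is the trade-off the polling interval setting controls.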

Security

S3 Authentication

When configuring the EMR Hadoop Cluster, you are presented with two options for authenticating to S3:

  • Access Key/Secret
  • IAM Role

Access Key/Secret

Datameer uses the Amazon S3 REST API, which in turn uses a custom HTTP scheme based on a keyed-HMAC (Hash Message Authentication Code) for authentication. To authenticate a request, you first concatenate selected elements of the request to form a string. You then use your AWS secret access key to calculate the HMAC of that string. The output of the HMAC algorithm is called the signature because it simulates the security properties of a real signature. This signature is added to the request in the standard HTTP Authorization header using the syntax "Authorization: AWS AWSAccessKeyId:Signature".

When the system receives an authenticated request, it fetches the AWS secret access key that you claim to have and uses it in the same way to compute a signature for the message it received. It then compares the signature it calculated against the signature presented by the requester. If the two signatures match, the system concludes that the requester must have access to the AWS secret access key and therefore acts with the authority of the principal to whom the key was issued. If the two signatures do not match, the request is dropped and the system responds with an error message.
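
The sign-and-verify flow described above can be sketched in Python with the standard library. This is a minimal illustration, not Datameer's or Amazon's code. It assumes the legacy scheme in which the signature is the Base64-encoded HMAC-SHA1 of the string to sign. The key values, the example string to sign, and the helper names `sign`, `authorization_header`, and `verify` are all hypothetical.

```python
import base64
import hashlib
import hmac
from typing import Callable, Optional


def sign(secret_access_key: str, string_to_sign: str) -> str:
    """Compute the Base64-encoded HMAC-SHA1 of the string to sign (assumed scheme)."""
    digest = hmac.new(
        secret_access_key.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha1,
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def authorization_header(access_key_id: str, secret_access_key: str, string_to_sign: str) -> str:
    """Build the header value in the form 'AWS AWSAccessKeyId:Signature'."""
    return f"AWS {access_key_id}:{sign(secret_access_key, string_to_sign)}"


def verify(
    header_value: str,
    lookup_secret: Callable[[str], Optional[str]],
    string_to_sign: str,
) -> bool:
    """Receiver side: recompute the signature with the stored secret and compare."""
    scheme, _, credentials = header_value.partition(" ")
    access_key_id, _, presented = credentials.partition(":")
    secret = lookup_secret(access_key_id)
    if scheme != "AWS" or secret is None:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(sign(secret, string_to_sign), presented)


# Hypothetical credentials and request elements, for illustration only.
secrets = {"AKIDEXAMPLE": "example-secret-key"}
request_string = "GET\n\n\nTue, 27 Mar 2007 19:36:42 +0000\n/examplebucket/data.csv"
header = authorization_header("AKIDEXAMPLE", secrets["AKIDEXAMPLE"], request_string)

accepted = verify(header, secrets.get, request_string)
rejected = verify(header, secrets.get, request_string.replace("GET", "PUT"))
```

Changing any signed element of the request, here the HTTP verb, produces a different signature, so the tampered request is rejected.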

IAM Role

IAM roles provide a convenient alternative to using access key/secret for authenticating to S3 from Amazon EC2 instances. When this option is selected, Datameer's S3 client uses the instance profile credentials to sign and authenticate the S3 requests. Instance profile credentials exist within the instance metadata associated with the IAM role for the EC2 instance. The EC2 instance on which Datameer runs is launched with the appropriate IAM role/instance profile. The same is used for launching the EMR Cluster. It is usually sufficient to use the default EC2 instance profile, EMR_EC2_DefaultRole, to launch both the EMR Cluster and the Datameer EC2 instance. The EMR instance, EC2 instance and S3 Bucket must be in the same AWS Region.
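
To make the instance profile credentials more concrete, the following Python sketch parses a credentials document of the kind the EC2 instance metadata service returns for a role (typically read from `http://169.254.169.254/latest/meta-data/iam/security-credentials/<role-name>` on the instance). The sample values are placeholders, the helper names are hypothetical, and the sketch does not show how Datameer's S3 client handles these credentials internally.

```python
import json
from datetime import datetime, timezone

# Placeholder document in the shape returned by the instance metadata service.
SAMPLE_DOCUMENT = """
{
  "Code": "Success",
  "LastUpdated": "2024-01-01T10:00:00Z",
  "Type": "AWS-HMAC",
  "AccessKeyId": "ASIAEXAMPLE",
  "SecretAccessKey": "example-secret",
  "Token": "example-session-token",
  "Expiration": "2024-01-01T16:00:00Z"
}
"""


def parse_instance_profile_credentials(document: str) -> dict:
    """Extract the temporary credentials and their expiry time (UTC)."""
    data = json.loads(document)
    expiration = datetime.strptime(
        data["Expiration"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "access_key_id": data["AccessKeyId"],
        "secret_access_key": data["SecretAccessKey"],
        "session_token": data["Token"],
        "expiration": expiration,
    }


def is_expired(credentials: dict, now: datetime) -> bool:
    """Temporary credentials must be refreshed once the expiry time is reached."""
    return now >= credentials["expiration"]


credentials = parse_instance_profile_credentials(SAMPLE_DOCUMENT)
still_valid = not is_expired(
    credentials, datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
)
```

Because these credentials are temporary and tied to the role, no long-lived access key or secret has to be stored in the Datameer configuration.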

...

Datameer uses S3 as storage for both permanent and intermediate files. Datameer does not write any intermediate or cached data locally on the cluster or to HDFS. The following diagram gives a high-level overview of supported encryption mechanisms.

Encrypting Data at Rest

Datameer supports encrypting data at rest on S3. The following server-side encryption mechanisms for S3 are supported within Datameer. Datameer does not support Amazon S3 Client-Side Encryption.

...

On the Datameer side, you need to specify the custom property das.fs.s3-bucket.encryption.type=KMS. When this property is present, Datameer uses the appropriate encryption header in all S3 requests to ensure encryption for all objects that are stored in your bucket. Both authentication mechanisms, instance profile and access key/secret based credentials, are supported.
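
As an illustration of how such a property could translate into a request header, the Python sketch below reads a properties snippet and derives the S3 server-side encryption header. Only the property name and value come from the text above. The mapping of `KMS` to the header `x-amz-server-side-encryption: aws:kms` is an assumption based on the standard S3 header for SSE-KMS, and the helper names are hypothetical, not Datameer's implementation.

```python
ENCRYPTION_PROPERTY = "das.fs.s3-bucket.encryption.type"


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


def encryption_headers(properties: dict) -> dict:
    """Return the headers to add to each S3 write request (assumed mapping)."""
    if properties.get(ENCRYPTION_PROPERTY) == "KMS":
        return {"x-amz-server-side-encryption": "aws:kms"}
    return {}


custom_properties = """
# Custom properties entered in the Hadoop Cluster configuration
das.fs.s3-bucket.encryption.type=KMS
"""

headers = encryption_headers(parse_properties(custom_properties))
```

Without the property, the sketch adds no encryption header, which mirrors the statement that the header is sent when the property is present.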

Encrypting Data in Transit

Datameer implicitly supports encryption mechanisms for data in transit that are supported by Amazon EMR and S3. The encryption mechanisms are EMR release version and application (e.g., Hadoop, Tez, etc.) specific, and do not require any special handling in Datameer other than configuration.

...

To enforce in-transit encryption for all calls from the client browser to the Datameer app server, SSL must be enabled in Jetty. If using a custom certificate, install it on the Datameer EC2 instance. These instructions are provided in Datameer's Installation Guide.

Using the REST API for EMR

Datameer's REST API is available to view and update EMR configurations. See Datameer's EMR REST API.