Set Up Datameer on YARN

Datameer can connect to a YARN cluster, depending on the Hadoop distribution used.

An admin completes the configuration on the Hadoop Cluster configuration page. This page requests different settings for a YARN cluster than for a classic MapReduce cluster.

The following new settings must be configured by a Datameer admin on the Hadoop Cluster configuration page (the user interface suggests default values to use):

Configuration                                Type
Resource Tracker Address                     host:port
Yarn Resource Manager Address                host:port
Resource Manager Webapp Address              host:port
Yarn Resource Manager Scheduler Address      host:port
Yarn MR Job History Address                  host:port
Yarn Application Classpath                   string: class path
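
For illustration, these fields typically correspond to the standard YARN daemon addresses. The hostnames below are placeholders and the ports are the common Hadoop 2 defaults; check yarn-site.xml and mapred-site.xml on your cluster for the actual values:

Resource Tracker Address:                    resourcemanager.example.com:8031
Yarn Resource Manager Address:               resourcemanager.example.com:8032
Resource Manager Webapp Address:             resourcemanager.example.com:8088
Yarn Resource Manager Scheduler Address:     resourcemanager.example.com:8030
Yarn MR Job History Address:                 historyserver.example.com:19888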

When working in secure (Kerberos) mode, the Hadoop Cluster configuration page also asks for the YARN Principal.
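
For example, a YARN principal commonly follows the pattern below; EXAMPLE.COM is a placeholder realm, and Hadoop expands _HOST to the local hostname:

yarn/_HOST@EXAMPLE.COM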

Debugging options

Some options for better debugging:

By default, container logs are deleted once a container is released, so you never see the actual reason a container failed. The log message in the node manager's log is not sufficient.

For better and faster debugging, set the following configuration in yarn-site.xml (on all nodes):

<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>-1</value>
</property>

The value is the number of seconds to keep container data after the container finishes; any positive integer works, and -1 means the data is never deleted.

Restart the node managers for the change to take effect. Afterwards, you should see container logs in the YARN log folder.

You can also inspect the launch_container.sh script to check the runtime configuration of an application container.
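
A minimal sketch of this workflow on a Hadoop 2 worker node; the application, container, and user IDs are placeholders, and the exact paths depend on yarn.nodemanager.log-dirs and yarn.nodemanager.local-dirs in your installation:

# Restart the node manager so the new yarn-site.xml takes effect (Hadoop 2 layout)
$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager

# Inspect the retained container logs (under yarn.nodemanager.log-dirs)
ls /var/log/hadoop-yarn/userlogs/application_<id>/container_<id>/
cat /var/log/hadoop-yarn/userlogs/application_<id>/container_<id>/stderr

# Inspect the generated launch script (under yarn.nodemanager.local-dirs)
cat <local-dir>/usercache/<user>/appcache/application_<id>/container_<id>/launch_container.sh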

Common YARN Configurations

# Configs set in one of our SX Lab environments
das.execution-framework.small-job.max-records=1000000
das.execution-framework.small-job.max-uncompressed=1000000000
tez.am.container.reuse.enabled=true
tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native/
tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_HOME/lib/native/
#das.tez.reduce-tasks-per-node=6
#tez.shuffle-vertex-manager.desired-task-input-size=50485760
das.tez.session-pool.max-cached-sessions=3
das.tez.session-pool.max-idle-time=600s
tez.am.session.min.held-containers=1
tez.am.container.idle.release-timeout-min.millis=60000
tez.am.container.idle.release-timeout-max.millis=60000
das.execution-framework=Smart
das.map-tasks-per-node=16


### We've seen this in virtual environments
mapreduce.map.memory.mb=1536
mapreduce.map.java.opts=-Xmx3226m
mapreduce.reduce.memory.mb=2240
mapreduce.reduce.java.opts=-Xmx4704m
yarn.app.mapreduce.am.command-opts=-Xmx2016m
yarn.scheduler.minimum-allocation-vcores=1
yarn.scheduler.maximum-allocation-vcores=8
yarn.nodemanager.resource.memory-mb=13440
yarn.scheduler.minimum-allocation-mb=1024
yarn.app.mapreduce.am.resource.cpu-vcores=1
mapreduce.reduce.cpu.vcores=1
yarn.nodemanager.resource.cpu-vcores=8
yarn.app.mapreduce.am.resource.mb=2240
yarn.scheduler.maximum-allocation-mb=13440
mapreduce.map.cpu.vcores=1
mapreduce.task.io.sort.mb=896
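
A quick sanity check on these numbers: with yarn.nodemanager.resource.memory-mb=13440 and yarn.nodemanager.resource.cpu-vcores=8, one node can run at most floor(13440 / 1536) = 8 map containers or floor(13440 / 2240) = 6 reduce containers by memory, and at most 8 containers by vcores (each task requests one vcore). Note that the -Xmx values above exceed the corresponding *.memory.mb container sizes, whereas the heap is usually set somewhat below the container size, so treat these as environment-specific values to verify rather than as a template.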

# Smaller jobs got better performance for the customer FLS Connect with this setting (52428800 bytes = 50 MiB):
tez.shuffle-vertex-manager.desired-task-input-size=52428800

Configuring Datameer When Upgrading from MR1 to MR2

Complete the following steps to configure Datameer when upgrading from MR1 to MR2:

  1. Stop Datameer.
  2. Back up the dap database.
  3. Execute the following query on the dap database (a verification query is shown after this list):

    UPDATE property SET value = "LOCAL" WHERE name = "hadoop.mode";
  4. Start Datameer.
  5. Log in and configure the MapR 4 MR2 settings on the Admin > Hadoop Cluster tab.
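
To confirm the change in step 3 before restarting, a minimal check on the dap database (same MySQL-style quoting as the update above):

SELECT name, value FROM property WHERE name = "hadoop.mode";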

Configuring YARN with High Availability

To configure YARN with High Availability, see Configuring Datameer.