Checklist for Hadoop Administrators

This page describes only the minimal set of steps needed for Datameer to operate properly in your Hadoop environment. For a more complete guide to Hadoop cluster design, configuration and tuning, see Hadoop Cluster Configuration Tips or the additional resources section below.

User Identity

Your Hadoop cluster is accessed by Datameer as a particular Hadoop user. By default, the user is identified as the UNIX user who launched the Datameer application (equivalent to the UNIX command 'whoami'). To ensure this works properly, you should create a user of the same name within Hadoop's HDFS for Datameer to use exclusively for scheduling and configuration. This ensures the proper permissions are set, and that this user is recognized whenever the Datameer application interacts with your Hadoop cluster.

The username used to launch the Datameer application can be configured in <Datameer folder>/etc/das-env.sh

Permissions in HDFS

When running Datameer with an on-premise Hadoop cluster (called "distributed mode"), you need to define an area (folder) within HDFS for Datameer to store its private data. The Hadoop user created for Datameer should have read/write access to this folder.

Hadoop permissions can be set with the following commands. Note that permission checking can be enabled/disabled globally for HDFS. See Configuring Datameer in a Shared Hadoop Cluster for more information.

hadoop -fs
      [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
      [-chown [-R] [OWNER][:[GROUP]] PATH...]
      [-chgrp [-R] GROUP PATH...]

If you experience a problem with the permissions of the staging directory when submitting jobs other then with the superuser. There are two solutions for this:

  1. Configure mapreduce.jobtracker.staging.root.dir to /user in the mapred-site.xml of the cluster.
  2. Or, change the permissions of the hadoop.tmp.dir (usually /tmp) inside the hdfs to 777.

See the Hadoop documentation for additional information.

Local UNIX Permissions

Various components of your Hadoop and Datameer environment (data nodes, task trackers, JDBC drivers, the Datameer application server) use a folder for temporary storage or "scratch" space. Additionally, HDFS files are sometimes stored in this location by default (see the Hadooop documentation).

Datameer searches internally for the possible locations (folder/directory) that have the most free space in the local filesystem to write temporary files.

Possible options:

  • mapred.local.dir (Property is used in old MR API)
  • mapreduce.cluster.local.dir (Property for new MR API)
  • tez.runtime.framework.local.dirs (Property for TEZ mode)
  • yarn.nodemanager.local-dirs (Property for Yarn)
  • Java system temp folder

For each machine (the Datameer server, Hadoop master and slaves), you must ensure that the local UNIX user running these processes (often "hadoop" or "datameer", depending on the component) has read/write permissions for these folders, and of course to /tmp (commonly used by some components). Different components (e.g. Hive) can use different, configurable locations for scratch space. Check your configuration files for details.

In particular, Datameer makes use of the directory mapred.local.dir defined in your Hadoop configuration. Check that Datameer has write access to this folder, both on the Datameer server and cluster nodes. If not properly configured, you might see cryptic exceptions like the one reported here: https://issues.apache.org/jira/browse/MAPREDUCE-635

Make sure the Datameer application is consistently run by the same user, so that certain log files and other temporary data aren't written as root and then locked when Datameer is subsequently started by another user. This can lead to inconsistent states which are difficult to troubleshoot (without access to the log files).

Network Connectivity

Hadoop Clusters are often secured behind firewalls, which might or might not be configured to allow Hadoop direct access to all data sources. The systems containing data to be analyzed, including web servers, databases, data warehouses, hosted applications, message queues, mobile devices, etc. are often in remote locations and accessible through various specific TCP ports. Normally, Datameer is installed in the same LAN as Hadoop, and therefore suffers from the same connectivity issues, which can prevent you from using Datameer. Conversely, if Datameer is located further away from Hadoop, it might have connectivity to data but not the Hadoop NameNode and Job Tracker, for the same reasons. Most of these issues aren't anticipated early in the life cycle of a Hadoop project, and thus firewalls need to be reconfigured. Finally, the speed and latency of the network link between Datameer, Hadoop, and your data can severely affect system performance and the end user experience. For these reasons, it's important to thoroughly review all network connectivity requirements up front. 

Before you start using Datameer, check that the following network connectivity is available on the specified ports:

  1. Datameer to Hadoop NameNode client port: (normally 8020, 9000 or 54310)
  2. Datameer to Hadoop JobTracker client port: (normally 8021, 9001 or 54311)
  3. Datameer to Hadoop DataNodes (normally ports 50010 and 50020)
  4. (Recommended): Datameer to Hadoop Administration consoles (normally 50030 and 50070)

Depending on your configuration, the following connectivity might also be necessary:

  1. Importing/exporting via SFTP:
    1. Datameer to source (normally port 22)
    2. Hadoop slaves to source (normally port 22)
  2. Importing/linking/exporting to/from an RDBMS:
    1. Datameer to RDBMS via JDBC (various ports, see your DB documentation)
    2. Hadoop slaves to RDBMS via JDBC (various ports, see your DB documentation)
  3. Connecting to Hive:
    1. Hive Metastore (normally port 10000)
    2. Location of external Hive tables, if applicable (e.g. S3 tables via port 443)
  4. Connecting to HBase:
    1. Zookeeper client port (normally 2181)
    2. HBase master (normally 60010)

Installed Compression Libraries

For standard Hadoop compression algorithms, you can choose the algorithm Datameer should use. However, if your Hadoop cluster is using a non-standard compression algorithm such as LZO, you need to install these libraries onto the Datameer machine. This is necessary so that Datameer can read the files it writes to HDFS, and decompress files residing on HDFS which you wish to import. Libraries which utilize native compression require both a Java (JAR) and native code component (UNIX packages). The Java component is a JAR file which needs to be placed into <Datameer folder>/etc/custom-jars. See Frequently Asked Hadoop Questions#Q. How do I configure Datameer/Hadoop to use native compression? for more details.

The configuration of Hadoop compression can drastically affect Datameer performance. See Hadoop Cluster Configuration Tips for more information.

Connecting Datameer to Your Hadoop Cluster

By default, Datameer isn't connected to any Hadoop cluster and operates in local mode, with all analytics and other functions performed by a local instance of Hadoop, which is useful for prototyping with small data sets, but not for high volume testing or production. To connect Datameer to a Hadoop cluster, change the settings in Datameer in the Admin > Hadoop Cluster page. Click the Admin tab at the top of the page and click the Hadoop Cluster tab in the left column. Click Edit and change the Mode setting.

These values can also be configured via /conf/live.properties. See /conf/default.properties for guidance.

The Datameer Private Folder must be set to the HDFS folder chosen in step 2, as seen below:

While running, Datameer stores imported data, sample data, workbook results, logs, and other information in this area (including subfolders such as importjobs, jobjars, temp). You need to make sure no other application interferes with data in this area of HDFS.

Mandatory Hadoop settings

Hadoop settings (properties) can be configured by Datameer in three places:

  1. Global settings in Datameer are located in the under the Administration tab in Hadoop Cluster.
  2. Per-job properties are configured for individual import/analytics jobs
  3. Property files (under /conf)

Setting the following properties ("propertyname=value") ensures that Datameer runs properly on your Hadoop cluster:

Caution

Datameer sets numerous Hadoop properties in the /conf folder for performance and other reasons, specifically das-job.properties. Don't change these properties without a clear understanding of what they are AND advice from Datameer as altering them they can cause Datameer jobs to fail.

WARNING

If any settings for the Hadoop cluster (mapred-site.xml, hdfs-site.xml, etc.) are set to FINAL, this overrides mandatory and optional Datameer settings, and Datameer might not work properly, or at all. Verify that FINAL is set to false for all settings configured by your Hadoop administrator or defaults set by your Hadoop distribution.

Name

Value

Description

Location

mapred.map.tasks.speculative.execution
mapred.reduce.tasks.speculative.execution

false

Datameer currently doesn't support speculative execution. However you don't need to configure these properties cluster-wide. Datameer disables speculative execution for every job it submits. You must ensure that these properties aren't set to 'true' cluster-wide with the final parameter also set to 'true'. This prevents a client from changing this property on a job by job basis.

das-job.properties

mapred.job.reduce.total.mem.bytes

0

Datameer turns off in memory shuffle - this could lead to an 'out of memory exception' in the reduce phase.

das-job.properties

mapred.map.output.compression.codec
mapred.output.compression.codec

usually one of:
org.apache.hadoop.io.compress.DefaultCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec
com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec

Datameer cares about the what and the when of compression but not about the how. It uses the codec you have configured for your cluster. If you would like to change the codec to another available one, then set these properties in Datameer. Otherwise, Datameer uses the default in mapred-site.xml on the cluster. Furthermore, if you have configured a non-standard codec like LZO, it is necessary to install this codec on the machine running Datameer. See Frequently Asked Hadoop Questions#Q. How do I configure Datameer/Hadoop to use native compression?

mapred-site.xml

Highly Recommended Hadoop Settings

Name

Value

Description

Location

mapred.map.child.java.opts
mapreduce.reduce.java.opts
(hadoop-0.21, cdh3b3 only)

-Xmx1024M or more

By default, Datameer is configured to work with a minimum 1024 MB heap. Based on your slot configuration, you might have significantly more memory available per JVM

mapred-site.xml

mapred.child.java.opts
(<=hadoop-0.21)

-Xmx1024M or more

By default, Datameer is configured to work with a minimum 1024 MB heap. Based on your slot configuration, you might have significantly more memory available per JVM

mapred-site.xml

mapred.tasktracker.dns.interface
dfs.datanode.dns.interface


Hadoop nodes often have multiple network interfaces (internal vs. external). Explicity choosing an interface can avoid problems.


java.net.preferIPv4Stack

true

Disabling Ipv6 in Java can resolve problems and improve throughput. Set this property to ensure Ipv6 isn't used.

This can be configured in $HADOOP_OPTS environment variable on your Hadoop cluster.

mapreduce.jobtracker.staging.root.dir

/user

Avoids permission problems when a user other than superuser schedules jobs. (hadoop-0.21, cdh3b3 only)

mapred-site.xml (on the cluster site - doesn't work if configured as a custom property in Datameer)

The ipc.client.connect.max.retries=5 Hadoop parameter is hard-coded in Datameer and can't be changed.

Highly Recommended YARN Settings

NameValueDescriptionLocation

yarn.app.mapreduce.am.staging-dir

/user

The staging dir used while submitting jobs.

YARN requires a staging directory for temporary files created by running jobs. By default it creates /temp/hadoop-yarn/staging with restrictive permissions that might prevent your users from running jobs. To forestal this, you should configure and create the staging directory yourself. 

mapred-site.xml
yarn.application.classpath

CLASSPATH for YARN applications. A comma-separated list of CLASSPATH entries. When this value is empty, the following default CLASSPATH for YARN applications would be used.

For Linux: $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, $HADOOP_COMMON_HOME/share/hadoop/common/lib/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, $HADOOP_YARN_HOME/share/hadoop/yarn/*, $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*

For Windows: %HADOOP_CONF_DIR%, %HADOOP_COMMON_HOME%/share/hadoop/common/*, %HADOOP_COMMON_HOME%/share/hadoop/common/lib/*, %HADOOP_HDFS_HOME%/share/hadoop/hdfs/*, %HADOOP_HDFS_HOME%/share/hadoop/hdfs/lib/*, %HADOOP_YARN_HOME%/share/hadoop/yarn/*, %HADOOP_YARN_HOME%/share/hadoop/yarn/lib/*

yarn-site.xml

yarn.app.mapreduce.am.env

/usr/lib/hadoop/lib/native

Custom hadoop properties needed to be set for the native library with Hadoop distributions: CDH 5.3+, HDP2.2+, APACHE-2.6+

1

Job Scheduling and Prioritization

If you have configured a job scheduling/prioritization mechanism on your cluster (such as Fair Scheduler or Capacity Scheduler), you must decide which queue or pool Datameer should use (see the Hadoop Documentation ) for more information). To link Datameer to your scheduling mechanism, do the following:

  • Check your Hadoop configuration to see which property you should use to choose a pool for Datameer. For FairScheduler, this is mapred.fairscheduler.poolnameproperty, which is configured in conf/mapred-site.xml on your Hadoop cluster
  • Set this property globally for Datameer at Administration > Hadoop Cluster > Custom Property (e.g., mapred.fairscheduler.poolnameproperty="datameerPool")

Additional Resources