
(warning) This page describes only the minimal set of steps needed for Datameer to operate properly in your Hadoop environment. For a more complete guide to Hadoop cluster design, configuration and tuning, please see Hadoop Cluster Configuration Tips or the additional resources section below.

1. User Identity

Your Hadoop cluster is accessed by Datameer as a particular Hadoop user. By default, the user is identified as the UNIX user who launched the Datameer application (equivalent to the UNIX command 'whoami'). To ensure this works properly, you should create a user of the same name within Hadoop's HDFS for Datameer to use exclusively for scheduling and configuration. This ensures the proper permissions are set, and that this user will be recognized whenever the Datameer application interacts with your Hadoop cluster.

The username used to launch the Datameer application can be configured in <Datameer folder>/etc/das-env.sh

2. Permissions in HDFS

When running Datameer with an on-premise Hadoop cluster (called "distributed mode"), you need to define an area (folder) within HDFS for Datameer to store its private data. The Hadoop user created for Datameer should have read/write access to this folder.

Hadoop permissions can be set with the following commands. Note that permission checking can be enabled/disabled globally for HDFS. See Configuring Datameer in a Shared Hadoop Cluster for more information.
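For example, the private folder and its ownership could be set up as follows. The folder /user/datameer and the user name datameer are assumptions; substitute the names you chose in step 1:

```shell
# Example only -- /user/datameer and the user "datameer" are assumptions.
hadoop fs -mkdir /user/datameer            # create Datameer's private folder
hadoop fs -chown datameer /user/datameer   # make the Datameer user its owner
hadoop fs -chmod 700 /user/datameer        # read/write for that user only
```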

If you experience permission problems with the staging directory when submitting jobs as a user other than the superuser, there are two possible solutions:

  1. Configure mapreduce.jobtracker.staging.root.dir to /user in the mapred-site.xml of the cluster.
  2. Or, change the permissions of the hadoop.tmp.dir (usually /tmp) inside the hdfs to 777.

See the Hadoop documentation for additional information.
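As a sketch of the second option, assuming hadoop.tmp.dir is the default /tmp inside HDFS:

```shell
# Option 2 above: open the HDFS staging/tmp area to all users.
# (Option 1 is a cluster-side change: set mapreduce.jobtracker.staging.root.dir
# to /user in mapred-site.xml instead.)
hadoop fs -chmod -R 777 /tmp
```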

3. Local UNIX permissions

Various components of your Hadoop and Datameer environment (data nodes, task trackers, JDBC drivers, the Datameer application server) use a folder for temporary storage or "scratch" space. Additionally, HDFS files are sometimes stored in this location by default (see the Hadoop documentation). For each machine (the Datameer server, Hadoop master and slaves), ensure that the local UNIX user running these processes (often "hadoop" or "datameer", depending on the component) has read/write permissions for these folders, as well as for /tmp (commonly used by some components). Different components (e.g. Hive) can use different, configurable locations for scratch space; check your configuration files for details.

In particular, Datameer makes use of the directory mapred.local.dir defined in your Hadoop configuration. Check that Datameer has write access to this folder, both on the Datameer server and cluster nodes. If not properly configured, you may see cryptic exceptions like the one reported here: https://issues.apache.org/jira/browse/MAPREDUCE-635
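Write access can be sanity-checked with a small shell probe like the one below. The path used is an assumption; read the real mapred.local.dir value from your mapred-site.xml:

```shell
# Report whether the given scratch directory is writable by the current user.
check_writable() {
  if touch "$1/.write-test" 2>/dev/null; then
    rm "$1/.write-test"
    echo "writable: $1"
  else
    echo "NOT writable: $1"
  fi
}

# Assumed value of mapred.local.dir -- check your mapred-site.xml for the real one.
check_writable /tmp/hadoop/mapred/local
```

Run this as the same user that runs the Datameer or Hadoop process, on the Datameer server and on each cluster node.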

Take care that the Datameer application is consistently run by the same user, so that log files and other temporary data are not written as root and then locked when Datameer is subsequently started by another user. Such a mix of file owners can lead to inconsistent states which are difficult to troubleshoot (without access to the log files).

4. Network Connectivity

Hadoop clusters are often secured behind firewalls, which may or may not be configured to allow Hadoop direct access to all data sources. The systems containing data to be analyzed (web servers, databases, data warehouses, hosted applications, message queues, mobile devices, etc.) are often in remote locations and reachable only through specific TCP ports. Normally, Datameer is installed in the same LAN as Hadoop and therefore suffers from the same connectivity restrictions, which can prevent you from using Datameer. Conversely, if Datameer is located further away from Hadoop, it might have connectivity to the data but not to the Hadoop NameNode and JobTracker, for the same reasons. Most of these issues are not anticipated early in the life cycle of a Hadoop project, so firewalls need to be reconfigured. Finally, the speed and latency of the network links between Datameer, Hadoop, and your data can severely affect system performance and the end-user experience. For these reasons, it's important to thoroughly review all network connectivity requirements up front.

Before you start using Datameer, check that the following network connectivity is available on the specified ports:

  1. Datameer to Hadoop NameNode client port: (normally 8020, 9000 or 54310)
  2. Datameer to Hadoop JobTracker client port: (normally 8021, 9001 or 54311)
  3. Datameer to Hadoop DataNodes (normally ports 50010 and 50020)
  4. (Recommended): Datameer to Hadoop Administration consoles (normally 50030 and 50070)

Depending on your configuration, the following connectivity may also be necessary:

  1. Importing/exporting via SFTP:
    1. Datameer to source (normally port 22)
    2. Hadoop slaves to source (normally port 22)
  2. Importing/linking/exporting to/from an RDBMS:
    1. Datameer to RDBMS via JDBC (various ports, see your DB documentation)
    2. Hadoop slaves to RDBMS via JDBC (various ports, see your DB documentation)
  3. Connecting to Hive:
    1. Hive server (Thrift) port (normally 10000)
    2. Location of external Hive tables, if applicable (e.g. S3 tables via port 443)
  4. Connecting to HBase:
    1. Zookeeper client port (normally 2181)
    2. HBase master (normally port 60000; web UI on 60010)
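Reachability of these ports can be spot-checked from the Datameer host with a loop like the following. The hostnames and ports are placeholders; substitute those of your own cluster:

```shell
# Test each host:port pair for TCP reachability; requires netcat (nc).
for endpoint in namenode:8020 jobtracker:8021 datanode1:50010; do
  host=${endpoint%%:*}    # text before the colon
  port=${endpoint##*:}    # text after the colon
  if nc -z -w 5 "$host" "$port" 2>/dev/null; then
    echo "OK   $endpoint"
  else
    echo "FAIL $endpoint"
  fi
done
```

Repeat from a Hadoop slave node for the SFTP/RDBMS source checks listed above.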

5. Installed Compression Libraries

For standard Hadoop compression algorithms, you can choose the algorithm Datameer should use. However, if your Hadoop cluster is using a non-standard compression algorithm such as LZO, you will need to install these libraries onto the Datameer machine. This is necessary so that Datameer can read the files it writes to HDFS, and decompress files residing on HDFS which you wish to import. Libraries which utilize native compression require both a Java (JAR) and native code component (UNIX packages). The Java component is a JAR file which needs to be placed into <Datameer folder>/etc/custom-jars. See Frequently Asked Questions#Q. How do I configure Datameer/Hadoop to use native compression? for more details.
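For example, the LZO Java component might be installed like this. The source path is an assumption that depends on where your distribution places the hadoop-lzo JAR, and $DATAMEER_HOME stands for your <Datameer folder>:

```shell
# Copy the LZO codec JAR into Datameer's custom-jars folder
# (source path is an assumption; locate the JAR in your distribution).
cp /usr/lib/hadoop/lib/hadoop-lzo-*.jar "$DATAMEER_HOME"/etc/custom-jars/
```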

Note

The configuration of Hadoop compression can drastically affect Datameer performance. See Hadoop Cluster Configuration Tips for more information.

6. Connecting Datameer to your Hadoop cluster

By default, Datameer is not connected to any Hadoop cluster and operates in "LOCAL" mode, with all analytics and other functions performed by a local instance of Hadoop. This is useful for prototyping with small data sets, but not for high-volume testing or production. To connect Datameer to a Hadoop cluster, change the settings on the Administration - Hadoop Cluster page in Datameer: click the Administration tab at the top of the page, click the Hadoop Cluster tab in the left column, then click Edit and change the Mode setting.

Note: These values can also be configured via /conf/live.properties. See /conf/default.properties for guidance.

For a video demonstration, see Setting up HDFS in the Show Me How series.

Note that the "Datameer Root Directory" must be set to the HDFS folder chosen in step 2, as seen below:

While running, Datameer stores imported data, sample data, workbook results, logs, and other information in this area (including subfolders such as importjobs, jobjars, temp). You need to make sure no other application interferes with data in this area of HDFS.

7. Mandatory Hadoop settings

Hadoop settings ("properties") can be configured by Datameer in three places:

  1. Global settings in Datameer, located under the Administration tab in Hadoop Cluster
  2. Per-job properties, configured for individual import/analytics jobs
  3. Property files (under /conf)

Setting the following properties ("propertyname=value") ensures that Datameer runs properly on your Hadoop cluster:

Caution

Datameer sets numerous Hadoop properties in the /conf folder (specifically das-job.properties) for performance and other reasons. Do not change these properties without a clear understanding of what they do AND advice from Datameer, as altering them can cause Datameer jobs to fail.

WARNING

If any settings for the Hadoop cluster (mapred-site.xml, hdfs-site.xml, etc.) are marked final, they override mandatory and optional Datameer settings, and Datameer may not work properly, or at all. Please verify that the final flag is set to false for all settings configured by your Hadoop administrator or by your Hadoop distribution's defaults.
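For reference, this is what a final-flagged property looks like in mapred-site.xml; while final is true, no client (including Datameer) can override the value:

```xml
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>true</value>
  <!-- final=true blocks per-job overrides, including Datameer's.
       Set this to false (or remove the element) for any property
       that Datameer needs to manage. -->
  <final>true</final>
</property>
```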

Name: mapred.map.tasks.speculative.execution
      mapred.reduce.tasks.speculative.execution
Value: false
Description: (warning) Datameer currently does not support speculative execution. However, you do not need to configure these properties cluster-wide, because Datameer disables speculative execution for every job it submits. You must ensure that these properties are not set to 'true' cluster-wide with the final parameter also set to 'true', which would prevent a client from changing them on a job-by-job basis.
Location: das-job.properties

Name: mapred.job.reduce.total.mem.bytes
Value: 0
Description: Datameer turns off the in-memory shuffle, which could otherwise lead to an 'out of memory' exception in the reduce phase.
Location: das-job.properties

Name: mapred.map.output.compression.codec
      mapred.output.compression.codec
Value: usually one of:
      org.apache.hadoop.io.compress.DefaultCodec
      org.apache.hadoop.io.compress.GzipCodec
      org.apache.hadoop.io.compress.BZip2Codec
      com.hadoop.compression.lzo.LzoCodec
      com.hadoop.compression.lzo.LzopCodec
Description: Datameer cares about the what and the when of compression, but not about the how: it uses the codec you have configured for your cluster. If you would like to change the codec to another available one, set these properties in Datameer; otherwise, Datameer uses the default in mapred-site.xml on the cluster. Furthermore, if you have configured a non-standard codec like LZO, it is necessary to install this codec on the machine running Datameer. See Frequently Asked Questions#Q. How do I configure Datameer/Hadoop to use native compression?
Location: mapred-site.xml

8. Highly Recommended Hadoop settings

Name: mapred.map.child.java.opts
      mapreduce.reduce.java.opts
      (hadoop-0.21, cdh3b3 only)
Value: -Xmx512M or more
Description: By default, Datameer is configured to work with a minimum 512 MB heap. Based on your slot configuration, you may have significantly more memory available per JVM.
Location: mapred-site.xml

Name: mapred.child.java.opts
      (<=hadoop-0.21)
Value: -Xmx512M or more
Description: By default, Datameer is configured to work with a minimum 512 MB heap. Based on your slot configuration, you may have significantly more memory available per JVM.
Location: mapred-site.xml

Name: mapred.tasktracker.dns.interface
      dfs.datanode.dns.interface
Value: (interface name)
Description: Hadoop nodes often have multiple network interfaces (internal vs. external). Explicitly choosing an interface can avoid problems.
Location: mapred-site.xml / hdfs-site.xml

Name: java.net.preferIPv4Stack
Value: true
Description: Disabling IPv6 in Java can resolve problems and improve throughput. Set this property to ensure IPv6 is not used.
Location: Can be configured in the $HADOOP_OPTS environment variable on your Hadoop cluster.

Name: mapreduce.jobtracker.staging.root.dir
Value: /user
Description: Avoids permission problems when a user other than the superuser schedules jobs. (hadoop-0.21, cdh3b3 only)
Location: mapred-site.xml (on the cluster side - does not work if configured as a custom property in Datameer)

9. Job Scheduling and Prioritization

If you have configured a job scheduling/prioritization mechanism on your cluster (such as the Fair Scheduler or Capacity Scheduler), you must decide which queue or pool Datameer should use (see the Hadoop documentation for more information). To link Datameer to your scheduling mechanism, do the following:

  • Check your Hadoop configuration to see which property you should use to choose a pool for Datameer. For FairScheduler, this is mapred.fairscheduler.poolnameproperty, which is configured in conf/mapred-site.xml on your Hadoop cluster
  • Set this property globally for Datameer at Administration -> Hadoop Cluster -> Custom Property (e.g. mapred.fairscheduler.poolnameproperty="datameerPool")
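On the cluster side, the pool itself is defined in the Fair Scheduler allocations file. A minimal sketch, in which the pool name datameerPool and the minimum share values are assumptions to adapt to your cluster:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Pool reserved for jobs submitted by Datameer (name is an example
       and must match the value set for poolnameproperty in Datameer) -->
  <pool name="datameerPool">
    <minMaps>4</minMaps>
    <minReduces>2</minReduces>
  </pool>
</allocations>
```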

10. Additional Resources
