For Datameer v6.1, Spark is only supported on Cloudera, Hortonworks, MapR Native Secure, and Apache Hadoop distributions.
Spark Execution Frameworks
When Smart Execution™ is enabled, it uses SparkSX mode by default. SparkSX chooses between two Spark modes, SparkClient and SparkCluster, and delegates to one of them based on the size of the job. Very large workloads (over 100 GB) are delegated to Tez instead of Spark. Workloads under 10 GB are delegated to SparkClient, and workloads between 10 GB and 100 GB are delegated to SparkCluster. When the input data size can't be determined, SparkSX selects SparkCluster.
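The delegation rules above can be sketched as a small function. This is an illustrative model of the thresholds described in the text, not Datameer's actual implementation:

```python
# Illustrative sketch of SparkSX framework selection based on estimated
# input size, following the thresholds documented above.

GB = 1024 ** 3

def choose_framework(input_size_bytes=None):
    """Return the execution framework SparkSX would delegate to.

    input_size_bytes is None when the input size cannot be determined.
    """
    if input_size_bytes is None:
        return "SparkCluster"      # unknown size: SparkSX picks SparkCluster
    if input_size_bytes > 100 * GB:
        return "Tez"               # very large workloads go to Tez
    if input_size_bytes < 10 * GB:
        return "SparkClient"       # small workloads reuse the shared client pool
    return "SparkCluster"          # 10-100 GB: dedicated YARN application

print(choose_framework(5 * GB))    # SparkClient
print(choose_framework(50 * GB))   # SparkCluster
print(choose_framework(500 * GB))  # Tez
```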
SparkClient is a YARN client mode used for workloads under 10 GB of data. It caches and reuses a small shared virtual cluster of Spark executors across Datameer jobs; in SparkClient mode, executors are shut down after a configurable idle time. SparkCluster is a YARN cluster mode that runs a single Datameer job and is used for workloads under 100 GB of data. In SparkCluster mode, executors are maintained for the life of the Datameer job and shut down once the job completes.
Once Datameer jobs are compiled and sent to YARN, they are known as workloads and run in a pool labeled "Datameer SparkClient". Datameer continually pushes all new jobs into that pool, including jobs that are unrelated to each other. Because this pool is a shared resource, the job names of the Datameer workloads running inside it are not listed. SparkClient keeps its Java virtual machines running while workloads are using them, similar to Tez session pooling.
SparkClient can't be used when secure impersonation is enabled for Datameer, because the YARN application must be owned by the Datameer job owner and can't be shared between users. Additionally, SparkClient might require the Datameer server to be configured with more heap space, since some file merging logic is performed on the Datameer server side. This could amount to a few hundred megabytes per concurrent job. These limitations are offset by the fact that SparkClient uses fewer cluster resources and reduces job setup latency when you're running many small to medium-sized workloads.
SparkCluster mode uses the YARN cluster deployment mode. Each Datameer job run with this execution framework spawns a separate Java virtual machine on the Datameer server machine through Spark's SparkLauncher API. Currently, SparkCluster mode doesn't support sharing YARN applications across multiple Datameer jobs; one Spark YARN application is spawned per job. This execution framework also allows Datameer's dynamic resource allocation to grow the YARN application when necessary. The application and all executors are deallocated once the Datameer job completes.
SparkCluster works when Datameer is configured for secure impersonation. This mode doesn't require the Datameer server to be configured with more heap space, but it still requires the machine Datameer runs on to have enough memory to support each concurrent Java virtual machine spawned by SparkCluster. Despite the startup time of the Spark YARN application and executors, this execution framework is faster than MapReduce in all cases and faster than Tez in most cases. To support many SparkCluster jobs running concurrently, the memory allocated to each of these SparkSubmit JVMs defaults to a very low value:
das.spark.launcher.spark-submit-opts=-XX:MaxPermSize=48m -Xmx16m. If the SparkSubmit JVM fails with an out-of-memory error, this memory setting might need to be increased.
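If those failures occur, the heap can be raised through the same custom property. The values below are illustrative, not a recommendation; size them to your own concurrency level:

```
das.spark.launcher.spark-submit-opts=-XX:MaxPermSize=96m -Xmx64m
```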
Basic Spark Configuration
For more information about all of the properties Spark offers, see the Advanced Spark Configuration page.
To use Spark, you must have:
- A Smart Execution license
- A YARN-based Hadoop cluster (version 2.2 or higher)
- If using more than one core per Spark executor, Datameer recommends enabling the DominantResourceCalculator for the cluster. (When using a cluster manager, enabling CPU scheduling often turns on DominantResourceCalculator.)
- Cluster nodes with enough memory to run Spark executors concurrently with other workloads. Here are some recommendations:
- Executors with four VCores and 16.5 GB of base memory
- VCores chosen based on the number of local disks; memory sized to minimize garbage-collection costs
- More medium-sized executors is usually better than a few very large ones.
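The executor sizing above can be expressed with the standard Spark properties for executor cores and memory. These are example values only; how they interact with SparkSX's own resource settings depends on your cluster configuration:

```
spark.executor.cores=4
spark.executor.memory=16g
```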
Configure Spark in SparkSX mode (recommended)
- Go to the Admin tab > Hadoop Cluster and click Edit.
- In the Custom Properties section, enter the property das.execution-framework=SparkSX.
The SparkSX framework has default settings for the available resources on the cluster. To run SparkSX jobs efficiently on Spark, you need to change the following defaults based on your cluster configuration:
- Default: 8 (auto as of v6.3). The number of cores available on each node of the cluster.
- Default: 8g (auto as of v6.3). The amount of memory available on each node of the cluster.
SparkSX determines the number of available nodes from the Hadoop Cluster configuration. Based on resources available on each node, as specified by the above two properties, it allocates as many worker nodes as necessary for the job.
Note: As of Datameer 6.3, the following properties were renamed and now default to auto.
Setting these to auto uses the available memory or VCores of the cluster node with the lowest memory or VCore configuration, so you no longer have to configure the properties manually based on your cluster. When the properties are set to auto, the job logs show the cluster values; when they are set to a specific value, the logs show the configured amount.
If you are using HDP, set the HDP version properties in Custom Properties.
- Click Save.
Configure Spark for all jobs
- Go to the Admin tab > Hadoop Cluster and click Edit.
- In the Custom Properties section, enter das.execution-framework=SparkSX.
- Click Save.
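After step 2, the Custom Properties field contains the single line:

```
das.execution-framework=SparkSX
```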
Configure Spark for specific jobs
- Go to the File Browser tab and right-click a workbook.
- Under Advanced, enter das.execution-framework=SparkSX in the Custom Properties section.
- Click Save.
Verify that jobs are running
To verify that your jobs are running, you can view the Job Details page.
Security with Spark
Keep in mind the following security limitations:
Configure SparkSX with a secure Hadoop Kerberos cluster or a native secure MapR cluster
Once the SparkSX framework decides to run a Datameer job using the SparkCluster execution framework, Datameer obtains the necessary delegation tokens for the user who is running the job.
No extra configuration is needed to use a Kerberos-secured Hadoop cluster or a native secure MapR cluster with SparkSX. The existing configuration on the Datameer side, specified on the Hadoop Admin page to connect to a secure Hadoop cluster, is sufficient.
If you want to run in secure impersonation mode, there are a few extra configuration steps required:
- Follow the steps under Preparing Datameer Installation for Secure Impersonation
For MapR, set the
Spark Tuning Guide
See the Spark Tuning Guide page for more information about how to make sure Spark is running efficiently for your system.
Advanced Spark Configuration
If you're interested in further configuration, you can find a list of configuration properties and their default values on the Advanced Spark Configuration page.
You can use the Spark history server to troubleshoot and reconstruct the UI of a finished application using the application's event logs.
Enable Spark event logging (server side)
Set the following properties in Datameer to enable event logging:
|Property|Description|
|spark.eventLog.compress|Whether to compress logged events, if spark.eventLog.enabled is true.|
|spark.eventLog.dir|Base directory in which Spark events are logged, if spark.eventLog.enabled is true.|
|spark.eventLog.enabled|Determines whether to log Spark events, useful for reconstructing the Web UI after the application has finished.|
|spark.yarn.historyServer.address|The address of the Spark history server, e.g., host.com:18080.|
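A minimal event-logging configuration using the standard Spark property names might look like the following. The HDFS path and hostname are placeholders; substitute your own values:

```
spark.eventLog.enabled=true
spark.eventLog.compress=true
spark.eventLog.dir=hdfs:///var/log/spark
spark.yarn.historyServer.address=historyhost:18080
```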
Run history server (daemon side)
Make sure you run the history server command as a user with permission to read the files in the Spark event log directory.
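On a stock Spark distribution, the history server daemon is typically started as follows. The log directory shown is an example; spark.history.fs.logDirectory must point at the same directory Spark writes event logs to:

```
# Point the history server at the Spark event log directory, then start it.
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///var/log/spark"
$SPARK_HOME/sbin/start-history-server.sh
```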
Issues and notes
- The history server only displays completed Spark jobs. One way to signal the completion of a Spark job is to stop the Spark context explicitly.
- The history server doesn't show applications submitted after it has started up.
- Loading all the applications can take a long time. Look at the history server logs to see the progress or errors.
- You can also run the history server on your local machine and configure it to read the logs from your remote HDFS.
- Jobs running on Spark have their samples and column metrics merged in the Datameer Conductor JVM, so it needs a bit more memory based on how many jobs are running concurrently.
- The SparkCluster eventually times out and shuts down when it has been idle (by default, one minute). To reduce startup costs, it might make sense to increase the idle timeout. The idle timeout can be configured with the
- Job progress information isn't as detailed as with Tez or MapReduce. In some cases the job progress jumps from zero directly to 100 percent. The job logs have detailed information on what is running.
- If multiple SparkCluster jobs are submitted around the same time, sometimes they stay in a running state indefinitely. If this happens, check the application master logs for this message:
"YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources"
If this message is printed, the jobs have maxed out the cluster resources and are deadlocked. In this situation, the jobs must be killed from the YARN cluster interface. To avoid this situation, or to prevent more jobs from being submitted when the cluster is near full capacity, set the maximum concurrent jobs to a lower value on the Hadoop Cluster page.