Spark Tuning Concepts


Tuning Background

To make sure Spark runs correctly, efficiently, and quickly with your instance of Datameer, you might have to tune your Spark settings. Tuning is the process of adjusting those settings to account for the memory, cores, and instances available on your system, so that Spark performs optimally and resources don't become a bottleneck. When tuning, adjust each property and setting based on your organization's cluster and workload characteristics.

When done effectively, tuning:

  • Makes sure all resources are used efficiently.
  • Eliminates jobs that run unnecessarily long.
  • Reduces overall processing time.
  • Ensures jobs are sent to the correct execution engine.

This concepts guide gives you a definition of each term and property you need to know to adjust your Spark settings appropriately. Use this guide to tune Spark's performance given your specific environment and settings.

Spark Architecture and Datameer

Datameer has several different modes (or execution engines) for Spark: SparkSX, SparkCluster, and SparkClient. To review how these modes work and when you might choose one over the other, refer to the Smart Execution with Spark documentation. For more information about how a Spark cluster works, refer to Spark's documentation.

Spark Terminology

To understand how to tune Spark, you need to understand the related terminology, including terms that refer to your environment and terms that are specific to Spark tuning. While some of these terms aren't used on this page, they are used in our Spark Tuning Guide.

Environment terms

These terms are related to your environment. Keep each of these settings in mind to make sure you're tuning correctly.

Vcores (or cores)

Vcores stands for virtual cores. A vcore is one share of a host's central processing unit (CPU); the number of vcores is configurable and usually matches the number of physical CPU cores in use. Datameer initially allocates one vcore per container, but this number can be raised.

Master node

One node in a cluster where the resource manager runs. Used in the Spark Tuning Guide.

Worker node

A node in a cluster that can run application code and where a node manager runs. A cluster has one or more worker nodes. Used in the Spark Tuning Guide.

Spark tuning terms

These terms describe settings that can be adjusted to tune Spark based on your environment. 

Cgroups

Cgroups is a Linux mechanism for partitioning tasks into hierarchical groups, which means containers can be limited in their resource or CPU usage. Used in the Spark Tuning Guide.

Executors

A YARN container on a worker node that runs tasks and keeps data either in memory or on disk storage.

Instances

For Spark, an instance is an executor, so the number of instances is the number of executors. As a starting value, set the number of instances equal to the number of cores.

Auto-scale

Auto-scaling removes or adds executors based on the size of the workload.

Resource Manager

A YARN service that knows where worker nodes are located on the cluster and how many resources each has. It runs the resource scheduler, which assigns jobs to resources. Used in the Spark Tuning Guide.

Node Manager

A YARN service that runs on each worker node, tracks the containers located on that node, and determines how many containers are necessary for a job. Used in the Spark Tuning Guide.

DominantResourceCalculator

A resource calculator which does the math to decide where to allocate resources. The DominantResourceCalculator considers the CPU and required memory when performing calculations. Used in the Spark Tuning Guide.

DefaultResourceCalculator

A resource calculator which does the math to decide where to allocate resources. The DefaultResourceCalculator considers only the required memory when performing calculations.

Max idle time

The amount of time the SparkClient context can remain idle before it expires.

Tasks

A unit of work that is sent to one executor. 

Application Master

A framework-specific element of YARN that executes and monitors containers and tracks which resources they are consuming. Used in the Spark Tuning Guide.

Splits

A split is a section of data that is divided from the rest and sent to a specific node for processing.

Spark Tuning Settings

The following settings and properties can help you improve Spark performance and utility. A list of all properties used for Spark can be found under Advanced Spark Configuration.

Das properties

das.execution-framework

Description: Sets whether Spark runs using SparkSX, SparkClient, or SparkCluster. 

Affects: Determines which execution framework is used.

When to adjust: Use SparkClient if you have a small cluster. Use SparkCluster for large workloads. Use SparkSX for clusters with mixed workloads. SparkSX determines whether to use SparkClient or SparkCluster based on the number of workbooks.
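
For illustration, the property is set as a single custom entry like the sketch below; the key comes from this guide, but the value string is an assumption and should be checked against your Datameer version:

    # Assumed example: let Smart Execution choose between SparkClient and SparkCluster
    das.execution-framework=SparkSX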

das.map-tasks-per-node

Description: Used to calculate the optimal number of splits based on the number of physical nodes.

Affects: How many map tasks are set per node.

When to adjust: Set the map tasks per node to the number of vcores available on each cluster node, so that all vcores are used effectively.
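
As a sketch, assuming worker nodes that each expose 16 vcores (an illustrative figure, not a recommendation):

    # Assumed example: worker nodes with 16 available vcores each
    das.map-tasks-per-node=16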

das.spark.context.auto-scale-enabled

Description: Auto-scaling is dynamic resource allocation provided by Datameer. For Spark, having auto-scale enabled means that each job dynamically allocates as much memory as necessary, depending on the data size of the job.

Affects: When auto-scale is enabled, there are resources and executors available for other applications or other Datameer jobs. When auto-scale is disabled, you need to configure the number of instances manually.

When to adjust: You can turn auto-scale off if you are only using SparkClient and need to save resources for other uses. Otherwise, this setting should always be enabled.
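
A minimal sketch, assuming the property takes a boolean value:

    # Keep auto-scaling enabled (boolean value format assumed)
    das.spark.context.auto-scale-enabled=true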

das.spark.context.max-idle-time

Description: The SparkClient context times out and shuts down after being idle for the amount of time this property is set to. If the property is set to 0, there is no context caching.

Affects: Every time the Spark context times out, it requires more time and resources to start up again. If previous instances can be reused, this startup time is decreased.

When to adjust: Increase the idle time to reduce startup costs and keep the context cached longer, but only if jobs are submitted infrequently. Keep the idle time low to make sure that resources can be used by other execution frameworks or programs.
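
As an illustration only; the value and its unit are assumptions, so confirm the expected format for your Datameer release:

    # Assumed example: keep the SparkClient context cached between jobs
    # (unit assumed to be seconds; 0 would disable context caching entirely)
    das.spark.context.max-idle-time=600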

Spark properties

spark.executor.cores

Description: Determines the number of cores allocated to each executor.

Affects: How much computation power each executor has. One core should be left for the application master.

When to adjust: If Spark jobs are running slowly, you can try increasing the number of cores. However, slow performance is often caused by memory errors rather than by too few cores.
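
A hedged example; the core count is illustrative, not a recommendation:

    # Assumed example: 4 cores per executor, leaving capacity for the application master
    spark.executor.cores=4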

spark.executor.memory

Description: The amount of memory allocated for each executor process.

Affects: The performance of Spark. When Spark executors are allotted more memory, they perform faster.

When to adjust: Resources should be left free for the application master. To figure out this value, divide the total memory available to Spark by the maximum number of instances and leave some overhead memory (about 10%). Raise the memory to increase speed for Spark and to avoid out-of-memory errors.
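
To make that calculation concrete, here is a sketch with assumed figures: 64 GB of node memory available to Spark and at most 6 executors per node.

    # Assumed figures: 64 GB per node for Spark, at most 6 executors per node
    # 64 GB / 6 ≈ 10.6 GB; minus roughly 10% overhead ≈ 9 GB per executor
    spark.executor.memory=9g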

spark.executor.instances

Description: When auto-scale is enabled, this value is the minimum number of Spark executors kept alive for Spark. These executors are kept alive when all executors are idle. One Spark executor is equivalent to one YARN container. When auto-scale is disabled, this is the exact number of worker instances for Spark.

Affects: How many instances are created automatically or assigned. With more instances, Spark jobs perform faster because the resources can work in parallel. 

When to adjust: If auto-scale is enabled, this property's default value is 1, since that is the minimum number of instances required. If auto-scale is disabled, then this sets a fixed number of instances for use by Spark jobs. Increase this value to see positive performance impacts. Keep it lower if resources should be shared between Spark and other execution engines.
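
As an illustration, with auto-scale disabled you might pin a fixed executor count; the number below is an assumption for an example cluster, not a recommendation:

    # Assumed example: fixed pool of 10 executors when auto-scaling is disabled
    das.spark.context.auto-scale-enabled=false
    spark.executor.instances=10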

Framework properties

If necessary, you can split out the properties by prefixing them with framework.sparkclient. or framework.sparkcluster. to apply different settings to SparkClient and SparkCluster, as shown in the sketch below. Any setting can be divided this way, but the settings listed here in particular are used for the Spark Tuning Guide.
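
For illustration, a split configuration might look like the following; the values are placeholders, not recommendations:

    # Assumed example: dynamic allocation for SparkCluster, a small fixed pool for SparkClient
    framework.sparkcluster.das.spark.context.auto-scale-enabled=true
    framework.sparkclient.das.spark.context.auto-scale-enabled=false
    framework.sparkclient.spark.executor.instances=4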

framework.sparkclient.das.spark.context.auto-scale-enabled

Description: Determines if auto-scale is enabled for SparkClient.

Affects: When auto-scale is enabled, it means that there are resources and executors available for other applications.

When to adjust: Turn off if you instead want to allocate a fixed amount of resources to SparkClient. Turn on if these executors should be dynamically allocated.

framework.sparkcluster.das.spark.context.auto-scale-enabled

Description: Determines if auto-scale is enabled for SparkCluster.

Affects: When auto-scale is enabled, it means that there are resources and executors available for other applications.

When to adjust: Turn off if you instead want to allocate a fixed amount of resources to SparkCluster. Turn on if these executors should be dynamically allocated.

framework.sparkclient.spark.executor.instances

Description: When auto-scale is enabled, this value is the minimum number of Spark executors kept alive for SparkClient. These executors are kept alive when all executors are idle. One Spark executor is equivalent to one YARN container. When auto-scale is disabled, this is the exact number of worker instances for SparkClient.

Affects: How many instances are created automatically or assigned. With more instances, Spark jobs perform faster because the resources can work in parallel. 

When to adjust: If auto-scale is enabled, this property's default value is 1, since that is the minimum number of instances required. If auto-scale is disabled, then this sets a fixed number of instances for use by Spark jobs. Increase this value to see positive performance impacts. Keep it lower if resources should be shared between Spark and other execution engines.

framework.sparkcluster.spark.executor.instances

Description: When auto-scale is enabled, this value is the minimum number of Spark executors kept alive for SparkCluster. These executors are kept alive when all executors are idle. One Spark executor is equivalent to one YARN container. When auto-scale is disabled, this is the exact number of worker instances for SparkCluster.

Affects: How many instances are created automatically or assigned. With more instances, Spark jobs perform faster because the resources can work in parallel.

When to adjust: If auto-scale is enabled, this property's default value is 1, since that is the minimum number of instances required. If auto-scale is disabled, then this sets a fixed number of instances for use by Spark jobs. Increase this value to see positive performance impacts. Keep it lower if resources should be shared between Spark and other execution engines.

framework.sparkcluster.das.spark.context.max-executors

Description: Maximum number of instances for each SparkCluster job.

Affects: If auto-scale is enabled, this limits the number of instances that can be used for each cluster job.

When to adjust: Adjust if auto-scale is enabled and you need to limit the number of instances used.

framework.sparkclient.das.spark.context.max-executors

Description: Maximum number of instances for each SparkClient job.

Affects: If auto-scale is enabled, this limits the number of instances that can be used for each SparkClient job.

When to adjust: Adjust if auto-scale is enabled and you need to limit the number of instances used.
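
As a sketch, capping auto-scaled jobs might look like this; the limits are placeholders, not recommendations:

    # Assumed example: cap the executors each auto-scaled job can claim
    framework.sparkcluster.das.spark.context.max-executors=20
    framework.sparkclient.das.spark.context.max-executors=5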

Datameer auto-scaling properties

These properties are required for Datameer, and they should be changed only if the cluster setup or queue settings for Datameer have changed.

datameer.yarn.available-node-memory

Description: The amount of memory per node available to Spark for auto-scaling.

Affects: How much memory Datameer can use for auto-scale.

When to adjust: This property should be set to match the memory for each worker node.

datameer.yarn.available-node-vcores

Description: The number of vcores per node available to Spark for auto-scaling.

Affects: How many vcores Datameer can use for auto-scale on each cluster node.

When to adjust: This property should be set to match the number of cores for each worker node.
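
As a sketch, for worker nodes with (assumed) 64 GB of memory and 16 vcores each, the pair might be set as follows; the memory unit is an assumption, so verify the expected format for your release:

    # Assumed worker nodes: 64 GB of memory (value assumed to be in MB) and 16 vcores each
    datameer.yarn.available-node-memory=65536
    datameer.yarn.available-node-vcores=16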

Recommended Settings

Datameer recommends specific settings to help you utilize Spark as an execution engine. To see these recommendations, refer to the Spark Tuning Guide.