Scheduling Jobs with the Cluster
To help optimize jobs and priorities, you can set how jobs are scheduled for the Hadoop cluster.
If not using impersonation, you can set the scheduling of jobs for specific cluster queues at either a global or per job level.
- Open the Admin tab.
- Select Hadoop Cluster from the side menu.
- Add the following property in the Custom Properties space.
- Navigate through the job wizard when setting up or configuring a job.
- Add the properties listed above in the Custom Properties space.
Datameer users running impersonation don't need to set any scheduling properties in Datameer. Jobs coming from Datameer are already labeled, and all queue configuration is done on the Hadoop cluster itself.
Finding the Optimal Split Size/Split Count
The optimal split size and count for a Hadoop job is calculated by Hadoop from the values for max/min split size and max/min split count.
The default values are:
For a job with an input size of 89 MB (473,774,773 records), Hadoop would use a split count of 5, so each split is close to the minimum split size (16 MB).
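The arithmetic behind this split count can be sketched as follows (an illustration only, assuming integer division against the 16 MB minimum split size; the exact rounding Hadoop applies may differ):

```python
# Sketch: deriving the split count from the input size and the
# default minimum split size of 16 MB.
MB = 1024 * 1024

input_size = 89 * MB       # job input size from the example above
min_split_size = 16 * MB   # assumed default minimum split size

split_count = input_size // min_split_size
print(split_count)  # 5
```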
If you change the split size and split count recommendations for Hadoop to
Hadoop arrives at the following values:
The split size has been rounded, resulting in 83 splits rather than 89. In general this looks good, but it shows that the minimum split size is a major factor in choosing the right values.
What about the map task capacity of the Hadoop cluster for this job run? The cluster used in this example has a map task capacity of 28, meaning it runs at most 28 map tasks in parallel. A maximum split count of only 5 therefore isn't optimal. A split count of 83 gives better parallelism, but it also creates overhead in terms of split size and map task creation/communication.
Optimization would include the reduction of tasks by utilizing all nodes in the cluster.
Let's try a split count of 28.
Is it best to set the number of splits equal to the total map task capacity? This can result in very large splits, which might lead to extremely long-running tasks that then block other Hadoop tasks, leaving Hadoop unable to optimize task execution.
A better approach is to have a split count that is a multiple of the map task capacity. In this case the cluster is scaled properly, and the time a task blocks the Hadoop cluster is reduced.
InputSize is the size of the input data for a specific job, e.g., for an import job it is the size of the imported data, and for a workbook it is the size of the data resulting from an import job. The mapTaskCapacity is a property of the Hadoop cluster the job runs on. To find the optimal split size, calculate the splitSize value using the optimal multiplier. The multiplier can be calculated using the formula below.
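The reasoning above can be sketched in code: pick a split count that is a multiple of the map task capacity, then derive the split size from it (a minimal illustration of the relationship between these values, not Datameer's or Hadoop's actual implementation; the function name and integer division are assumptions):

```python
# Sketch: splitCount = multiplier * mapTaskCapacity,
# so splitSize = inputSize / (multiplier * mapTaskCapacity).
MB = 1024 * 1024

def split_size(input_size, map_task_capacity, multiplier=1):
    """Split size (bytes) when the split count is a multiple of capacity."""
    return input_size // (multiplier * map_task_capacity)

# Example from this section: 89 MB input, map task capacity 28.
print(split_size(89 * MB, 28))  # 3332973 bytes, roughly 3 MB per split
```

With a multiplier of 1, the 89 MB example lands at roughly 3 MB per split, which matches the split size used in the setting example below.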
Set the split size (in bytes) in the Hadoop properties section for the job.
If you calculated a split size of 3 MB, set it with the following commands.
- Job input data size: 89MB
- Map task capacity: 28
- Job input record count: 473774773
- Job input data size: 1MB
- Map task capacity: 28
- Job input record count: 94932
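Applying the same arithmetic to both example jobs above shows why the second, much smaller job should not be spread across the full map task capacity (a sketch under the same integer-division assumption as before):

```python
# Sketch: per-split size for both example jobs at map task capacity 28.
MB = 1024 * 1024

for input_size in (89 * MB, 1 * MB):
    per_split = input_size // 28
    print(input_size // MB, "MB input ->", per_split, "bytes per split")

# The 1 MB job would get ~37 KB splits, far below the 16 MB default
# minimum; the per-task overhead would outweigh any gain in parallelism.
```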
Logging File Split Resolution
In order to view the split files in the log:
- Start setting up the Import Job
- Under Scheduling, add the line below in the text area for setting up Hadoop properties.