This topic provides tips on configuring Hadoop and Datameer for use in a shared cluster.
Tips on Configuring Hadoop
Hadoop assumes that the identity of the user who launches Datameer is also the identity to be used on the Hadoop cluster. To set the user name up correctly, create a user on the Hadoop cluster with the same name as the user who launches Datameer. That user must have permission to read and write the Datameer private folder on the cluster.
If you don’t set up the user name correctly, you might experience the following problems:
- HDFS permission errors when you attempt to run jobs (Datameer tries, and fails, to manipulate files in the Datameer private folder, or incorrectly attempts to manipulate the HDFS root user's directory).
- Jobs submitted to the wrong work queue (so they don't receive the appropriate priority), or rejected outright.
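The cluster-side account setup can be sketched as follows (the user name `datameer` is an assumption; use the actual name of the user who launches Datameer):

```shell
# On the Hadoop cluster: create a user account whose name matches
# the user who launches Datameer (the name "datameer" is an example)
sudo useradd --create-home datameer
```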
See the Hadoop documentation for additional information.
When using Unix-like systems:
- The user name is the equivalent of `whoami`
- The group list is the equivalent of `bash -c groups`
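In other words, the Hadoop client derives the identity from the local shell environment. A minimal illustration:

```shell
# On Unix-like systems, the identity Hadoop uses is what these commands report:
user="$(whoami)"           # user name submitted with the job
grps="$(bash -c groups)"   # group list used for permission checks
echo "user=${user} groups=${grps}"
```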
Note that user name and group resolution happen on the client side: an intruder who manipulates this resolution, for example through path manipulation, can bypass access permission checking.
Configuration of Access Permissions
Define permissions in HDFS
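The HDFS-side setup for the Datameer private folder can be sketched as follows (the path `/datameer` and the user and group names are assumptions; adjust them to your installation):

```shell
# Create the Datameer private folder on HDFS and restrict it
# to the Datameer user (path and names are examples)
hadoop fs -mkdir /datameer
hadoop fs -chown -R datameer:datameer /datameer
hadoop fs -chmod -R 700 /datameer
```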
Datameer Export Jobs to HDFS
Datameer allows you to export data to remote filesystems. In the New Export Job wizard, at the Data Details step, you can enable the clear output directory option. Be aware that this deletes all existing data in the output directory, including data written there by other applications.
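A quick safety check before enabling the option is to list the planned output directory (the export path below is a hypothetical example):

```shell
# Anything listed here would be deleted when
# "clear output directory" is enabled (example path)
hadoop fs -ls /exports/shared
```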
If you have configured a job scheduler on your cluster, you can configure which queue or pool Datameer should use. For example, if you have set up the fair scheduler (see Apache FairScheduler), do the following:
- Check which property needs to be set to configure the pool. This Hadoop property is defined in `conf/mapred-site.xml` of your Hadoop installation.
- Set this property in Datameer under Administration > Hadoop Cluster > Custom Properties to the pool that Datameer should use.
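As an illustration, with the Hadoop 1 fair scheduler the cluster side declares which job property carries the pool name (the value `pool.name` and the pool name `datameer` below are assumptions, not values from this document):

```xml
<!-- conf/mapred-site.xml on the cluster: which job property names the pool -->
<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>
```

In Datameer, you would then add `pool.name=datameer` under Administration > Hadoop Cluster > Custom Properties so that all Datameer jobs are scheduled into the `datameer` pool.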
Using Data from Other Applications
Datameer can read files on HDFS generated by other MapReduce jobs and applications on top of Hadoop. Depending on the format of the files produced by other jobs, writing a plug-in for Datameer might or might not be required.
See the custom import plug-in definition and the Datameer Plug-in Tutorial for additional information on creating a custom plug-in.
- HDFS Permissions Guide: https://hadoop.apache.org/docs/stable1/hdfs_permissions_guide.html
- Security in Hadoop, Part 1, from the blog 'Big Data and etc.': http://bigdata.wordpress.com/2010/03/22/security-in-hadoop-part-1
- Fair Scheduler: http://hadoop.apache.org/docs/stable1/fair_scheduler.html