Using Compression with Hadoop and Datameer

Datameer FAQ About Compression

How do I configure Datameer/Hadoop to use native compression?

When working with large data volumes, native compression can drastically improve the performance of a Hadoop cluster. There are multiple options for compression algorithms. Each has its benefits; for example, GZIP is better in terms of disk space, while LZO is better in terms of speed.

  1. First determine the best compression algorithm for your environment (see the "compression" topic, under Hadoop Cluster Configuration Tips)
  2. Install native compression libraries (platform-dependent) on both the Hadoop cluster (<HADOOP_HOME>/lib/native/Linux-[i386-32 | amd64-64]) and the Datameer machine (<das install dir>/lib/native/Linux-[i386-32 | amd64-64] )
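Before configuring anything in Datameer, you may want to confirm that Hadoop actually picks up the native libraries. A minimal check using the standard Hadoop CLI; the library paths in the output vary by distribution:

hadoop checknative -a
# Each correctly installed codec (e.g. zlib, snappy) should be reported as "true",
# followed by the path of the shared library that was loaded.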


Configure the codecs to use as custom properties in the Hadoop Cluster section in Datameer:

mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

Custom Hadoop properties needed for the native library in CDH 5.3+, HDP2.2+, APACHE-2.6+:

tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
yarn.app.mapreduce.am.admin.user.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>

You can find the correct absolute path in your cluster settings. Depending on your distribution, the absolute path to the native library might look like one of the following examples:

/usr/lib/hadoop/lib/native
/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
/usr/hdp/current/hadoop/lib/native/

Or you can use a value such as:

LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:<absolute_path_to_native_lib>


How do I configure Datameer/Hadoop to use LZO native compression?

Add corresponding Java libraries to Datameer/Hadoop and follow the step-by-step guide below to implement the LZO compression codec.

Setup steps for LZO compression in Datameer

In order to let Datameer work with files compressed by LZO, it is required to add the corresponding Java libraries into the Datameer installation folder and point the application to the list of compression codecs to use.

The following steps are only applicable for the local execution framework on the respective Datameer Workgroup editions.

  1. Open the <datameer-install-path>/conf/das-conductor.properties file and ensure that lzo is listed in the das.import.compression.known-file-suffixes property value (it should be there by default).

    conf/das-conductor.properties

    ## Importing files ending with one of those specified suffixes will result in an exception if
    ## the proper compression-codec can't be loaded. This helps to fail fast and clear instead of displaying spaghetti data.
    das.import.compression.known-file-suffixes=zip,gz,bz2,lzo,lzo_deflate,Z
  2. Open the <datameer-install-path>/conf/das-common.properties file and add the values below under the section ## HADOOP Properties Additions.

    conf/das-common.properties

    io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec

    io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec
  3. Add liblzo2* libraries to <datameer-install-path>/lib/native/. Usually there are 3 files: *.so, *.so.2, and *.so.2.0.0. The LZO libs might already be installed somewhere on the machine; if they aren't, add them by installing the LZO-related packages as described below:

    Installing LZO packages
    Execute the following command on all the nodes in your cluster:

    For RHEL/CentOS/Oracle Linux:
    yum install lzo lzo-devel hadooplzo hadooplzo-native

    For SLES:
    zypper install lzo lzo-devel hadooplzo hadooplzo-native

    For Ubuntu/Debian:
    HDP support for Debian 6 is deprecated with HDP 2.4.2. Future versions of HDP will no longer be supported on Debian 6.
    apt-get install liblzo2-2 liblzo2-dev hadooplzo
  4. Add libgplcompression.* libraries to <datameer-install-path>/lib/native/[Linux-amd64-64/i386-32]. Usually there are 5 files: *.a, *.la, *.so, *.so.0, and *.so.0.0.0. The libraries might already be installed somewhere on the machine; if they aren't, they need to be added. Follow Hadoop gpl packaging or Install hadoop-gpl-packaging to do this.

    At this point, it might be required to move the library files directly into the native folder.

  5. Add the corresponding JAR library files to <datameer-install-path>/etc/custom-jars/. You could load hadoop-lzo-<version>.jar from hadoop-lzo and hadoop-gpl-compression-<version>.jar from hadoop-gpl.
  6. Restart Datameer.
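As a quick sanity check of the files added in steps 3 through 5, you can list the directories involved. A minimal sketch assuming a 64-bit Linux installation; DATAMEER_HOME is an example path, adjust it to your installation:

DATAMEER_HOME=/opt/datameer                      # example path, adjust to your installation
ls $DATAMEER_HOME/lib/native/                    # expect liblzo2.so, liblzo2.so.2, liblzo2.so.2.0.0
ls $DATAMEER_HOME/lib/native/Linux-amd64-64/     # expect libgplcompression.a/.la/.so/.so.0/.so.0.0.0
ls $DATAMEER_HOME/etc/custom-jars/               # expect hadoop-lzo-<version>.jar and hadoop-gpl-compression-<version>.jar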

Setup steps for LZO compression on a Hadoop cluster

In order to work with LZO compression on a Hadoop cluster, the corresponding Java libraries must be added to the Hadoop configuration folders and the appropriate properties set in the configuration files.

  1. Copy the liblzo2* and libgplcompression.* libraries from <datameer-install-path>/lib/native/ to the corresponding folder of the Hadoop distribution. Execute the command hadoop checknative -a to find Hadoop's native library location. Here are possible locations for different versions:

    hadoop checknative -a

    /usr/lib/hadoop/lib/native
    /opt/cloudera/parcels/CDH/lib/hadoop/lib/native
    /usr/hdp/current/hadoop/lib/native/

    For some configurations, the libgplcompression.* libraries should be moved from the [Linux-amd64-64/i386-32] folder directly to /lib/native. Ensure that all symlinks remain the same after copying.

  2. Copy appropriate JAR library files from <datameer-install-path>/etc/custom-jars/ to the corresponding /lib folder of the Hadoop distribution (this is usually the parent folder for the native library).
  3. Add the LZO codec information to the Hadoop configuration files by opening $HADOOP_HOME/conf/core-site.xml and appending the data provided below to the io.compression.codecs property.

    core-site.xml

    com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec 

    Then add the following directly after the io.compression.codecs property:

    core-site.xml

    <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>

  4. Add the following custom properties in Datameer on the Hadoop Cluster page:

    Datameer Custom Properties

    tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
    tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
    yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
    yarn.app.mapreduce.am.admin.user.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>

The above settings should be implemented on all cluster nodes. It is recommended to restart the cluster services after setup.
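On larger clusters, the copy steps above can be scripted. The following loop is only a sketch; the node names and target paths are hypothetical and must be adapted to your cluster layout:

# Hypothetical node names and paths -- adjust to your cluster.
for node in hadoop-node1 hadoop-node2 hadoop-node3; do
  scp /opt/datameer/lib/native/liblzo2* /opt/datameer/lib/native/libgplcompression* \
      "$node":/usr/lib/hadoop/lib/native/
  scp /opt/datameer/etc/custom-jars/hadoop-lzo-*.jar "$node":/usr/lib/hadoop/lib/
done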

How do I configure Datameer/Hadoop to use Snappy native compression?

The Snappy compression codec provides high-speed compression with a reasonable compression ratio. See the original documentation for more details.

  1. CDH3u1 and newer versions already contain the Snappy compression codec. This page contains the configuration instructions. In addition, Snappy is integrated into Apache Hadoop versions 1.0.2 and 0.23.
  2. When using Cloudera's distribution of Hadoop, the codec needs to be enabled inside the Datameer application, either in the Hadoop Cluster settings or on a per-job basis. Add the following settings:

    io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec
    mapred.output.compress=true
    mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
    mapred.output.compression.type=BLOCK

    If you encounter errors, the following custom Hadoop properties might need to be set for the native library in CDH 5.3+, HDP2.2+, APACHE-2.6+:

    yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native
    tez.am.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
    tez.task.launch.env=LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<absolute_path_to_native_lib>
    yarn.app.mapreduce.am.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
    yarn.app.mapreduce.am.admin.user.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native
    mapreduce.admin.user.env=LD_LIBRARY_PATH=$HADOOP_COMMON_HOME/lib/native:$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native

    Make sure matching versions of Snappy are installed in Datameer and on your Hadoop cluster (e.g. verify via checksums from the library files). Version mismatch can result in erroneous behavior during codec loading or job execution.

    As of Datameer Version 5.6

    To prevent out of memory errors, set das.job.container.memory-heap-fraction=0.7
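To verify the Snappy setup and the library version match mentioned above, a simple check on the cluster and on the Datameer host might look like the following; the paths are examples and depend on your installation:

hadoop checknative -a                             # the "snappy" line should report "true" with a library path
md5sum /opt/datameer/lib/native/libsnappy.so*     # compare these checksums ...
md5sum /usr/lib/hadoop/lib/native/libsnappy.so*   # ... against the cluster's copies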

Using 3rd Party Compression Libraries 

In general, Datameer can deal with all sorts of compression codecs that are supported by Hadoop, with the exception of EMR. The following examples show you how to enable compression.

Enabling LZO compression

For more information on where to download the LZO codec jar files, see the Apache Hadoop wiki - UsingLzoCompression page.

  1. Copy native LZO compression libraries to your distribution:

    cp /<path>/liblzo2* /<datameer-install-dir>/lib/native/Linux-i386-32
    

    or on 64 bit machines:

    cp /<path>/liblzo2* /<datameer-install-dir>/lib/native/Linux-amd64-64
  2. Copy codec jar file to etc/custom-jars

    cp /<path>/hadoop-gpl-compression-0.1.0-dev.jar etc/custom-jars
  3. Restart the conductor.

    bin/conductor.sh restart
  4. Make sure that the compression libraries are installed on your Hadoop cluster by following the documentation under Apache Hadoop wiki - UsingLzoCompression.

Enabling SNAPPY Compression

If you are using a Datameer version for Cloudera (in this example, CDH 5.2.1-MR1), the native libraries are already available on the Datameer host.

  1. Check native SNAPPY compression libraries on Datameer host.

    [datameer@datameer-node Datameer-5.3.0-cdh-5.2.1-mr1]$ ll lib/native/
    ... 
    -rw-r--r-- 1 datameer datameer 19848 23. Dez 06:39 libsnappy.so 
    -rw-r--r-- 1 datameer datameer 23904 23. Dez 06:39 libsnappy.so.1 
    -rw-r--r-- 1 datameer datameer 23904 23. Dez 06:39 libsnappy.so.1.1.3
  2. Make sure that the compression libraries are installed on your Hadoop cluster.

    [root@cloudera-cluster lib]# pwd 
    /usr/lib 
    [root@cloudera-cluster lib]# find . -name 'snappy*' 
    ./hadoop/lib/snappy-java-1.0.4.1.jar 
    ... 
    ./hadoop-mapreduce/lib/snappy-java-1.0.4.1.jar 
    ./hadoop-mapreduce/snappy-java-1.0.4.1.jar
  3. Copy the snappy-java-1.0.4.1.jar provided by Cloudera to the Datameer node and put the codec into the etc/custom-jars folder.

    [root@datameer-node]# cp snappy-java-1.0.4.1.jar /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars 
    [root@datameer-node]# chown datameer:datameer /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars/*.* 
    [root@datameer-node]# ll /opt/Datameer-5.3.0-cdh-5.2.1-mr1/etc/custom-jars 
    -rw-r--r-- 1 datameer datameer 960374 17. Okt 08:05 mysql-connector-java-5.1.34-bin.jar 
    -rw-r--r-- 1 datameer datameer 995968 23. Feb 05:16 snappy-java-1.0.4.1.jar
  4. Restart the conductor.

    bin/conductor.sh restart
  5. Check in logs/conductor.log that the SNAPPY plugin was loaded. Creating a Datalink and importing SNAPPY-compressed AVRO files is now possible.
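A minimal sketch of that check, run from the Datameer installation directory (file locations as used throughout this guide):

ls etc/custom-jars/ | grep -i snappy      # the codec jar should be listed
grep -i snappy logs/conductor.log         # look for the codec being loaded after the restart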

Hadoop LZO Compression on Linux

Installing Hadoop and lzo compression on Linux

You can find some precompiled lzo-libs in our jungledisk:

  • hadoop-lzo/mac_64 - for Mac OS X
  • hadoop-lzo/lzo-hadoop-libs-linux-.tar - for Linux 32/64 bit

Install the following:

  • Useful tools
  • lzo libs and headers:
  • liblzo2-2 and liblzo2-dev on Debian-based systems (apt-get/aptitude)
  • lzo and lzo-devel on Red Hat-based systems (rpm/yum)
  • Git
  • Java
  • Ant

You must change the hostnames and paths in the config files to match your environment.

Installing Hadoop-lzo (32 bit)

mkdir lzo
cd lzo
git clone http://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo
ant clean compile-native test tar
cd ../..

wget http://mirror.synyx.de/apache//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
tar xf hadoop-0.20.2.tar.gz
ln -s hadoop-0.20.2 hadoop
cd hadoop
cp ../lzo/hadoop-lzo/build/native/Linux-i386-32/lib/* lib/native/Linux-i386-32
cp ../lzo/hadoop-lzo/build/hadoop-lzo-0.4.9.jar lib/

Installing Hadoop-lzo (64 bit)

mkdir lzo
cd lzo
git clone http://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo
ant clean compile-native test tar
cd ../..

wget http://mirror.synyx.de/apache//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
tar xf hadoop-0.20.2.tar.gz
ln -s hadoop-0.20.2 hadoop
cd hadoop
cp ../lzo/hadoop-lzo/build/native/Linux-amd64-64/lib/* lib/native/Linux-amd64-64
cp ../lzo/hadoop-lzo/build/hadoop-lzo-0.4.9.jar lib/

Configuring Hadoop to use lzo

You have to set $JAVA_HOME for hadoop:

nano -w $HADOOP_HOME/conf/hadoop-env.sh
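The entry might look like the following; the JDK path is only an example and depends on where Java is installed on your system:

# In $HADOOP_HOME/conf/hadoop-env.sh -- example JDK location, adjust to your system
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64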

You need to modify core-site.xml / mapred-site.xml / hdfs-site.xml:

nano -w $HADOOP_HOME/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://alpha:9000</value>
     </property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/data/dfs/namesecondary</value>
  <description>Determines where on the local filesystem the DFS secondary
      name node should store the temporary images to merge.
      If this is a comma-delimited list of directories then the image is
      replicated in all of the directories for redundancy.
  </description>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/data/hadoop</value>
  <description>A base for other temporary directories.</description>
</property>

  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  </property>
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>

</configuration>
nano -w $HADOOP_HOME/conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>alpha:9001</value>
     </property>

<property>
  <name>mapred.local.dir</name>
  <value>/data/tmp/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files.  May be a comma-separated list of
  directories on different devices in order to spread disk I/O.
  Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>/data/mapred</value>
  <description>The shared directory where MapReduce stores control files.
  </description>
</property>

</configuration>
nano -w $HADOOP_HOME/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
  <description>
    If "true", enable permission checking in HDFS.
    If "false", permission checking is turned off,
    but all other behavior is unchanged.
    Switching from one parameter value to the other does not change the mode,
    owner or group of files or directories.
  </description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/data/dfs/data</value>
  <description>Determines where on the local filesystem an DFS data node
  should store its blocks.  If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices.
  Directories that do not exist are ignored.
  </description>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>/data/dfs/name</value>
  <description>Determines where on the local filesystem the DFS name node
      should store the name table(fsimage).  If this is a comma-delimited list
      of directories then the name table is replicated in all of the
      directories, for redundancy. </description>
</property>

</configuration>

Configure passwordless SSH access:

cd
ssh-keygen
cat .ssh/id_rsa.pub > .ssh/authorized_keys
ssh localhost
exit

Modify your .bashrc, .bash_profile:

cd
nano -w ~/.bashrc
export HADOOP_HOME=/home/tester/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
source .bashrc

Prepare Hadoop and run a test:

hadoop namenode -format
start-all.sh
hadoop jar /home/hadoop/hadoop/hadoop-0.20.2-test.jar TestDFSIO  -Dmapred.child.java.opts=-Xmx600M -write -nrFiles 100 -fileSize 10

Prerequisites for compiling LZO

  1. Get the latest master branch tarball from http://www.github.com/kevinweil/hadoop-lzo
  2. Ensure you have jdk, ant, and gcc installed on your box. The jdk version needs to be the same as the jdk deployed on the Hadoop cluster.
  3. Ensure you have lzo and lzo-devel installed on your compile box and on the Hadoop cluster.
  4. Unpack the tarball from the first step.
  5. Run ant package in the unpacked directory.
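Put together, the build might look like this on the command line; the tarball name is only an example and depends on the snapshot you downloaded:

tar xzf hadoop-lzo-master.tar.gz     # example file name for the downloaded tarball
cd hadoop-lzo-master
ant package                          # builds the jar and the native libraries under build/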

Setting Up lzo on Hadoop

  1. Copy build/hadoop-lzo*.jar to /path/to/hadoop/lib on all the Hadoop nodes
  2. Copy build/hadoop-lzo*/lib/native to /path/to/hadoop/lib/ on all the Hadoop nodes
  3. Modify core-site.xml and mapred-site.xml.

core-site.xml:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

mapred-site.xml:

<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

Restart the Hadoop cluster (dfs and mapred).
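With the 0.20-style control scripts used elsewhere in this guide, a restart from the master node might look like this; your distribution may use its own service scripts instead:

bin/stop-all.sh      # stops dfs and mapred
bin/start-all.sh     # starts them again with the new codec configuration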

Setting Up lzo on das

  1. Copy build/hadoop-lzo*.jar to /path/to/das/etc/custom-jars on the conductor.
  2. Copy build/hadoop-lzo*/lib/native to /path/to/das/lib/ on the conductor. No configuration modification is needed for das to pick up lzo compression.
  3. Restart das.

LZO Compression on a Mac

Contact your Datameer customer support representative. 

Make a Working Directory

mkdir $workdir
cd $workdir

On a different OS, get a C compiler if you don't have one

sudo apt-get install build-essential

Get the LZO library

wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.03.tar.gz
tar xzf lzo-2.03.tar.gz
cd lzo-2.03
./configure --prefix=/opt/local
make
sudo make install
cd ..

Download and install the LZO libraries

sudo aptitude search lzo
You see something like this:
>  liblzo2-2
>   liblzo2-dev
>   lzop

You might need to perform this if the install fails:

sudo aptitude update
(or) sudo apt-get update


Install this with:

sudo aptitude install liblzo2-2 liblzo2-dev

Build hadoop-lzo

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.7.0/Home
sudo apt-get install git
git clone git://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo
sudo apt-get install ant
ant clean compile-native test tar
cd ..

Install and configure Hadoop

wget http://mirror.synyx.de/apache//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
tar xf hadoop-0.20.2.tar.gz
ln -s hadoop-0.20.2 hadoop
cd hadoop

Download and install the lzo-libraries as above, for each Hadoop node.

Change JAVA_HOME

nano -w conf/hadoop-env.sh
nano -w conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
nano -w conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
nano -w conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
cd $workdir/hadoop-lzo
mkdir ../hadoop/lib/native/Mac_OS_X-x86_64-64
cp build/native/Mac_OS_X-x86_64-64/lib/* ../hadoop/lib/native/Mac_OS_X-x86_64-64/
cp build/hadoop-lzo-0.4.4.jar ../hadoop/

Formatting HDFS

cd $workdir/hadoop
bin/hadoop namenode -format

Starting Hadoop

bin/start-all.sh

Install and configure das

Install and configure das as shown in the Installation Guide, but don't start the application:

cd $workdir/hadoop-lzo
cp build/hadoop-lzo-0.4.4.jar ../das-0.40/etc/custom-jars/
mkdir ../das-0.40/lib/native/Mac_OS_X-x86_64-64
cp build/native/Mac_OS_X-x86_64-64/lib/* ../das-0.40/lib/native/Mac_OS_X-x86_64-64/

Now you can start the application:

cd $workdir/das-0.40/
bin/conductor.sh start

Configure the application to use LZO compression, as described above, then test on the command line and with das.
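A simple command-line test might look like the following; test.txt is any local text file, and the HDFS path is just an example:

cd $workdir/hadoop
lzop test.txt                                        # creates test.txt.lzo
bin/hadoop fs -mkdir /lzo-test
bin/hadoop fs -copyFromLocal test.txt.lzo /lzo-test/
bin/hadoop fs -cat /lzo-test/test.txt.lzo | head     # prints compressed (binary) content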

Hadoop, Hive, Datameer, and LZO

Environment

Create a directory on your desktop that contains the Datameer distribution, Hadoop distribution, and Hive distribution:

cd ~/Desktop
mkdir DAS

Set up Java to include header files:

sudo ln -s /System/Library/Frameworks/JavaVM.framework/Headers /System/Library/Frameworks/JavaVM.framework/Versions/1.7/Home/include

Install LZO dependencies

LZO C-Libraries

You need the following LZO c-libraries:

  1. LZO c-binaries to be able to (de)compress data on your system
  2. LZOP a cmd tool to (de)compress files

Do this with HomeBrew or MacPorts

For HomeBrew use:

brew search lzo
brew install lzo
brew install lzop

Create a symlink on your machine to link lzo libraries into another folder:

cd /usr/local
ln -s /usr/local/Cellar/lzo/2.06 lzo64

For MacPorts use:

sudo port install lzo2
sudo port install lzop

LZO Hadoop-Libraries

For HomeBrew use:

cd ~/Desktop/DAS
git clone git://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo
ant clean compile-native test tar

For MacPorts use:

cd ~/Desktop/DAS 
git clone git://github.com/kevinweil/hadoop-lzo.git 
cd hadoop-lzo
env JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.7/Home/ \
C_INCLUDE_PATH=/opt/local/include \
LIBRARY_PATH=/opt/local/lib \
CFLAGS="-arch x86_64" ant clean compile-native test tar

Find the Java library under:

build/hadoop-lzo-0.4.15/hadoop-lzo-0.4.15.jar

Find the native libraries under:

build/hadoop-lzo-0.4.15/lib/native/Mac_OS_X-x86_64-64

Install and set up Hadoop and Hive with LZO

Install and set up a pseudo Hadoop distribution

Download and extract Hadoop 0.20.2:

cd ~/Desktop/DAS
curl -O http://apache.easy-webs.de/hadoop/common/hadoop-0.20.2/hadoop-0.20.2.tar.gz
gunzip hadoop-0.20.2.tar.gz 
tar -xf hadoop-0.20.2.tar
rm hadoop-0.20.2.tar
cp -r hadoop-0.20.2 pseudo-lzo-hadoop-0.20.2
cd pseudo-lzo-hadoop-0.20.2 

Edit an env script and add Java Home:

conf/hadoop-env.sh
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.7/Home

Edit xml config files to create a pseudo-Hadoop that works with hdfs and add lzo support:

conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

<property>
  <name>fs.checkpoint.dir</name>
  <value>YOUR_USER_HOME/Desktop/DAS/hadoop-data/dfs/secondary</value>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>YOUR_USER_HOME/Desktop/DAS/hadoop-data/tmp</value>
</property>


<property>
    <name>io.compression.codecs</name>       <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  </property>

<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
  
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

</configuration>
conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>YOUR_USER_HOME/Desktop/DAS/hadoop-data/dfs/data</value>
</property>

<property>
  <name>dfs.name.dir</name>
  <value>YOUR_USER_HOME/Desktop/DAS/hadoop-data/dfs/name</value>
</property>

</configuration>
conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

 <property>
   <name>mapred.job.tracker</name>
   <value>localhost:9001</value>
 </property>

<property>
  <name>mapred.local.dir</name>
  <value>YOUR_USER_HOME/Desktop/DAS/hadoop-data/mapred/local</value>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>YOUR_USER_HOME/Desktop/DAS/hadoop-data/mapred/system</value>
</property>

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>


</configuration>

Copy the native libs from the hadoop-lzo project into the native lib folder of the Hadoop dist:

 mkdir ~/Desktop/DAS/pseudo-lzo-hadoop-0.20.2/lib/native/Mac_OS_X-x86_64-64
 cp ~/Desktop/DAS/hadoop-lzo/build/hadoop-lzo-0.4.15/lib/native/Mac_OS_X-x86_64-64/* ~/Desktop/DAS/pseudo-lzo-hadoop-0.20.2/lib/native/Mac_OS_X-x86_64-64

Copy the Hadoop LZO jar into the lib folder of your Hadoop dist:

cp ~/Desktop/DAS/hadoop-lzo/build/hadoop-lzo-0.4.15/hadoop-lzo-0.4.15.jar ~/Desktop/DAS/pseudo-lzo-hadoop-0.20.2/lib

Format your file system and start Hadoop:

cd ~/Desktop/DAS/pseudo-lzo-hadoop-0.20.2
bin/hadoop namenode -format
bin/start-all.sh

Compress a file and upload it to HDFS

Download the file books.txt and compress it with lzop to get an .lzo file:

lzop books.txt

An output file called books.txt.lzo should be created. Upload this file to HDFS. Later, create an import job for this file to verify that Hadoop and Datameer work together.

 bin/hadoop fs -mkdir /books
 bin/hadoop fs -copyFromLocal books.txt.lzo /books/

To import this file into a Hive table, create an index file:

bin/hadoop jar lib/hadoop-lzo-0.4.15.jar com.hadoop.compression.lzo.LzoIndexer /books/books.txt.lzo

When you print the file with Hadoop, you should see compressed (not human-readable) content:

bin/hadoop fs -cat /books/books.txt.lzo

Install and set up Hive and connect to the LZO cluster

Download Hive and extract the archive to your ~/Desktop/DAS folder. Open the conf/hive-default.xml file and change the following property:

conf/hive-default.xml
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>YOUR_USER_HOME/Desktop/DAS/hive-warehouse</value>
</property>

Copy the hadoop-lzo.jar to the lib folder of the Hive installation:

cp ~/Desktop/DAS/hadoop-lzo/build/hadoop-lzo-0.4.15/hadoop-lzo-0.4.15.jar ~/Desktop/DAS/hive-0.7.1/lib/

Create a table that supports lzo and link it to the books.txt.lzo file:

cd ~/Desktop/DAS/hive-0.7.1
export HADOOP_HOME=~/Desktop/DAS/pseudo-lzo-hadoop-0.20.2 
bin/hive
CREATE EXTERNAL TABLE books(userId BIGINT,userName STRING,books ARRAY<STRING>,orderDetails MAP<STRING, STRING>) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '-' MAP KEYS TERMINATED BY '.' STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat" OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat" LOCATION '/books';

Verify that Hive is working with your connected Hadoop cluster. You should see non-compressed output instead of lzo content.

select * from books;

Start Hive as a service. First, quit the Hive console by pressing CTRL-C.

bin/hive --service metastore

Install and set up Datameer and connect to the LZO cluster

Download and install a Datameer distribution and copy the native Hadoop lzo libs and the Hadoop lzo jar into the related folders using the following code:

cd ~/Desktop/DAS/hadoop-lzo
cp build/hadoop-lzo-0.4.15.jar ~/Desktop/DAS/das/etc/custom-jars/
mkdir ~/Desktop/DAS/das/lib/native/Mac_OS_X-x86_64-64
cp build/native/Mac_OS_X-x86_64-64/lib/* ~/Desktop/DAS/das/lib/native/Mac_OS_X-x86_64-64/

Edit the das-common.properties file to add the compression information, and use the hidden-file-suffixes property to exclude files that shouldn't be imported into Datameer, such as the .index file you created before.

conf/das-common.properties
io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec
io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec
das.import.hidden-file-suffixes=.index

Start Datameer and connect to your pseudo-Hadoop cluster.

bin/conductor.sh start

On the Administration tab, connect to your Hadoop cluster:

  • Create a file import job to the lzo folder that exists on your cluster under /books/*. You should see no garbage in the preview and snapshot view.
  • Create a Hive import job and import the books table. You should see no garbage in the preview and snapshot mode.

Building Hadoop Native Libs on OSX

Problem

One pain point associated with developing for Hadoop on OS X is lack of native binary support. This can be especially irritating as there is a bug with the non-native GZip codec which blocks DAS on OSX from reading SequenceFiles with BLOCK or RECORD compression enabled. If you are trying to process a SequenceFile and see an exception such as the following:

Caused by: java.io.EOFException

at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:249)
at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:239)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)

Then you are likely running into this issue.

The 64-bit OSX native libs are now checked into the project and distributed with DAS, so building the native libs shouldn't be necessary anymore unless you are running on 32 bit OSX.

Solution

You must have Xcode installed, including the command-line development tools, in order to build these libraries.

Download the Apache Hadoop distribution (0.20.2) and extract the archive somewhere on your machine.

tar xzvf hadoop-0.20.2.tar.gz
cd hadoop-0.20.2/

Download the patch file from https://issues.apache.org/jira/browse/HADOOP-3659 and place it in the root of the Hadoop distribution (hadoop-0.20.2). Next, apply the patch and compile the native libraries:

cd hadoop-0.20.2/
patch -p0 < hadoop-0.20-compile-native-on-mac.patch
ant -Dcompile.native=true compile

This process generates the native libraries in a directory under build/native:

dobbsferry:hadoop-0.20.2$ ls build/native/Mac_OS_X-x86_64-64/lib/
Makefile libhadoop.1.0.0.dylib libhadoop.1.dylib libhadoop.a libhadoop.dylib libhadoop.la

Now, pass the location of these libs to DAS on startup:

ANT_OPTS="-Xmx512m -XX:MaxPermSize=384m -Ddb.mode=mysql -Djava.library.path=~/resources/hadoop-0.20.2/build/native/Mac_OS_X-x86_64-64/lib/" ant jetty.run

Notes

This process has only been tested and verified with the Apache Hadoop distribution. See http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/NativeLibraries.html for more details on native libraries.