Saturday, March 22, 2014

Pivotal HD pre-2.0 HDFS-Fuse workaround


I found an issue with HDFS-Fuse on Pivotal HD 1.1 (and the workaround). Hopefully this saves someone a few minutes of troubleshooting.

With a simple
/usr/bin/hadoop-fuse-dfs dfs://MyNameNode:8020 /mnt/hdfs

I received
bash: /usr/bin/hadoop-fuse-dfs: /bin/base: bad interpreter: No such file or directory

There are some typos in the file /usr/bin/hadoop-fuse-dfs. The following sed commands fix the file:
#Make a backup copy first
cp -p /usr/bin/hadoop-fuse-dfs /tmp/hadoop-fuse-dfs.`date +'%Y%m%d.%H%M'`.bak

#Now fix typos
sed -i 's|^#!/bin/base|#!/bin/bash|' /usr/bin/hadoop-fuse-dfs
sed -i 's|^/sbin/modprob fuse|/sbin/modprobe fuse|' /usr/bin/hadoop-fuse-dfs
sed -i '/^# Autodetect JAVA_HOME if not definedif \[ -e \/usr\/libexec\/bigtop-detect-javahome \]; then/ a\
if [ -e /usr/libexec/bigtop-detect-javahome ]; then
' /usr/bin/hadoop-fuse-dfs
sed -i '/^. \/usr\/lib\/bigtop-utils\/bigtop-detect-javahome/ a\
fi
' /usr/bin/hadoop-fuse-dfs
sed -i 's|\\\$|$|g' /usr/bin/hadoop-fuse-dfs
sed -i 's|\\`|`|g' /usr/bin/hadoop-fuse-dfs
sed -i 's|\$/|/|g' /usr/bin/hadoop-fuse-dfs
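If you want to sanity-check the substitution patterns before touching the real wrapper script, you can run the two simple seds against a scratch file that reproduces the broken lines (the temp file below is a stand-in, not the actual /usr/bin/hadoop-fuse-dfs):

```shell
# Scratch file containing the two one-line typos from the broken script
tmp=$(mktemp)
printf '#!/bin/base\n/sbin/modprob fuse\n' > "$tmp"

# Same substitutions as applied to /usr/bin/hadoop-fuse-dfs above
sed -i 's|^#!/bin/base|#!/bin/bash|' "$tmp"
sed -i 's|^/sbin/modprob fuse|/sbin/modprobe fuse|' "$tmp"

# Should now show the corrected shebang and modprobe lines
cat "$tmp"
rm -f "$tmp"
```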
Now hadoop-fuse-dfs works properly:
/usr/bin/hadoop-fuse-dfs dfs://MyNameNode:8020 /mnt/hdfs

ls /mnt/hdfs/
apps  benchmarks  hive  mapred  test  tmp  user  yarn
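With the wrapper script fixed, the mount can also be made persistent across reboots via /etc/fstab. A sketch of the entry (the NameNode host and mount options here are assumptions; check the hadoop-fuse-dfs documentation for your distribution before using it):

```
# /etc/fstab entry (sketch - host, port, and options are assumptions)
hadoop-fuse-dfs#dfs://MyNameNode:8020 /mnt/hdfs fuse allow_other,usetrash,rw 2 0
```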


Pivotal support said this should be fixed in Pivotal HD 2.0; until then, the simple sed workaround works fine.

Another option for machines that are not part of the cluster (e.g., client machines accessing HDFS) is to use HDFS-Fuse from the Bigtop repo; however, this may make support from Pivotal more difficult if issues arise.

Saturday, March 1, 2014

Testing Pivotal HD

I'm on a project where we are interfacing with EMC's Hadoop distribution, Pivotal HD. Pivotal differentiates itself from the more traditional Hadoop distributions by adding HAWQ (essentially Greenplum running on HDFS), USS (a translator that exposes remote data sources to Hadoop), and GemFire (an in-memory database) - essentially it is "Hadoop with Greenplum on top". The distributed SQL query engine HAWQ looks extremely nice and brings all the maturity of Greenplum with it. On another project, I am working around some unsupported methods in the Hive JDBC driver; with HAWQ, I am able to use the standard Postgres JDBC driver (even psql from the RHEL/CentOS repos connects fine).
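For reference, connecting with stock Postgres tooling looks like this; the host, user, and database below are placeholders for illustration, not values from a real cluster (HAWQ follows Greenplum/Postgres conventions):

```
# Stock psql client against a HAWQ master (host/user/db are hypothetical)
psql -h hawq-master.example.com -p 5432 -U gpadmin -d postgres -c 'SELECT 1;'

# Equivalent standard Postgres JDBC URL (same placeholder host)
jdbc:postgresql://hawq-master.example.com:5432/postgres
```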

So far all my testing has been on local VMs so I don't have a feel for HAWQ performance. Once my work transitions to the cluster it will be interesting to see how it performs.

I normally shy away from closed-source projects when there are good open source alternatives, since I really like being able to grab the source and take a look at what is happening (among other reasons). I would be using Pivotal for this project regardless, since that decision is out of my hands. However, HAWQ seems to have a lead on SQL implementation versus the competitors (Stinger/Hive on Tez, Impala, and Presto).

Hopefully I'll have enough time to look at how a custom PXF plugin (roughly equivalent to a Hive SerDe) compares to a custom InputFormat and Hive SerDe for a binary format like a telco switch CDR file.