I'm on a project where we are interfacing with Pivotal HD, EMC's Hadoop distribution. Pivotal differentiates itself from the more traditional Hadoop distributions by adding HAWQ (essentially Greenplum running on HDFS), USS (a remote-datasource-to-Hadoop translator), and GemFire (an in-memory DB); in short, it is "Hadoop with Greenplum on top". HAWQ, the distributed SQL query engine, looks extremely nice and brings all the maturity of Greenplum with it. On another project, I am working around some unsupported methods in the Hive JDBC driver. With HAWQ, by contrast, I am able to use the standard Postgres JDBC driver (even psql from the RHEL/CentOS repos connects fine).
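For anyone curious what "use the standard Postgres JDBC driver" looks like in practice, here is a minimal sketch. It assumes the stock PostgreSQL JDBC driver jar is on the classpath; the host, port, database name, and user below are placeholders I made up for a local VM, not values from my setup. Nothing in it is HAWQ-specific, which is rather the point.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class HawqJdbcSketch {
    // Builds an ordinary PostgreSQL JDBC URL; no HAWQ-specific syntax is involved.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details for a local VM; adjust for your environment.
        String url = jdbcUrl("localhost", 5432, "postgres");
        String user = "gpadmin"; // assumed user name
        String password = "";    // assumed empty password

        // try-with-resources closes the result set, statement, and connection in reverse order.
        try (Connection conn = DriverManager.getConnection(url, user, password);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT version()")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

Running it needs a reachable HAWQ master and the driver jar; without the jar, `DriverManager.getConnection` throws an `SQLException` complaining that no suitable driver was found.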
So far all my testing has been on local VMs so I don't have a feel for HAWQ performance. Once my work transitions to the cluster it will be interesting to see how it performs.
I normally shy away from closed source projects when there are good open source alternatives, since I really like being able to grab the source and take a look at what is happening (among other reasons). I would be using Pivotal for this project regardless, since that is out of my hands. However, HAWQ seems to have a lead in SQL implementation over its competitors (Stinger/Hive on Tez, Impala, and Presto).
Hopefully I'll have enough time to look at how a custom PXF extension (roughly equivalent to a Hive SerDe) compares to a custom InputFormat plus Hive SerDe for a binary format like a telco switch CDR (call detail record) file.