Wednesday, December 31, 2014

Nostalgia in Clustering

The close of 2014 reminded me of an old clustering project I did around 2004 on a shoestring budget. The project correlated customers into families, with a sub-task of deduplicating customer records (duplicates caused by typos and other data issues). The entire project team was… me.

I gathered up a server with a couple of old hard drives to act as the MySQL server and PXE boot server, plus four other computers PXE-booting into Linux with openMosix for clustering. I didn't have the budget for cases for the four slaves, so I mounted them on old cookie sheets. I used wooden dowels to fix two cookie-sheet nodes together so they could sit vertically.

My processing was done in Perl. Once openMosix reported that a slave was free, a Perl process would spawn and grab a workload from MySQL, and openMosix would migrate the process to the free slave.

Fortunately, I was able to complete the project with only four slaves. I had figured out that each power supply could power two slaves, and I was working on converting a couple of ATX power supply extension cables into a Y-splitter so that each "cookie" would need only one power supply.

I found some pictures of the nodes from an old presentation:


Sunday, July 6, 2014

Custom Writable

I had never tackled a custom Writable before. I am a huge Avro (http://avro.apache.org/) fan, so I usually try to get my data converted to Avro early. A discussion got me interested in tackling one, and since I had my Bouncy Castle ASN.1 Hadoop example open, I extended that into a basic custom Writable example.

This thread by Oded Rosen was invaluable:
http://mail-archives.apache.org/mod_mbox/hadoop-general/201005.mbox/%3CAANLkTinzP8-nnGg8Q5aaJ8gXCCg6Som7e8Xarc_2PGDD@mail.gmail.com%3E
(also at http://osdir.com/ml/general-hadoop-apache/2010-05/msg00073.html if above is down)

I put the code in the package com.awcoleman.BouncyCastleGenericCDRHadoopWithWritable on GitHub.

The basics from the thread above and a bit of other reading are:
If your class will only be used as a value and not a key, implement the Writable interface.
If your class will be used as a key (and possibly a value), implement the WritableComparable interface (which extends Writable).

A Writable must have three things:
An empty constructor. There can be other constructors with arguments, but there must be a no-argument one as well.
An overridden write method to write its fields out.
An overridden readFields method to populate an object from the output of a previous write.

Hadoop reuses Writable objects, so clearing all fields at the start of readFields (before repopulating them) prevents stale values from a previous record causing surprises.

WritableComparable adds to Writable:
An overridden hashCode method, which is used to partition keys across reducers.
An overridden compareTo method.

The advice given in the 'How to write a complex Writable' thread adds:
Override the equals method.
Implement RawComparator for your type. This post (http://vangjee.wordpress.com/2012/03/30/implementing-rawcomparator-will-speed-up-your-hadoop-mapreduce-mr-jobs-2/) has an example that extends WritableComparator, which implements RawComparator.

In my example on GitHub, I only tested Writable, since I pull out individual fields and wrap them as Text or LongWritable for the keys.
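
To make the list above concrete, here is a minimal sketch of a value-only Writable along those lines. This is only an illustration (the class and field names are made up), not the actual class in my repo:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Minimal illustration of the rules above; field names are made up for this example.
public class SimpleCdrWritable implements Writable {

    private long recordNumber = 0;
    private String callingNumber = "";

    // Rule 1: a no-argument constructor so Hadoop can instantiate it via reflection.
    public SimpleCdrWritable() {}

    public SimpleCdrWritable(long recordNumber, String callingNumber) {
        this.recordNumber = recordNumber;
        this.callingNumber = callingNumber;
    }

    // Rule 2: serialize the fields.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(recordNumber);
        out.writeUTF(callingNumber);
    }

    // Rule 3: repopulate from a previous write(). Clear the fields first,
    // since Hadoop reuses Writable instances between records.
    @Override
    public void readFields(DataInput in) throws IOException {
        recordNumber = 0;
        callingNumber = "";
        recordNumber = in.readLong();
        callingNumber = in.readUTF();
    }
}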


Wednesday, July 2, 2014

Processing ASN.1 Call Detail Records with Hadoop (using Bouncy Castle) Part 3

Finally we get to the Hadoop Map/Reduce job...

We created the data and a simple decoder to test it, so now we can take that decoding logic and put it in a RecordReader.

The InputFormat we create is very simple: it sets isSplitable to false and uses our RecordReader, named RawFileRecordReader.
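
A rough sketch of what that InputFormat looks like (the class name here is just illustrative; see the repo for the real one):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class RawFileAsBinaryInputFormat extends FileInputFormat<Text, LongWritable> {

    // Each ASN.1 file is decoded as a whole, so never split it.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, LongWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new RawFileRecordReader();
    }
}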


The RecordReader does the bulk of the work.


RawFileRecordReader simply returns the filename and the count of the ASN.1 records in the file. We can change that to something more useful in a later post.
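
Roughly, the reader looks like the sketch below (simplified; the version in the repo handles a few more details):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.bouncycastle.asn1.ASN1InputStream;

public class RawFileRecordReader extends RecordReader<Text, LongWritable> {

    private final Text key = new Text();
    private final LongWritable value = new LongWritable(0);
    private ASN1InputStream asnin;
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        Path path = fileSplit.getPath();
        Configuration conf = context.getConfiguration();
        FSDataInputStream in = path.getFileSystem(conf).open(path);
        key.set(path.getName());
        asnin = new ASN1InputStream(in);
    }

    // Emits a single key/value pair per file: (filename, number of top-level ASN.1 objects).
    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) return false;
        long count = 0;
        while (asnin.readObject() != null) {
            count++;
        }
        value.set(count);
        processed = true;
        return true;
    }

    @Override
    public Text getCurrentKey() { return key; }

    @Override
    public LongWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() throws IOException {
        if (asnin != null) asnin.close();
    }
}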

The Driver is also simple.
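
Something along these lines (again a simplified sketch, not the exact repo code; it runs map-only, so the identity mapper just passes the (filename, count) pairs through):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "ASN.1 CDR record count");
        job.setJarByClass(Driver.class);

        // The InputFormat sketched above (assumed to be in the same package).
        job.setInputFormatClass(RawFileAsBinaryInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // No mapper or reducer is set: the default (identity) mapper passes
        // (filename, count) straight through, and zero reducers makes it map-only.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new Driver(), args));
    }
}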


Get the full code on GitHub. The code here is the simplest way to handle binary data files, and there is plenty that could be added for better performance. If the data files are large enough, adding splitting logic may be worthwhile. If the data files are small, it may be worth using a map job to group them into SequenceFiles, or to convert them into Avro files.

Update: Links to Part 1, Part 2, Part 3.

Monday, June 16, 2014

Processing ASN.1 Call Detail Records with Hadoop (using Bouncy Castle) Part 2

The Stand-alone Decoder

Now that we have created sample data, we can create a simple decoder with the Bouncy Castle library.


The decompressStream method is a little overkill, but it lets the sample data be compressed and still be handled fine. It adds a dependency on commons-compress, but it can also be removed easily (just change it to return the input stream unchanged).
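
For reference, such a method can be as simple as the sketch below, using commons-compress auto-detection (this is an approximation, not necessarily line-for-line what is in the repo):

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.compressors.CompressorException;
import org.apache.commons.compress.compressors.CompressorStreamFactory;

// ...inside the decoder class:
public static InputStream decompressStream(InputStream input) throws IOException {
    // CompressorStreamFactory needs mark/reset support to sniff the magic bytes
    BufferedInputStream buffered = new BufferedInputStream(input);
    try {
        // Auto-detects gzip, bzip2, xz, etc. and wraps the stream with the matching decompressor
        return new CompressorStreamFactory().createCompressorInputStream(buffered);
    } catch (CompressorException e) {
        // Not compressed (or unrecognized format): just hand back the buffered raw stream
        return buffered;
    }
}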

To iterate through the ASN.1 file, we keep grabbing objects from ASN1InputStream with readObject. Once we have an object, we use it to create a CallDetailRecord instance.
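
The read loop is roughly the fragment below. CallDetailRecord is the holder class from this series, so the constructor and getters shown here are assumptions; check the repo for the exact signatures:

// fragment from the decoder's main method; filename is the path to the DER file
// uses org.bouncycastle.asn1.{ASN1InputStream, ASN1Primitive, ASN1Sequence} and java.io.FileInputStream
ASN1InputStream asnin = new ASN1InputStream(decompressStream(new FileInputStream(filename)));
ASN1Primitive obj;
while ((obj = asnin.readObject()) != null) {
    // assumed constructor: build a CallDetailRecord from the decoded SEQUENCE
    CallDetailRecord cdr = new CallDetailRecord((ASN1Sequence) obj);
    System.out.println("Record " + cdr.getRecordNumber() + ": "
        + cdr.getCallingNumber() + " -> " + cdr.getCalledNumber()
        + " (" + cdr.getDuration() + "s)");
}
asnin.close();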


Using Bouncy Castle requires some digging into the data format to figure out the expected set of classes. Now that the decoder is complete, we can move on to the Map/Reduce job. We didn't have to create a decoder and could have jumped straight into the Map/Reduce job, but writing a simple decoder the first time I tackle a binary format has always saved me time.

Update: Links to Part 1, Part 2, Part 3.

Monday, May 26, 2014

Processing ASN.1 Call Detail Records with Hadoop (using Bouncy Castle)

In these posts I describe using the Bouncy Castle Java library to process Call Detail Records (CDRs) in ASN.1 format (encoded as DER). The same process should work for any ASN.1 data encoded as DER.

I hope to replicate this with an ASN.1 Java compiler (BinaryNotes), but right now bnotes does not handle indefinite-length encoding. With an ASN.1 compiler, the compiler creates the Java classes from the ASN.1 specification, so I don't have to manually create the classes that hold the data.

Creating some data

First, I need a specification I can work with and post. I created a "Simple Generic CDR" ASN.1 specification/schema:

GenericCDR-Schema DEFINITIONS IMPLICIT TAGS ::=
BEGIN
GenericCallDataRecord ::= SEQUENCE {
    recordNumber [APPLICATION 2] IMPLICIT INTEGER,
    callingNumber [APPLICATION 8] IMPLICIT UTF8String (SIZE(1..20)),
    calledNumber [APPLICATION 9] IMPLICIT UTF8String (SIZE(1..20)),
    startDate [APPLICATION 16] IMPLICIT UTF8String (SIZE(8)),
    startTime [APPLICATION 18] IMPLICIT UTF8String (SIZE(6)),
    duration [APPLICATION 19] IMPLICIT INTEGER
}
END

For production data, the ASN.1 specification (also called grammar) would come from the vendor producing the data.

The awesome OSS Nokalva people have an online schema checker/compiler and data encoder/decoder. (If you are looking for commercial support, I think you could fairly easily swap Bouncy Castle out for OSS Nokalva's ASN.1 Tools for Java, but I haven't tried it.)

To create data, paste the above schema into the Schema textbox at asn1-playground.oss.com and press Compile. "Compiled successfully." should show up below the textbox; if not, the Console Output textbox on the page should give some clues to the problem.

Next, paste in some text-formatted data to encode. In the Data: Encode textbox, paste in:

first-cdr GenericCallDataRecord ::=
{
    recordNumber 1,
    callingNumber "15555550100",
    calledNumber "15555550101",
    startDate "20131016",
    startTime "134534",
    duration 65
}
second-cdr GenericCallDataRecord ::=
{
    recordNumber 2,
    callingNumber "15555550102",
    calledNumber "15555550104",
    startDate "20131016",
    startTime "134541",
    duration 52
}
third-cdr GenericCallDataRecord ::=
{
    recordNumber 3,
    callingNumber "15555550103",
    calledNumber "15555550102",
    startDate "20131016",
    startTime "134751",
    duration 62
}
fourth-cdr GenericCallDataRecord ::=
{
    recordNumber 4,
    callingNumber "15555550104",
    calledNumber "15555550102",
    startDate "20131016",
    startTime "134901",
    duration 72
}
fifth-cdr GenericCallDataRecord ::=
{
    recordNumber 5,
    callingNumber "15555550101",
    calledNumber "15555550100",
    startDate "20131016",
    startTime "135134",
    duration 32
}
And press Encode. The Console Output box should show 0 errors. To download the ASN.1 DER-encoded data, press the DER link below the Data: Encode textbox. The XML link is also worth downloading, since it is a human-readable representation of the same data.

I put the files from all the encoding options in the asn1data folder of the GitHub repo for this post.

Next I create a standalone decoder, then a Hadoop InputFormat and RecordReader, and finally run the Hadoop job to process the ASN.1 DER-encoded data we just created above.

Update: Links to Part 1, Part 2, Part 3.

Saturday, March 22, 2014

Pivotal HD pre-2.0 HDFS-Fuse workaround


I found an issue with HDFS-Fuse on Pivotal HD 1.1 (and the workaround). Hopefully this saves someone a few minutes of troubleshooting.

With a simple
/usr/bin/hadoop-fuse-dfs dfs://MyNameNode:8020 /mnt/hdfs

I received
bash: /usr/bin/hadoop-fuse-dfs: /bin/base: bad interpreter: No such file or directory

There are some typos in the file /usr/bin/hadoop-fuse-dfs. The following sed commands fix the file:
#Make a backup copy first
cp -p /usr/bin/hadoop-fuse-dfs /tmp/hadoop-fuse-dfs.`date +'%Y%m%d.%H%M'`.bak

#Now fix typos
#Fix the shebang (/bin/base -> /bin/bash)
sed -i 's|^#!/bin/base|#!/bin/bash|' /usr/bin/hadoop-fuse-dfs
#Fix modprob -> modprobe
sed -i 's|^/sbin/modprob fuse|/sbin/modprobe fuse|' /usr/bin/hadoop-fuse-dfs
#The 'Autodetect JAVA_HOME' comment swallowed the 'if' line that should follow it; re-insert it...
sed -i '/^# Autodetect JAVA_HOME if not definedif \[ -e \/usr\/libexec\/bigtop-detect-javahome \]; then/ a\
if [ -e /usr/libexec/bigtop-detect-javahome ]; then
' /usr/bin/hadoop-fuse-dfs
#...and close it with the matching 'fi'
sed -i '/^. \/usr\/lib\/bigtop-utils\/bigtop-detect-javahome/ a\
fi
' /usr/bin/hadoop-fuse-dfs
#Remove stray backslash escapes before $ and backticks, and a stray $ before /
sed -i 's|\\\$|$|g' /usr/bin/hadoop-fuse-dfs
sed -i 's|\\`|`|g' /usr/bin/hadoop-fuse-dfs
sed -i 's|\$/|/|g' /usr/bin/hadoop-fuse-dfs

Now hadoop-fuse-dfs works properly:
/usr/bin/hadoop-fuse-dfs dfs://MyNameNode:8020 /mnt/hdfs

ls /mnt/hdfs/
apps  benchmarks hive  mapred  test  tmp  user  yarn


Pivotal support said this should be fixed in Pivotal HD 2.0; until then, the simple sed workaround works fine.

Another option for machines that are not part of the cluster (e.g., client machines accessing HDFS) is to use HDFS-Fuse from the Bigtop repo; however, this may make support from Pivotal more difficult if issues arise.

Saturday, March 1, 2014

Testing Pivotal HD

I'm on a project where we are interfacing with EMC's Hadoop distribution, Pivotal HD. Pivotal differentiates itself from the more traditional Hadoop distributions by adding HAWQ (essentially Greenplum running on HDFS), USS (a translator that exposes remote data sources to Hadoop), and GemFire (an in-memory database) - essentially it is "Hadoop with Greenplum on top". The distributed SQL query engine HAWQ looks extremely nice and brings all the maturity of Greenplum with it. On another project, I am working around some unsupported methods in the Hive JDBC driver; with HAWQ, I am able to use the standard Postgres JDBC driver (even psql from the RHEL/CentOS repos connects fine).
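
As a quick example, connecting to HAWQ from Java is just a plain Postgres JDBC connection. The host, database, and credentials below are placeholders for whatever your cluster uses:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HawqJdbcSmokeTest {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; HAWQ speaks the Postgres wire protocol,
        // so the stock org.postgresql.Driver is all that is needed on the classpath.
        Class.forName("org.postgresql.Driver");
        String url = "jdbc:postgresql://hawq-master.example.com:5432/gpadmin";
        try (Connection conn = DriverManager.getConnection(url, "gpadmin", "changeme");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT version()")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}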

So far all my testing has been on local VMs so I don't have a feel for HAWQ performance. Once my work transitions to the cluster it will be interesting to see how it performs.

I normally shy away from closed-source projects when there are good open-source alternatives, since I really like being able to grab the source and take a look at what is happening (among other reasons). I would be using Pivotal for this project regardless, since that is out of my hands. However, HAWQ seems to have a lead on SQL implementation versus the competitors (Stinger/Hive on Tez, Impala, and Presto).

Hopefully I'll have enough time to look at how a custom PXF (roughly equivalent to a Hive SerDe) compares to a custom InputFormat and Hive SerDe for a binary format like a telco switch CDR file.

Saturday, February 15, 2014

CC licensed SS7 ISUP Call Flow Diagram

I needed a simple diagram showing the ISUP messages during a call as a visual aid for an analysis. Google Images didn't turn up any permissively licensed diagrams, so I created one. Hopefully it can save someone else the time of reinventing the wheel (or in this case, redrawing it...)

The telephone icon is from AIGA Symbol Signs by AIGA. License: Public Domain.
The switch icon is from Typicons by Stephen Hutchings. License: Creative Commons Attribution-Share Alike 3.0 Unported.
Both are untouched.



Image: Author: Andrew Coleman. License: Creative Commons Attribution 4.0 International (CC BY 4.0).
Hopefully Google Images will pick up the license, since it is in the comment section of the PNG...