Sunday, July 6, 2014

Custom Writable

I have never tackled a custom Writable before. I am a huge Avro (http://avro.apache.org/) fan, so I usually try to get my data converted to Avro early. A discussion got me interested in tackling one, and since I had my Bouncy Castle ASN.1 Hadoop example open, I extended it into a basic custom Writable example.

This thread by Oded Rosen was invaluable:
http://mail-archives.apache.org/mod_mbox/hadoop-general/201005.mbox/%3CAANLkTinzP8-nnGg8Q5aaJ8gXCCg6Som7e8Xarc_2PGDD@mail.gmail.com%3E
(also at http://osdir.com/ml/general-hadoop-apache/2010-05/msg00073.html if above is down)

I put the code in the package com.awcoleman.BouncyCastleGenericCDRHadoopWithWritable on GitHub.

The basics from the thread above and a bit of other reading are:
If your class will only be used as a value and not a key, implement the Writable interface.
If your class will be used as a key (and possibly a value), implement the WritableComparable interface (which extends Writable).

A Writable must have 3 things:
An empty constructor. There can be other constructors with arguments, but there must be a no-argument one as well.
An overridden write method to serialize the object's fields.
An overridden readFields method to populate an object from the output of a previous write.

Hadoop reuses Writable objects, so clearing all fields before repopulating them in readFields prevents surprises from stale data left over from the previous record.
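
To make that concrete, below is a minimal sketch of a value-only Writable. The class and field names (SimpleCdrWritable, recordNumber, callingNumber) are hypothetical illustrations, not the classes in my GitHub package.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Minimal sketch of a value-only Writable. Names are hypothetical.
public class SimpleCdrWritable implements Writable {
    private long recordNumber;
    private String callingNumber;

    // Required no-argument constructor; Hadoop creates instances via reflection.
    public SimpleCdrWritable() {}

    public SimpleCdrWritable(long recordNumber, String callingNumber) {
        this.recordNumber = recordNumber;
        this.callingNumber = callingNumber;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(recordNumber);
        out.writeUTF(callingNumber);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Reset fields first: Hadoop reuses Writable instances, so stale
        // values from the previous record must not leak through.
        recordNumber = 0;
        callingNumber = null;
        recordNumber = in.readLong();
        callingNumber = in.readUTF();
    }
}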

WritableComparable adds to Writable:
An overridden hashCode method, which the default HashPartitioner uses to partition keys.
An overridden compareTo method to sort keys.

The advice given in the 'How to write a complex Writable' thread adds (see the combined sketch after this list):
Override the equals method.
Implement RawComparator for your type. This post (http://vangjee.wordpress.com/2012/03/30/implementing-rawcomparator-will-speed-up-your-hadoop-mapreduce-mr-jobs-2/) has an example that extends WritableComparator, which implements RawComparator.
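
Putting the key-side pieces together, here is a hedged sketch of a WritableComparable with hashCode, compareTo, equals, and a raw comparator registered by extending WritableComparator, in the same spirit as the linked post. The names (CdrKeyWritable and its single recordNumber field) are hypothetical, not from my GitHub code.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical key type: a single long record number.
public class CdrKeyWritable implements WritableComparable<CdrKeyWritable> {
    private long recordNumber;

    public CdrKeyWritable() {}

    public CdrKeyWritable(long recordNumber) { this.recordNumber = recordNumber; }

    @Override
    public void write(DataOutput out) throws IOException { out.writeLong(recordNumber); }

    @Override
    public void readFields(DataInput in) throws IOException { recordNumber = in.readLong(); }

    @Override
    public int compareTo(CdrKeyWritable other) {
        return Long.compare(recordNumber, other.recordNumber);
    }

    @Override
    public int hashCode() { // used by the default HashPartitioner to choose a reducer
        return (int) (recordNumber ^ (recordNumber >>> 32));
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CdrKeyWritable && ((CdrKeyWritable) o).recordNumber == recordNumber;
    }

    // Raw comparator: compares the serialized bytes without deserializing the keys.
    public static class Comparator extends WritableComparator {
        public Comparator() { super(CdrKeyWritable.class); }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return Long.compare(readLong(b1, s1), readLong(b2, s2));
        }
    }

    static {
        // Register the raw comparator for this key type.
        WritableComparator.define(CdrKeyWritable.class, new Comparator());
    }
}

The byte-level compare is what gives the speedup described in the linked post: the framework can sort serialized keys without deserializing each one.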

In my example on GitHub, I only tested Writable, since I pull individual fields and wrap them as Text or LongWritable for the keys.


Wednesday, July 2, 2014

Processing ASN.1 Call Detail Records with Hadoop (using Bouncy Castle) Part 3

Finally we get to the Hadoop Map/Reduce job...

We created the data and a simple decoder to test it, so now we can move the decoding logic into a RecordReader.

The InputFormat we create is very simple: override isSplitable to return false and return our RecordReader, RawFileRecordReader.

package com.awcoleman.BouncyCastleGenericCDRHadoop;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

/**
 * Input Format for "Simple Generic CDR"
 * Reads each ASN.1 file whole (files are not split).
 *
 * @author awcoleman
 * @version 20140522
 * license: Apache License 2.0; http://www.apache.org/licenses/LICENSE-2.0
 */
public class RawFileAsBinaryInputFormat extends FileInputFormat<Text, LongWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, LongWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException,
            InterruptedException {
        return new RawFileRecordReader();
    }
}

The RecordReader does the bulk of the work.

package com.awcoleman.BouncyCastleGenericCDRHadoop;

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.bouncycastle.asn1.ASN1InputStream;
import org.bouncycastle.asn1.ASN1Primitive;
import org.bouncycastle.asn1.ASN1Sequence;

/**
 * Record Reader for "Simple Generic CDR"
 * Reads an entire ASN.1 file; the key is the filename, the value is the record count.
 *
 * @author awcoleman
 * @version 20140522
 * license: Apache License 2.0; http://www.apache.org/licenses/LICENSE-2.0
 */
public class RawFileRecordReader extends RecordReader<Text, LongWritable> {

    private Path path;
    private InputStream is;
    private FSDataInputStream fsin;
    private ASN1InputStream asnin;
    private ASN1Primitive obj;
    private Text currentKey;
    private LongWritable currentValue;
    private boolean isProcessed = false;

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (isProcessed) return false;

        currentKey = new Text(path.getName());

        int recordCounter = 0;
        while ((obj = asnin.readObject()) != null) {
            CallDetailRecord thisCdr = new CallDetailRecord((ASN1Sequence) obj);
            recordCounter++;

            System.out.println("CallDetailRecord " + thisCdr.getRecordNumber()
                + " Calling " + thisCdr.getCallingNumber()
                + " Called " + thisCdr.getCalledNumber()
                + " Start Date-Time " + thisCdr.getStartDate() + "-" + thisCdr.getStartTime()
                + " duration " + thisCdr.getDuration());
        }
        isProcessed = true;

        //Return number of records
        currentValue = new LongWritable(recordCounter);
        return true;
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return currentKey;
    }

    @Override
    public LongWritable getCurrentValue() throws IOException, InterruptedException {
        return currentValue;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return isProcessed ? 1 : 0;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        path = ((FileSplit) split).getPath();
        FileSystem fs = path.getFileSystem(conf);
        fsin = fs.open(path); //assign to the field (not a local) so close() can close it
        is = decompressStream(fsin);
        asnin = new ASN1InputStream(is);
    }

    @Override
    public void close() throws IOException {
        asnin.close();
        is.close();
        if (fsin != null) fsin.close();
    }

    public static InputStream decompressStream(InputStream input) {
        InputStream returnStream = null;
        org.apache.commons.compress.compressors.CompressorInputStream cis = null;
        BufferedInputStream bis = null;
        try {
            bis = new BufferedInputStream(input);
            bis.mark(1024); //Mark stream to reset if uncompressed data
            cis = new org.apache.commons.compress.compressors.CompressorStreamFactory().createCompressorInputStream(bis);
            returnStream = cis;
        } catch (org.apache.commons.compress.compressors.CompressorException ce) { //CompressorStreamFactory throws CompressorException for uncompressed files
            try {
                bis.reset();
            } catch (IOException ioe) {
                String errmessageIOE = "IO Exception ( " + ioe.getClass().getName() + " ) : " + ioe.getMessage();
                System.out.println(errmessageIOE);
            }
            returnStream = bis;
        } catch (Exception e) {
            String errmessage = "Exception ( " + e.getClass().getName() + " ) : " + e.getMessage();
            System.out.println(errmessage);
        }
        return returnStream;
    }
}

RawFileRecordReader simply returns the filename as the key and the count of ASN.1 records in the file as the value. We can change that to something more useful in a later post.

The Driver is also simple.

package com.awcoleman.BouncyCastleGenericCDRHadoop;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Basic Hadoop Driver (with Mapper and Reducer included) for "Simple Generic CDR"
 *
 * @author awcoleman
 * @version 20140522
 * license: Apache License 2.0; http://www.apache.org/licenses/LICENSE-2.0
 */
public class BasicDriverMapReduce extends Configured implements Tool {

    public static class BasicMapper extends Mapper<Text, LongWritable, Text, LongWritable> {
        @Override
        public void map(Text key, LongWritable value, Context context) throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static class BasicReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        public void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long total = 0; //per-key total; a local so counts do not carry over between keys
            for (LongWritable val : values) {
                total += val.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public int run(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("Missing input and output filenames. Exiting.");
            System.exit(1);
        }

        Job job = Job.getInstance(super.getConf());
        job.setJarByClass(BasicDriverMapReduce.class);
        job.setJobName("BasicDriver1");
        job.setMapperClass(BasicMapper.class);
        job.setReducerClass(BasicReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setInputFormatClass(RawFileAsBinaryInputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int ret = ToolRunner.run(conf, new BasicDriverMapReduce(), args);
        System.exit(ret);
    }
}

Get the full code on GitHub. The code here is the simplest way to handle binary data files, and there are plenty of things that could be added for better performance. If the data files are large enough, adding splitting logic may be worthwhile. If the data files are small, it may be worth using a job to group them into sequence files, or converting them to Avro files. A rough sketch of the sequence-file idea follows.
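
For the small-file case, here is one way the packing step could look outside of MapReduce: write each raw file into a single SequenceFile keyed by filename. The class and argument names below are illustrative assumptions, not code from the GitHub repository.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs a directory of small raw CDR files into one SequenceFile:
// key = filename, value = the file's bytes. Illustrative sketch only.
public class PackSmallFiles {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);
        Path outputSeq = new Path(args[1]);

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(outputSeq),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDirectory()) continue;

                // Read the whole small file into memory.
                byte[] contents = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    in.readFully(0, contents);
                } finally {
                    in.close();
                }

                writer.append(new Text(status.getPath().getName()),
                        new BytesWritable(contents));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

A job could then read the packed file with SequenceFileInputFormat and decode each BytesWritable payload with the same Bouncy Castle logic used in the RecordReader above.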

Update: Links to Part 1, Part 2, Part 3.