Wednesday, July 2, 2014

Processing ASN.1 Call Detail Records with Hadoop (using Bouncy Castle) Part 3

Finally we get to the Hadoop Map/Reduce job...

In the earlier parts we generated the test data and wrote a simple decoder to check it, so now we can take that decoding logic and put it in a RecordReader.

The InputFormat we create is very simple: it overrides isSplitable to return false, so each file is handed to a single mapper intact, and it returns our RecordReader, named RawFileRecordReader.


The RecordReader does the bulk of the work.


RawFileRecordReader simply returns the filename and the count of the ASN.1 records in the file. We can change that to something more useful in a later post.
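To make the counting step concrete, here is a minimal, dependency-free sketch of the idea. It is not the post's RecordReader, which decodes with Bouncy Castle. It is a stand-in that walks the tag and length headers of back-to-back BER/DER elements and counts them. The class and method names (Asn1RecordCounter, countRecords) are hypothetical, and the sketch assumes definite-length encoding with records concatenated directly in the file.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of the original project: counts top-level
// ASN.1 (BER/DER) elements by reading each tag/length header and skipping
// over the content bytes.
final class Asn1RecordCounter {

    /** Counts top-level definite-length TLV elements in the stream. */
    static long countRecords(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        long count = 0;
        int tag;
        // A clean end of stream between two records ends the loop.
        while ((tag = in.read()) != -1) {
            if ((tag & 0x1F) == 0x1F) {
                // High-tag-number form: continuation bytes have the top bit set.
                while ((in.readUnsignedByte() & 0x80) != 0) {
                    // keep reading tag bytes
                }
            }
            skipFully(in, readLength(in));
            count++;
        }
        return count;
    }

    /** Reads a short-form or long-form definite length. */
    private static long readLength(DataInputStream in) throws IOException {
        int first = in.readUnsignedByte();
        if (first < 0x80) {
            return first; // short form: the byte is the length
        }
        int numBytes = first & 0x7F;
        if (numBytes == 0) {
            throw new IOException("indefinite length is not handled in this sketch");
        }
        if (numBytes > 4) {
            throw new IOException("length field of " + numBytes + " bytes is too large");
        }
        long length = 0;
        for (int i = 0; i < numBytes; i++) {
            length = (length << 8) | in.readUnsignedByte();
        }
        return length;
    }

    /** Skips exactly n bytes, failing if the stream ends first. */
    private static void skipFully(DataInputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                // skip() may legitimately return 0, so probe with a single read.
                if (in.read() == -1) {
                    throw new EOFException("truncated record");
                }
                skipped = 1;
            }
            n -= skipped;
        }
    }
}
```

In a RecordReader, a loop like this would run once over the file's input stream, and the resulting count would be emitted as the value with the filename as the key. A real decoder such as Bouncy Castle's is still needed as soon as you want the fields inside each record rather than just the count.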

The Driver is also simple.


Get the full code on GitHub. The code here is the simplest way to handle binary data files, and there is plenty that could be added for better performance. If the data files are large enough, adding splitting logic may be worthwhile. If the data files are small, it may be worth using a map job to group them into SequenceFiles, or to convert them into Avro files.

Update: Links to Part 1, Part 2, Part 3.
