Monday, July 3, 2017

Giraph Error: Could not find or load main class org.apache.giraph.yarn.GiraphApplicationMaster

A very old post from 2014 that got lost in my drafts. Posting it now in the hope that it helps someone out.

Often Google acts like magic for me: type in my error, and out pops the solution. Not so for a Giraph error I recently hit. Hopefully this post lets Google work like magic for someone else :)

After installing Giraph on a BigTop 0.7 VM, I was able to run the PageRank benchmark, which takes no input or output, but nothing more complicated.

This works:
hadoop jar /usr/share/doc/giraph-1.0.0.5/giraph-examples-1.0.0-for-hadoop-2.0.6-alpha-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -Dgiraph.zkList=127.0.0.1:2181 -libjars /usr/lib/giraph/giraph-1.0.0-for-hadoop-2.0.6-alpha-jar-with-dependencies.jar -e 1 -s 3 -v -V 50 -w 1

But this:
hadoop jar /usr/share/doc/giraph-1.0.0.5/giraph-examples-1.0.0-for-hadoop-2.0.6-alpha-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dgiraph.zkList=127.0.0.1:2181 -libjars /usr/lib/giraph/giraph.jar org.apache.giraph.examples.SimpleShortestPathsVertex -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/acoleman/giraphtest/tiny_graph.txt -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/acoleman/giraphtest/shortestpathsC2 -ca SimpleShortestPathsVertex.source=2 -w 1

does not.

Looking at the latest container logs with:
cat $(ls -1rtd $(ls -1rtd /var/log/hadoop-yarn/containers/application_* | tail -1)/container_* | tail -1)/*

I find:
Error: Could not find or load main class org.apache.giraph.yarn.GiraphApplicationMaster

I beat my head against the wall trying to add jars to -libjars and to -yj, and copying jars into every directory I could find.

I stumbled across
http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3C198091226.KO6f1kuK42@chronos7%3E

which gives the answer: if the patch from https://issues.apache.org/jira/browse/GIRAPH-814 hasn't been applied, then mapreduce.application.classpath has to be set explicitly (with the Giraph jar included) or Giraph simply won't work.

vi /etc/hadoop/conf.pseudo/mapred-site.xml
  <property>
    <name>mapreduce.application.classpath</name>
    <value>/usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-mapreduce/lib/*,/usr/lib/giraph/giraph-1.0.0-for-hadoop-2.0.6-alpha-jar-with-dependencies.jar</value>
  </property>

I did not need to restart yarn-resourcemanager or yarn-nodemanager for this to get picked up.

DropWizard and Hive (and/or Impala)

I have a small DropWizard/D3.js/jqGrid application to visualize the results of some analysis. I had been taking the results from HDFS and shoveling them into MySQL (with Sqoop) to examine samples. That worked well enough that I wanted to go straight to the source, and with DropWizard it should be easy enough to wrap the data in a Hive external table and use the Hive JDBC driver instead of the MySQL one.

If you are already familiar with DropWizard and just need an example, examine the pom.xml and config-hive.yaml files in my example application on GitHub.

To pull in Hive JDBC and its dependencies, add to pom.xml:

       
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.1.0</version>
    <exclusions>
      <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
      </exclusion>
      <exclusion>
        <groupId>com.sun.jersey</groupId>
        <artifactId>*</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0</version>
    <exclusions>
      <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
      </exclusion>
      <exclusion>
        <groupId>com.sun.jersey</groupId>
        <artifactId>*</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
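
Once the driver and its dependencies are on the classpath, the query path is just plain JDBC against HiveServer2. Here is a minimal sketch; this is not the code from my app (which pulls its connection settings from config-hive.yaml), and the host, user, and table names below are placeholders to adjust for your cluster:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class HiveJdbcSmokeTest {
    public static void main(String[] args) throws Exception {
      // Older hive-jdbc builds don't always self-register, so load the driver explicitly.
      Class.forName("org.apache.hive.jdbc.HiveDriver");

      // Placeholder HiveServer2 endpoint -- adjust host, port, and database for your cluster.
      String url = "jdbc:hive2://hiveserver2.example.com:10000/default";

      try (Connection conn = DriverManager.getConnection(url, "myuser", "");
           Statement stmt = conn.createStatement();
           ResultSet rs = stmt.executeQuery("SELECT * FROM my_external_table LIMIT 10")) {
        while (rs.next()) {
          // Columns are whatever the external table defines; print the first one as a sanity check.
          System.out.println(rs.getString(1));
        }
      }
    }
  }

In the DropWizard app the same url, user, and driver class simply move into config-hive.yaml instead of being hard-coded.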

      
 



Sunday, June 18, 2017

Pinebook!

Not Hadoop-related, but awesome all the same. A few months ago I stumbled on PINE64's website and saw the Pinebook, a Linux arm64 laptop. That and a PocketCHIP made a great late-birthday, early-Father's-Day set of presents.

Build and shipping take a couple of months, shipping was almost a third of the laptop's cost, and the performance and keyboard quality are exactly what you would expect :) But it is still a fun bit of hardware.

If you decide to get one, make sure to add on a USB-to-H-barrel power cord (or make your own). The Pinebook does come with a power supply, but there's no point in carting around yet another wall wart when the Pinebook happily charges off a phone charger.

Mine powered right up into Xenial. I'm normally RH-based, since everywhere I've been employed in the last couple of decades has been, so it's nice to jump back to something Debian-based.

aarch64 wasn't in mainline Rust yet, but it was in the nursery, so
curl -sSf https://raw.githubusercontent.com/rust-lang-nursery/rustup.rs/master/rustup-init.sh | bash
worked just fine and got me up and running with Rust.




Update: Hackaday has a great write-up. I didn't experience any of the screen issues they had since I have the 14", but the page has a good teardown and an overview of performance (which is not much :) ).

Thursday, February 23, 2017

Writing ORC files is easier than a few years ago

Several years ago I was asked to compare writing the Parquet and ORC file formats from standalone Java (without using the Hadoop libraries). At the time ORC had not yet been separated from Hive, and it was much more involved than writing Parquet from Java. It looks like that changed in 2015, but I only revisited the issue within the past few months.

To build ORC:
Download the current release (currently 1.3.2)
tar xzvf orc-1.3.2.tar.gz && cd ./orc-1.3.2/
cd ./java
mvn package

ls -la ./tools/target/orc-tools-1.3.2-uber.jar

A simple example of writing is:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

/*
 * Basic code from https://orc.apache.org/docs/core-java.html#writing-orc-files
 * Using Core Java - Writing ORC Files
 *
 * orc-tools-X.Y.Z-uber.jar is required in the runtime classpath for io/airlift/compress/Decompressor
 *
 * Creates myfile.orc AND .myfile.orc.crc, fails if myfile.orc exists.
 *
 * awcoleman@gmail.com
 */
public class WriteORCFileWORCCore {

  public WriteORCFileWORCCore() throws IllegalArgumentException, IOException {
    String outfilename = "/tmp/myfile.orc";
    Configuration conf = new Configuration(false);

    /*
     * Writer is in orc-core-1.2.1.jar and has dependencies on the
     * Hadoop HDFS client libs
     */
    TypeDescription schema = TypeDescription.fromString("struct<x:int,y:int>");
    Writer writer = OrcFile.createWriter(new Path(outfilename),
        OrcFile.writerOptions(conf).setSchema(schema));

    /*
     * VectorizedRowBatch and LongColumnVector are in hive-storage-api-2.1.1-pre-orc.jar
     */
    VectorizedRowBatch batch = schema.createRowBatch();
    LongColumnVector x = (LongColumnVector) batch.cols[0];
    LongColumnVector y = (LongColumnVector) batch.cols[1];

    for (int r = 0; r < 10000; ++r) {
      int row = batch.size++;
      x.vector[row] = r;
      y.vector[row] = r * 3;
      // If the batch is full, write it out and start over.
      if (batch.size == batch.getMaxSize()) {
        writer.addRowBatch(batch);
        batch.reset();
      }
    }

    // Write the last partial batch out and close the writer
    writer.addRowBatch(batch);
    writer.close();

    // Output info to console
    System.out.println("Wrote " + writer.getNumberOfRows() + " records to ORC file "
        + (new Path(outfilename).toString()));
  }

  public static void main(String[] args) throws IllegalArgumentException, IOException {
    @SuppressWarnings("unused")
    WriteORCFileWORCCore mainobj = new WriteORCFileWORCCore();
  }
}
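
The same pattern extends to other column types; each ORC type maps to a ColumnVector subclass. As a quick sketch (not part of the example above; the class name, file name, and schema here are made up for illustration), a string column is handled with BytesColumnVector:

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class WriteORCWithStringColumn {
  public static void main(String[] args) throws IllegalArgumentException, IOException {
    Configuration conf = new Configuration(false);
    TypeDescription schema = TypeDescription.fromString("struct<name:string,id:int>");
    Writer writer = OrcFile.createWriter(new Path("/tmp/myfile_strings.orc"),
        OrcFile.writerOptions(conf).setSchema(schema));

    VectorizedRowBatch batch = schema.createRowBatch();
    BytesColumnVector name = (BytesColumnVector) batch.cols[0];
    LongColumnVector id = (LongColumnVector) batch.cols[1];

    // Some hive-storage-api versions need the shared byte buffer initialized before setVal().
    name.initBuffer();

    for (int r = 0; r < 1000; ++r) {
      int row = batch.size++;
      // setVal copies the bytes for this row into the column's shared buffer.
      name.setVal(row, ("name-" + r).getBytes(StandardCharsets.UTF_8));
      id.vector[row] = r;
      if (batch.size == batch.getMaxSize()) {
        writer.addRowBatch(batch);
        batch.reset();
      }
    }
    if (batch.size > 0) {
      writer.addRowBatch(batch);
    }
    writer.close();
  }
}

If the byte arrays can be shared rather than copied, setRef() is the cheaper alternative to setVal().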

And a simple example of reading is:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.TypeDescription;

/*
 * Basic code from https://orc.apache.org/docs/core-java.html#reading-orc-files
 * Using Core Java - Reading ORC Files
 *
 * Reads ORC file written by WriteORCFileWORCCore
 *
 * orc-tools-X.Y.Z-uber.jar is required in the runtime classpath for io/airlift/compress/Decompressor
 *
 * awcoleman@gmail.com
 */
public class ReadORCFileWORCCore {

  public ReadORCFileWORCCore() throws IllegalArgumentException, IOException {
    String infilename = "/tmp/myfile.orc";
    Configuration conf = new Configuration(false);

    Reader reader = OrcFile.createReader(new Path(infilename),
        OrcFile.readerOptions(conf));
    RecordReader rows = reader.rows();
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();

    // Some basic info about the ORC file
    TypeDescription schema = reader.getSchema();
    long numRecsInFile = reader.getNumberOfRows();
    System.out.println("Reading ORC file " + (new Path(infilename).toString()));
    System.out.println("ORC file schema: " + schema.toJson());
    System.out.println("Number of records in ORC file: " + numRecsInFile);

    while (rows.nextBatch(batch)) {
      System.out.println("Processing batch of records from ORC file. Number of records in batch: " + batch.size);
      LongColumnVector field1 = (LongColumnVector) batch.cols[0];
      LongColumnVector field2 = (LongColumnVector) batch.cols[1];
      for (int r = 0; r < batch.size; ++r) {
        int field1rowr = (int) field1.vector[r];
        int field2rowr = (int) field2.vector[r];
        System.out.println("In this batch, for row " + r + ", field1 is: " + field1rowr + " and field2 is: " + field2rowr);
      }
    }
    rows.close();
  }

  public static void main(String[] args) throws IllegalArgumentException, IOException {
    @SuppressWarnings("unused")
    ReadORCFileWORCCore mainObj = new ReadORCFileWORCCore();
  }
}
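
One thing this simple reader glosses over is null handling: a ColumnVector only populates isNull[] when noNulls is false, and when isRepeating is set, row 0 stands in for every row in the batch. Here is a small hedged helper (the class and method names are mine, not from the ORC API) that a read loop could use before trusting vector[r]:

import org.apache.hadoop.hive.ql.exec.vector.ColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;

public class ColumnVectorUtil {

  // True if row r of this column is null.
  public static boolean isNullAt(ColumnVector col, int r) {
    // When isRepeating is set, row 0 describes every row in the batch.
    int idx = col.isRepeating ? 0 : r;
    return !col.noNulls && col.isNull[idx];
  }

  // Read a long value, respecting isRepeating, or return the default when the row is null.
  public static long longAt(LongColumnVector col, int r, long dflt) {
    if (isNullAt(col, r)) {
      return dflt;
    }
    return col.vector[col.isRepeating ? 0 : r];
  }
}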