Thursday, June 10, 2021

Trino and Zero-Length Parquet Files in HDFS Part 2

Continuing from Part 1.

The test application, which writes parquet files over the course of an hour, is on GitHub at https://github.com/awcoleman/example_trino_hdfs_zero_length

We can run the app in quick mode to populate some hive-style directories with data:
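The exact flags depend on the app's README; a run looks roughly like this (the jar name and options below are illustrative placeholders, not the app's real arguments):

# hypothetical quick-mode run of the test writer; adjust the jar and flags to match the repo
hadoop jar example_trino_hdfs_zero_length.jar \
  --mode quick \
  --output hdfs:///data/events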



Create a table from those hive-style directories:
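With the Trino Hive connector this can be done straight from the trino CLI. The catalog, schema, columns, and location below are illustrative; the important parts are external_location, the PARQUET format, and partitioned_by (partition columns must come last in the column list):

trino --execute "
CREATE TABLE hive.default.events (
  id bigint,
  payload varchar,
  dt varchar
)
WITH (
  external_location = 'hdfs:///data/events',
  format = 'PARQUET',
  partitioned_by = ARRAY['dt']
)"

(The same thing could be done from beeline with CREATE EXTERNAL TABLE ... PARTITIONED BY ... STORED AS PARQUET.)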



And tell the Hive Metastore about these partitions:
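One way is the Hive connector's sync_partition_metadata procedure, run from the trino CLI (the schema and table names are the illustrative ones from above); MSCK REPAIR TABLE from beeline would do the same job:

trino --execute "CALL hive.system.sync_partition_metadata(schema_name => 'default', table_name => 'events', mode => 'FULL')"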



(Here we just tell the Hive Metastore to refresh all partitions; we could have added individual partitions instead.)


And query data with Trino:
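For example, a simple per-partition count over the illustrative table from above:

trino --execute "SELECT dt, count(*) FROM hive.default.events GROUP BY dt ORDER BY dt"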



Now imagine if a legacy application opened a parquet file in an older partition directory and held that file open while waiting to see if there was any more incoming old data. We can use our test application to simulate that:
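Again, the option below is just a placeholder for whatever the test app actually calls this mode; the point is that it opens a new parquet file in an old partition directory and leaves it open:

# hypothetical "hold a file open" run; see the repo's README for the real invocation
hadoop jar example_trino_hdfs_zero_length.jar \
  --mode hold-open \
  --output hdfs:///data/events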



We can see the hdfs directory now has another file:
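Listing the partition directory (the illustrative path from above) shows the new, still-open file sitting at 0 bytes next to the finished ones:

hdfs dfs -ls /data/events/dt=2021-06-10/
# the newly opened file reports a length of 0 until the writer closes it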



If we run the same query in Trino again, we get an error:
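Re-running the same count from before:

trino --execute "SELECT dt, count(*) FROM hive.default.events GROUP BY dt ORDER BY dt"
# this time the query fails: Trino tries to scan the zero-length file, which has no parquet footer yet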


And the Trino server.log shows us the issue is in the footer:


Queries in other directories without open parquet files work fine:
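A query that prunes down to partitions with no open files never touches the bad directory, so it still succeeds (the dt value is just an example):

trino --execute "SELECT count(*) FROM hive.default.events WHERE dt = '2021-06-09'"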



Trino and Zero-Length Parquet Files in HDFS Part 1

Trino (formerly Presto) is a great distributed query engine. It allows one to use SQL to query data in parquet files.

Parquet files have file metadata in a footer at the end of the file. The footer is written when the parquet file is closed.
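A quick way to see this on a closed file: the last four bytes of a finished parquet file are the magic string PAR1 (the path below is illustrative):

hdfs dfs -cat /data/events/dt=2021-06-09/part-00000.parquet | tail -c 4
# prints PAR1 for a closed file; a file still being written (or zero length) has no footer yet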


I have a client with a legacy application that writes parquet files to HDFS or S3 in a hive partition structure. Parquet files written to S3 do not exist until they are closed; those written to HDFS, however, show as zero-length files until they are closed. This can be a problem for Trino, since the parquet footer has not been written yet.


Trino can use the Hive Metastore as its metastore, holding the information that links tables to data files. Hive Metastore partitions operate at the directory level, not the file level, which means every file in a partition's directory is treated as part of that partition.
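In other words, with a hive-style layout like the sketch below (paths illustrative), the partition dt=2021-06-10 is simply "everything under that directory", open or zero-length files included:

hdfs dfs -ls -R /data/events
# /data/events/dt=2021-06-09/part-00000.parquet
# /data/events/dt=2021-06-10/part-00000.parquet
# /data/events/dt=2021-06-10/part-00001.parquet   <- any file here belongs to the partition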


In later posts, we'll look at Apache Iceberg as an alternative to Hive Metastore to avoid this issue.


In the next posts, we set up a test environment to show this happening.

Wednesday, May 9, 2018

Apache Zeppelin Login Banner

Apache Zeppelin does not yet have an easy way to include a logon banner, login banner, warning banner, 'notice and consent banner', 'approved system use notification', security message, or whatever-you-want-to-call-it.

A client needed one so I put together these instructions for a workaround. I plan on submitting a proper patch to Zeppelin if I can find some spare time. Hopefully someone else will do it first :)

These instructions use Hortonworks HDP 2.6.4.0-91 for specific locations. If you are using something else, you just need to change the path of the war and html file.

Zeppelin serves the login popup from /usr/hdp/2.6.4.0-91/zeppelin/webapps/webapp/components/login/login.html
It is possible to just change that file, but every time the Zeppelin service is restarted, the Zeppelin war is unpacked and login.html is overwritten.

We will alter the login.html copy inside the war file. This will have to be redone every time Zeppelin is upgraded (such as a new HDP release), but not during normal operations.
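Roughly, the steps are: pull login.html out of the war, add the banner markup, and push it back in. The war location below is a guess based on the HDP 2.6.4.0-91 layout; double-check where your install keeps zeppelin-web-*.war:

cd /usr/hdp/2.6.4.0-91/zeppelin
cp lib/zeppelin-web-*.war /tmp/zeppelin-web.war.bak    # keep a backup first
jar xf lib/zeppelin-web-*.war components/login/login.html
vi components/login/login.html                         # add the banner markup to the popup
jar uf lib/zeppelin-web-*.war components/login/login.html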

The result is a message that appears on the login popup.

Saturday, February 24, 2018

Very quick Domain Controller Cert Auth for testing

I needed to test certain scenarios for a client against a Microsoft Active Directory Domain Controller and Intermediate Certificate Authority. The easiest way was to use Vagrant with the mwrock/Windows2012R2 box.

I wasn't able to automate the complete install, but did get it to a set of cut-and-paste lines.

Code is at [ https://github.com/awcoleman/vagrant_win_ad_dc_ca_test ]

Copy Vagrantfile into new directory
Follow directions in README.txt

The next iteration will probably use Ansible support for Windows (unfortunately there is no CA module)

Monday, July 3, 2017

Giraph Error: Could not find or load main class org.apache.giraph.yarn.GiraphApplicationMaster

Very old post from 2014 that got lost in my drafts. Posting so hopefully this helps out someone.

Often Google acts like magic for me: type in my error, and out pops the solution. Not so for a Giraph error I recently hit. Hopefully this post lets Google work like magic for someone else :)

After installing Giraph on a BigTop 0.7 VM, I was able to run the benchmark (which takes no input or output), but nothing more complicated.

This works:
hadoop jar /usr/share/doc/giraph-1.0.0.5/giraph-examples-1.0.0-for-hadoop-2.0.6-alpha-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -Dgiraph.zkList=127.0.0.1:2181 -libjars /usr/lib/giraph/giraph-1.0.0-for-hadoop-2.0.6-alpha-jar-with-dependencies.jar -e 1 -s 3 -v -V 50 -w 1

But this:
hadoop jar /usr/share/doc/giraph-1.0.0.5/giraph-examples-1.0.0-for-hadoop-2.0.6-alpha-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dgiraph.zkList=127.0.0.1:2181 -libjars /usr/lib/giraph/giraph.jar org.apache.giraph.examples.SimpleShortestPathsVertex -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/acoleman/giraphtest/tiny_graph.txt -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/acoleman/giraphtest/shortestpathsC2 -ca SimpleShortestPathsVertex.source=2 -w 1

does not.

Looking at the latest container logs with:
cat $(ls -1rtd $(ls -1rtd /var/log/hadoop-yarn/containers/application_* | tail -1)/container_* | tail -1)/*

I find:
Error: Could not find or load main class org.apache.giraph.yarn.GiraphApplicationMaster

I beat my head against the wall adding jars to -libjars and -yj, and copying jars into every directory I could find.

I stumbled across
http://mail-archives.apache.org/mod_mbox/giraph-user/201312.mbox/%3C198091226.KO6f1kuK42@chronos7%3E

which gives the answer. If https://issues.apache.org/jira/browse/GIRAPH-814 hasn't been applied, then mapreduce.application.classpath has to be hard set or Giraph simply won't work.

vi /etc/hadoop/conf.pseudo/mapred-site.xml
  <property>
    <name>mapreduce.application.classpath</name>
    <value>/usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-mapreduce/lib/*,/usr/lib/giraph/giraph-1.0.0-for-hadoop-2.0.6-alpha-jar-with-dependencies.jar
    </value>
  </property>

I did not need to restart yarn-resourcemanager or yarn-nodemanager for this to get picked up.

DropWizard and Hive (and/or Impala)

I have a small DropWizard/D3.js/jqGrid application to visualize the results of some analysis. I had been taking the results of the analysis from HDFS and shoveling them into MySQL (with Sqoop) to examine samples. This is working well enough that I wanted to go straight to the source. With DropWizard this should be easy enough: wrap my data in a Hive external table and use the Hive JDBC driver instead of MySQL.

If you are already familiar with DropWizard and just need an example, examine the pom.xml and config-hive.yaml files in my example application on GitHub.

To pull in Hive JDBC and its dependencies, add to pom.xml:

       
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.1.0</version>
    <exclusions>
      <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
      </exclusion>
      <exclusion>
        <groupId>com.sun.jersey</groupId>
        <artifactId>*</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0</version>
    <exclusions>
      <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
      </exclusion>
      <exclusion>
        <groupId>com.sun.jersey</groupId>
        <artifactId>*</artifactId>
      </exclusion>
    </exclusions>
  </dependency>