Thursday, June 10, 2021

Trino and Zero-Length Parquet Files in HDFS Part 2

Continuing from Part 1.

The test application, which writes Parquet files over the course of an hour, is on GitHub at https://github.com/awcoleman/example_trino_hdfs_zero_length

We can run the app in quick mode to populate some hive-style directories with data:
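A minimal sketch of the invocation, assuming the jar is built from the repo above; the main class name, flag names, and HDFS path here are illustrative (check the repo's README for the real ones):

```shell
# Hypothetical invocation: "quick" mode writes the hour's worth of data
# immediately into Hive-style partition directories (dt=YYYY-MM-DD).
hadoop jar example_trino_hdfs_zero_length.jar \
  com.awcoleman.example.WriteParquet \
  --mode quick \
  --output hdfs:///data/example_db/example_table
```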



Create a table from those hive-style directories:
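A sketch of the DDL, issued through the Hive CLI; the database, table, column names, and location are assumptions matching the hypothetical output path above, not the repo's actual schema:

```shell
# External table over the Hive-style directory layout written above.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS example_db.example_table (
  id BIGINT,
  payload STRING
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 'hdfs:///data/example_db/example_table';
"
```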



And tell the Hive Metastore about these partitions:



(Here we just tell the Hive Metastore to refresh all partitions; we could instead have added individual partitions.)
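Something like the following, using the same hypothetical table name as above; `MSCK REPAIR TABLE` scans the table location and registers every partition directory it finds, while the commented-out `ALTER TABLE` form registers a single partition:

```shell
# Refresh all partitions at once.
hive -e "MSCK REPAIR TABLE example_db.example_table;"

# Alternative: register just one partition.
# hive -e "ALTER TABLE example_db.example_table
#          ADD IF NOT EXISTS PARTITION (dt='2021-06-09');"
```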


And query data with Trino:
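For example, via the Trino CLI; the server address and catalog name are placeholders for your deployment:

```shell
# Simple sanity query against the Hive catalog.
trino --server http://trino-coordinator:8080 --catalog hive \
  --execute "SELECT count(*) FROM example_db.example_table"
```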



Now imagine that a legacy application opened a Parquet file in an older partition directory and held it open while waiting to see whether any more old data arrived. We can use our test application to simulate that:
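A sketch of what that might look like; the mode flag and partition path are hypothetical stand-ins for whatever the test app actually exposes:

```shell
# Hypothetical "hold-open" mode: create a Parquet file in an old
# partition and keep the writer open without closing the stream.
hadoop jar example_trino_hdfs_zero_length.jar \
  com.awcoleman.example.WriteParquet \
  --mode hold-open \
  --output hdfs:///data/example_db/example_table/dt=2021-06-08
```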



We can see that the HDFS directory now contains another file:
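Listing the (hypothetical) partition directory from above shows the open file. Until the writer closes or hflushes the stream, the NameNode has no final block length to report, so the file lists with a length of 0:

```shell
# The held-open file appears with length 0 alongside the closed files.
hdfs dfs -ls hdfs:///data/example_db/example_table/dt=2021-06-08
```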



If we run the same query in Trino again, we get an error:
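Re-running the same query now fails, because Trino tries to read the Parquet footer of the zero-length file; the exact error text varies by Trino version, so it is not reproduced here:

```shell
# Same query as before; this time the split over the zero-length file
# fails during the Parquet footer read.
trino --server http://trino-coordinator:8080 --catalog hive \
  --execute "SELECT count(*) FROM example_db.example_table"
```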


And the Trino server.log shows that the issue is in the Parquet footer:
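One way to find the relevant stack trace, assuming a default log location (adjust the path for your installation):

```shell
# Pull the stack trace around the Parquet footer failure out of the log.
grep -i -B2 -A10 'parquet' /var/log/trino/server.log | tail -n 40
```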


Queries against other partition directories, with no open Parquet files, work fine:
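For example, pruning to a partition that contains only closed files (partition value is illustrative) succeeds, since Trino never schedules a split over the zero-length file:

```shell
# Partition pruning keeps the query away from the open file's directory.
trino --server http://trino-coordinator:8080 --catalog hive \
  --execute "SELECT count(*) FROM example_db.example_table
             WHERE dt = '2021-06-09'"
```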


