Andrew Coleman: Trino and Zero-Length Parquet Files in HDFS Part 1

Trino (formerly PrestoSQL) is a great distributed query engine. Among other things, it lets you use SQL to query data stored in Parquet files.

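Parquet files keep their file metadata (the schema and the location of each row group) in a footer at the end of the file. The footer is written only when the Parquet file is closed, so a file that is still being written cannot be read.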
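To make the footer layout concrete, here is a minimal sketch that checks whether a file ends the way a closed Parquet file should: a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The function name `has_parquet_footer` is my own, and the sketch assumes a file on a local filesystem with an unencrypted footer. It only checks the structure and does not parse the metadata itself.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"


def has_parquet_footer(path):
    """Return True if the file starts and ends like a closed Parquet file."""
    size = os.path.getsize(path)
    # Smallest possible file: leading magic (4) + footer length (4) + trailing magic (4).
    if size < 12:
        return False
    with open(path, "rb") as f:
        if f.read(4) != PARQUET_MAGIC:
            return False
        # The last 8 bytes are the footer length followed by the magic bytes.
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    (footer_len,) = struct.unpack("<I", tail[:4])
    # The footer must fit between the leading magic and the 8-byte tail.
    return tail[4:] == PARQUET_MAGIC and footer_len <= size - 12
```

A zero-length file fails the size check immediately, which is exactly the case described in this post.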

I have a client with a legacy application that writes Parquet files to HDFS or S3 in a Hive partition structure. Parquet files written to S3 do not exist until they are closed. Those written to HDFS, however, show up as zero-length files until they are closed. This can be a problem for Trino, since the Parquet footer has not been written yet when Trino tries to read the file.

Trino can use Hive Metastore as the metastore that holds the information linking tables to data files. Hive Metastore partitions operate at the directory level, not the file level. This means that every file listed in a partition's directory is treated as part of that partition, including files that are still being written.
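As an illustration of what "directory level" means, here is a small sketch that takes a Hive-style partition directory, reads the partition columns from the path, and lists every file in it along with its size. It uses a local directory as a stand-in for HDFS, and the names `parse_partition` and `list_partition_files` are hypothetical helpers of mine, not part of Hive or Trino.

```python
from pathlib import Path


def parse_partition(directory, table_root):
    """Extract key=value partition columns from a Hive-style directory path."""
    parts = Path(directory).relative_to(table_root).parts
    return dict(part.split("=", 1) for part in parts if "=" in part)


def list_partition_files(directory):
    """Return (name, size) for every file in the directory.

    Nothing here looks inside the files: a file belongs to the partition
    simply because it sits in the partition's directory.
    """
    return sorted(
        (p.name, p.stat().st_size)
        for p in Path(directory).iterdir()
        if p.is_file()
    )
```

For a directory such as `mytable/dt=2021-06-10/` (an invented example path), `parse_partition` would give `{"dt": "2021-06-10"}`, and any file still open for writing would appear in the listing with a size of 0 next to the completed files.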

In later posts, we'll look at Apache Iceberg as an alternative to Hive Metastore to avoid this issue.

In the next posts, we'll set up a test environment to show this happening.

Andrew Coleman

Thursday, June 10, 2021
