
Stack:

Trino version: 395

Storage: Alluxio with AWS S3

Metadata store: AWS Glue

I have a daily Spark job that saves Parquet files with 3 partition keys (year, month, day) to S3, and all of the data is then synchronized to Alluxio. However, even though I have checked that the data exists in both S3 and Alluxio, I can't query the latest data until I manually call system.sync_partition_metadata() every time. This is how I create the table:

create table glue.table_tc.table_name (
    col1 varchar, 
    col2 varchar, 
    col3 varchar, 
    col4 varchar, 
    col5 bigint, 
    year int, 
    month int, 
    day int
) with (
    format='parquet', 
    partitioned_by=array['year', 'month', 'day'], 
    external_location='alluxio://path/to/table');
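After each daily load, I have to run the procedure manually before the new partition shows up; for example (shown here with ADD mode, which registers partitions that exist in storage but are not yet in the metastore):

call glue.system.sync_partition_metadata(
    schema_name => 'table_tc',
    table_name => 'table_name',
    mode => 'ADD');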

I initially thought this was caused by metadata caching being out of sync. Therefore, I tried setting hive.metastore-cache.cache-partitions to false to avoid caching, and I also tried shortening hive.metastore-refresh-interval to 5s, but neither worked.
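For reference, the relevant part of the catalog configuration with those attempts looks roughly like this (the file name and surrounding values are an illustration of our setup, not the exact file):

etc/catalog/glue.properties:
connector.name=hive
hive.metastore=glue
# attempts to keep partition metadata fresh; neither helped
hive.metastore-cache.cache-partitions=false
hive.metastore-refresh-interval=5s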

May I know how to synchronize the metadata / partition values automatically? Did I miss something? Thank you so much for your help!

Jonathan Lam
  • you can programmatically schedule a run of system.sync_partition_metadata() (a cron-based sketch is shown after these comments) or run a Glue crawler over S3 so that partitions can be populated automatically. – Prabhakar Reddy Nov 08 '22 at 12:22
  • @PrabhakarReddy Thanks for your reply. On the other hand, to minimize the workload we have to maintain, is there anything we can do at the configuration / infrastructure level? – Jonathan Lam Nov 14 '22 at 01:10
  • sync_partition_metadata will only add partitions that are not present in the table. Hive msck repair or alter table add partition can reduce the load on Trino, as this would be done via Hive. – Prabhakar Reddy Nov 14 '22 at 02:27
  • @PrabhakarReddy Thanks for your reply. We use S3 as our data storage and we didn't build Hive on top of it; basically we use Trino to query the storage layer directly. On the other hand, based on my understanding, the Hive connector in Trino only uses the data and its metadata, and it does not use any HiveQL or any part of Hive's execution environment. I'm not sure your suggestion works in our case, since I would have to build Hive and write the data into Hive, instead of querying S3 / Alluxio directly? – Jonathan Lam Nov 14 '22 at 03:01
  • Can you add the Trino Hive configuration to the question, please? – JIST Jan 28 '23 at 16:28
  • @JIST thanks for your advice, but I don't have enough reputation to create a new tag. – Jonathan Lam Jan 30 '23 at 08:57
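Following up on the first comment's suggestion, a minimal sketch of scheduling that call is a cron entry that runs it through the Trino CLI after the daily Spark job finishes; the schedule, coordinator URL, and ADD mode below are assumptions, not values from the question:

# hypothetical cron entry: sync new partitions shortly after the daily Spark job completes
30 2 * * * trino --server http://trino-coordinator:8080 --catalog glue --schema table_tc --execute "CALL system.sync_partition_metadata('table_tc', 'table_name', 'ADD')"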

0 Answers