
Parquet files are being dumped into an S3 bucket every minute. I have 6 months of data, which comes to more than 100k small Parquet files, all with the same schema. Now I am writing a program to merge all these files. I tried appending one dataframe to another using pandas, but that obviously does not seem to be the right way. I'm just wondering what an efficient approach would be.

1 Answer


I have done this in the past by using Amazon Athena to query all the files and then save the result in a new table, with the data being stored in a new location.

I start by creating a table that points to the existing data. You can either do this manually or use an AWS Glue crawler to define the table based on the existing data files.
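For the manual route, the table definition might look roughly like the following sketch. The table name (first_table), the column list, and the S3 path are assumptions here; replace them with the actual schema and bucket location.

    -- Table that points at the existing small Parquet files.
    -- Assumed names: first_table, the example columns, and the S3 path.
    CREATE EXTERNAL TABLE first_table (
      id        string,
      event_ts  timestamp,
      value     double
    )
    STORED AS PARQUET
    LOCATION 's3://your-bucket/raw/';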

Then, I use CREATE TABLE AS (CTAS) in Amazon Athena to define a new table with an appropriate output location, and use SELECT * FROM first_table to extract the data from the existing files (see the sketch below).

As per advice on AWS Athena - merge small parquet files or leave them?, use the bucketed_by and bucket_count properties to control exactly how many resulting files are generated.
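A minimal CTAS sketch combining both points, assuming the first_table definition above; the new table name, the output location, the bucketing column (id), and the bucket count of 10 are placeholders to adjust:

    -- New table written as a small, fixed number of Parquet files.
    -- Assumed names: merged_table, the output path, and the bucketing column.
    CREATE TABLE merged_table
    WITH (
      format            = 'PARQUET',
      external_location = 's3://your-bucket/merged/',
      bucketed_by       = ARRAY['id'],   -- column to bucket on
      bucket_count      = 10             -- roughly the number of output files
    ) AS
    SELECT * FROM first_table;

In an unpartitioned table, each bucket ends up as one output file, so bucket_count effectively sets how many merged files are produced.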

John Rotenstein