I have a lot of small parquet files that are read via AWS Glue into Athena. I know that small parquet files (35k or so each due to the way the log outputs them) are not ideal but once they are read into the data catalog, does it matter anymore?

In other words, should I go through the exercise of merging all the small parquet files into more ideally sized files prior to loading into Athena?

John Rotenstein
Rob M

1 Answer
You continue to pay a price for small files even after they've been registered with the data catalog. When you query a table backed by many small files, Athena has to open, read, and close each object, adding per-file overhead to the scan. Although the total amount of data you scan may be comparable, scanning it from chunkier files means less overhead for the query engine (Presto).

Reference: https://docs.aws.amazon.com/athena/latest/ug/performance-tuning.html - note that it also mentions S3 request throttling can bite you when you have lots of small files.

Also, Parquet files carry footer metadata (column statistics per row group) that the query engine can use to skip whole files or jump to the right spots within a file. With many tiny files, each file holds so little data that this metadata buys you much less.

It's easy enough to convert the small files into chunkier ones via a CTAS statement that I'd recommend doing it. Anecdotally, queries against my batched files run noticeably faster.
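For example, a minimal CTAS of that shape might look like the following. The database, table names, and S3 path here are placeholders, not from the question:

```sql
-- Rewrite the many small files into larger Parquet files at a new location
CREATE TABLE my_db.logs_compacted
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/logs-compacted/'
) AS
SELECT *
FROM my_db.logs_small_files;
```

After it completes, you query `logs_compacted` instead of the original table; the raw files stay untouched in their original location.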

kylejmcintyre
  • Thanks for the answer. From what I've seen, probably the best way to merge all the small parquet files will be to use an EMR cluster with Spark. Anything better? – Rob M Mar 28 '21 at 15:13
  • No need for that! Just do a CREATE TABLE AS SELECT with the same schema definition but a different external file location and format. https://docs.aws.amazon.com/athena/latest/ug/ctas.html – kylejmcintyre Mar 29 '21 at 03:11
  • I should also mention, you can use the bucketed_by and bucket_count properties to control exactly how many resulting files there are, or you can use a partition scheme as long as the resulting number of partitions <=100. Your files will be aggregated, but you have less control as compared to specifying the # of buckets exactly – kylejmcintyre Mar 29 '21 at 03:19
  • Just wanted to add - the CTAS queries worked great. Super solution which allows the raw data files to remain unmanipulated and archived after they've been combined via the query. – Rob M Mar 30 '21 at 23:30
  • If we add more data to S3 that we want included in the CTAS table, do we need to run the CTAS all over again? Or is there a way to append to the CTAS table while preserving partitioning, if any? – Abhishek Malik Sep 09 '21 at 16:41
  • It's possible to add to it. You can query out the external location of the CTAS table and manually put more files there. However, just creating a new table might be the more common approach - depending on data sizes obviously – kylejmcintyre Sep 09 '21 at 20:05
  • @AbhishekMalik , you can use INSERT INTO statement to add more records to the CTAS table. However, Athena doesn't support INSERT statements on bucketed tables. – Anush Surendran Oct 19 '21 at 12:30
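Putting the comments above together, a bucketed variant of the CTAS (again with placeholder names and a hypothetical bucketing column) would look like this. Per the last comment, note that Athena does not support `INSERT INTO` on bucketed tables, so pick bucketing only if you won't need to append:

```sql
-- Control the exact number of output files with bucketed_by / bucket_count
CREATE TABLE my_db.logs_compacted
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/logs-compacted/',
  bucketed_by = ARRAY['request_id'],  -- hypothetical column
  bucket_count = 10                   -- roughly 10 output files
) AS
SELECT *
FROM my_db.logs_small_files;
```

An unbucketed CTAS table, by contrast, can be appended to later with `INSERT INTO my_db.logs_compacted SELECT ...`, at the cost of less control over the resulting file count.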