I have an AWS Glue job that I created using the visual job editor. The job reads data from S3 through the Glue Data Catalog with Spark, aggregates the data, and stores it in new S3 objects partitioned by day. The output data will be queried later.
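For context, this is roughly what the generated script does (the database, table, column names, and output path below are placeholders, not my actual values):

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table through the Glue Data Catalog
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",   # placeholder
    table_name="my_table",    # placeholder
)

# Aggregate with Spark
df = dyf.toDF()
aggregated_df = df.groupBy("day", "key").agg({"value": "sum"})

# Write the result partitioned by day
(aggregated_df
    .write.mode("append")
    .partitionBy("day")
    .parquet("s3://my-output-bucket/aggregated/"))  # placeholder path

job.commit()
```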

I see that the job outputs a lot of small objects (around 50 objects of 2 KB each) into each partition on each run. There are 4 runs a day, so about 200 objects per day.

I understand that having many small objects is not recommended, so is there a way to prevent so many objects from being created? Or should I just leave it and not worry about it?

I have read about coalesce/repartition, but I don't want to hardcode the number of partitions, since the size of the input data can change between runs. Something like the sketch below is what I had in mind, but I'm not sure it's the right approach.
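In this sketch, the bucket name, prefix, and the 128 MB target are placeholders, `estimate_input_bytes` is a hypothetical helper, and the input size is only a rough proxy for the size of the aggregated output:

```python
import math
import boto3

def estimate_input_bytes(bucket, prefix):
    """Hypothetical helper: sum the sizes of the input objects under an S3 prefix."""
    s3 = boto3.client("s3")
    total = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            total += obj["Size"]
    return total

# Aim for output files of roughly 128 MB; the exact target is a judgment call.
TARGET_FILE_BYTES = 128 * 1024 * 1024

input_bytes = estimate_input_bytes("my-input-bucket", "raw/")  # placeholders
num_partitions = max(1, math.ceil(input_bytes / TARGET_FILE_BYTES))

# coalesce avoids a full shuffle; with N tasks, the write produces at most
# N files per day partition instead of one file per original task.
(aggregated_df
    .coalesce(num_partitions)
    .write.mode("append")
    .partitionBy("day")
    .parquet("s3://my-output-bucket/aggregated/"))  # placeholder path
```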

Example of the output files (screenshot omitted; there are 55 small files):

Thanks.

