I have an AWS Glue job that I created using the visual job editor. The job reads data from S3 through the Glue Data Catalog with Spark, aggregates the data, and stores it in new S3 objects partitioned by day. The output data will be queried later.
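For context, this is roughly what the generated script does (the database, table, column names, and output path below are placeholders, not my actual values):

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source table through the Glue Data Catalog
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",   # placeholder
    table_name="my_table",    # placeholder
)

# Aggregate with Spark
df = dyf.toDF()
aggregated_df = df.groupBy("day", "key").agg({"value": "sum"})

# Write the result partitioned by day
(aggregated_df
    .write.mode("append")
    .partitionBy("day")
    .parquet("s3://my-output-bucket/aggregated/"))  # placeholder path

job.commit()
```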

I see that the job outputs a lot of small objects (around 50 objects of 2 KB each) into each partition on each run. There are 4 runs a day, so about 200 objects per day.

I understand that having many small objects is not recommended, so is there a way to prevent so many objects from being created? Or should I just leave it and not worry about it?

I have read about coalesce/repartition, but I don't want to hardcode the number of partitions, since the size of the input data can change between runs. Something like the sketch below is what I had in mind, but I'm not sure it's the right approach.
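In this sketch, the bucket name, prefix, and the 128 MB target are placeholders, `estimate_input_bytes` is a hypothetical helper, and the input size is only a rough proxy for the size of the aggregated output:

```python
import math
import boto3

def estimate_input_bytes(bucket, prefix):
    """Hypothetical helper: sum the sizes of the input objects under an S3 prefix."""
    s3 = boto3.client("s3")
    total = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            total += obj["Size"]
    return total

# Aim for output files of roughly 128 MB; the exact target is a judgment call.
TARGET_FILE_BYTES = 128 * 1024 * 1024

input_bytes = estimate_input_bytes("my-input-bucket", "raw/")  # placeholders
num_partitions = max(1, math.ceil(input_bytes / TARGET_FILE_BYTES))

# coalesce avoids a full shuffle; with N tasks, the write produces at most
# N files per day partition instead of one file per original task.
(aggregated_df
    .coalesce(num_partitions)
    .write.mode("append")
    .partitionBy("day")
    .parquet("s3://my-output-bucket/aggregated/"))  # placeholder path
```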

Example of the output files (screenshot omitted; there are 55 small files):

Thanks.

