We had the same problem with one of our customers, who had millions of small files in AWS S3. The crawler would practically stall: it made no progress and kept running indefinitely. We came up with the following alternative approach:
1. A custom Glue Python Shell job was written that uses AWS Wrangler to run queries against AWS Athena.
2. The Python Shell job lists the contents of the folder s3:///event_date=<Put the Date Here from #2.1>
3. For the folders found, it fires the following query:
   alter table add partition (event_date='<event_date from above>', eventname='<each eventname derived from the S3 list output above>')
4. The job is triggered to run after the main ingestion job via Glue Workflows.
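
A minimal sketch of what such a Python Shell job could look like, not the exact code that was deployed. It assumes the folders under each event_date are named eventname=<value> and that the table location matches that Hive-style layout, so no LOCATION clause is needed. The function names, the `base_path`, `database` and `table` parameters, and the use of IF NOT EXISTS (so reruns do not fail) are illustrative assumptions.

```python
from typing import List, Optional


def extract_eventnames(prefixes: List[str]) -> List[str]:
    """Pull eventname values out of S3 prefixes such as
    's3://bucket/path/event_date=2024-01-01/eventname=click/'.
    Prefixes that do not look like 'eventname=<value>' are skipped."""
    names = []
    for prefix in prefixes:
        last = prefix.rstrip("/").rsplit("/", 1)[-1]
        key, sep, value = last.partition("=")
        if key == "eventname" and sep and value:
            names.append(value)
    return sorted(set(names))


def build_add_partition_sql(table: str, event_date: str, eventnames: List[str]) -> str:
    """Build one ALTER TABLE statement that adds every partition at once.
    Values are not escaped here; this assumes trusted folder names."""
    clauses = [
        f"PARTITION (event_date='{event_date}', eventname='{name}')"
        for name in eventnames
    ]
    return f"ALTER TABLE {table} ADD IF NOT EXISTS " + " ".join(clauses)


def register_partitions(base_path: str, database: str, table: str,
                        event_date: str) -> Optional[str]:
    """List the eventname folders for one event_date and register them in Athena."""
    # Imported here so the helpers above can be tested without AWS access.
    import awswrangler as wr

    date_path = f"{base_path.rstrip('/')}/event_date={event_date}/"
    prefixes = wr.s3.list_directories(date_path)
    eventnames = extract_eventnames(prefixes)
    if not eventnames:
        return None  # nothing to register for this date

    sql = build_add_partition_sql(table, event_date, eventnames)
    query_id = wr.athena.start_query_execution(sql=sql, database=database)
    wr.athena.wait_query(query_execution_id=query_id)  # wait for the query to finish
    return query_id
```

Listing only the top-level folders for a single date keeps the S3 calls cheap, and batching all partitions into one statement avoids firing one Athena query per eventname.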