I'm using Spark (2.4) to process I data being stored on S3.
I'm trying to understand if there's a way to spare the listing of the objects that I'm reading as my batch job inputs (I'm talking about ~1M )
I know about S3Guard that stores the objects metadata, and thought that I can use it for skipping the S3 listing.
I've read this Cloudera's blog
Note that it is possible to skip querying S3 in some cases, just serving results from the Metadata Store. S3Guard has mechanisms for this but it is not yet supported in production.
I know it's quite old , is it already available in production?