
I'm using Spark (2.4) to process data stored on S3.

I'm trying to understand whether there's a way to avoid listing the objects that I read as my batch job's inputs (I'm talking about ~1M objects).

I know about S3Guard, which stores object metadata, and thought I could use it to skip the S3 listing.
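For context, this is roughly how I'm wiring S3Guard into the job, a minimal sketch assuming the usual `spark.hadoop.` prefix for passing s3a options through Spark; the bucket, DynamoDB table, and region names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: point the s3a connector at a DynamoDB-backed S3Guard metadata
// store. Table/region/bucket names below are placeholders, not real values.
val spark = SparkSession.builder()
  .appName("s3guard-sketch")
  .config("spark.hadoop.fs.s3a.metadatastore.impl",
    "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
  .config("spark.hadoop.fs.s3a.s3guard.ddb.table", "my-s3guard-table")
  .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "us-east-1")
  .getOrCreate()

// Reads of s3a:// paths now consult the metadata store during listings
val df = spark.read.parquet("s3a://my-bucket/input/")
```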

I've read this Cloudera blog post, which says:

"Note that it is possible to skip querying S3 in some cases, just serving results from the Metadata Store. S3Guard has mechanisms for this but it is not yet supported in production."

I know the post is quite old; is this feature available in production yet?


1 Answer


As of July 2019 it is still tagged as experimental; HADOOP-14936 tracks the remaining tasks.

The recent work has mostly been on corner cases you aren't going to encounter on a daily basis, but which we know exist and can't ignore.

The specific feature you are talking about, "auth mode", relies on all clients using S3Guard and updating the tables, and on us being confident that we can handle the failure conditions for consistency.
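For reference, auth mode is switched on through Hadoop configuration; a minimal sketch on top of an existing S3Guard setup, again using the `spark.hadoop.` prefix (table name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enabling the experimental authoritative ("auth") mode.
// Only safe if *every* client writing to the bucket goes through S3Guard.
val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.metadatastore.impl",
    "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
  .config("spark.hadoop.fs.s3a.s3guard.ddb.table", "my-s3guard-table")
  // Serve directory listings from the metadata store alone,
  // skipping the S3 LIST calls entirely
  .config("spark.hadoop.fs.s3a.metadatastore.authoritative", "true")
  .getOrCreate()
```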

For a managed table, I'd say Hadoop 3.3 will be ready to use this. For Hadoop 3.2, it's close. Really, more testing is needed.

In the meantime, if you can't reduce the number of files in S3, can you at least make sure you don't have a deep directory tree? It's that recursive directory scan which really suffers against it.
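To illustrate with hypothetical paths (not from the question): each extra directory level costs another round of LIST calls during the treewalk, so a flatter layout encoding the same information keeps the scan shallow.

```scala
// Deep partition tree: one LIST per directory level on the recursive walk
val deep = spark.read.parquet("s3a://my-bucket/events/year=2019/month=07/day=15/")

// Flatter layout: same information, far fewer directory levels to scan
val flat = spark.read.parquet("s3a://my-bucket/events/dt=2019-07-15/")
```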
