
Below is what my S3 bucket folder structure looks like:

s3://s3bucket/folder1/morefolders/$folder_which_I_want_to_pick_latest/

$folder_which_I_want_to_pick_latest - every new folder that arrives here gets an incrementing number in its name, like randomnumber_timestamp

Is there a way I can automate this so that Spark (in Scala) always reads the most recent folder in S3?

jay
  • Not voting to close but this could be an answer: https://stackoverflow.com/questions/50526076/find-latest-file-pyspark – blackbishop Jan 28 '20 at 12:34
  • That might be a good example for picking the latest file, but I am looking for the latest folder, which contains Parquet files. Also, I would need a Scala solution – jay Jan 28 '20 at 21:47

2 Answers


The best way to work with that kind of "behavior" is to structure your data using a partitioned approach, like year=2020/month=02/day=12, where every partition is a folder (in the AWS console). That way you can use a simple filter in Spark to determine the latest one, as sketched below. (More info: https://www.datio.com/iaas/understanding-the-data-partitioning-technique/)
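For example, a minimal sketch of that filter in Scala, assuming the data has been rewritten under year/month/day partitions and stored as Parquet (the bucket path and partition column names are placeholders, not from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ReadLatestPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-latest-partition").getOrCreate()

    // Spark discovers year/month/day as partition columns from the folder names.
    val df = spark.read.parquet("s3://s3bucket/folder1/morefolders/")

    // Find the most recent partition; partition pruning on the filter means
    // only that folder is actually scanned.
    val latest = df.select(col("year"), col("month"), col("day"))
      .distinct()
      .orderBy(col("year").desc, col("month").desc, col("day").desc)
      .first()

    val latestDf = df.filter(
      col("year") === latest.getAs[Any]("year") &&
      col("month") === latest.getAs[Any]("month") &&
      col("day") === latest.getAs[Any]("day")
    )

    latestDf.show()
  }
}
```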

However, if you are not allowed to restructure your bucket, the solution can be costly if you don't have a specific identifier and/or reference you can use to find your newest folder. Remember that S3 has no real concept of folders, only object keys (the / is what the AWS console visualizes as folders), so determining the highest incremental id in $folder_which_I_want_to_pick_latest means listing all the objects stored under that prefix, and every object request in S3 costs. More info: https://docs.aws.amazon.com/AmazonS3/latest/user-guide/using-folders.html.
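If restructuring really is off the table, here is a rough sketch of that listing approach in Scala, using the Hadoop FileSystem API that Spark already uses for S3 paths. It assumes the timestamp part of randomnumber_timestamp is a plain epoch number, which is not stated in the question, so adjust the parsing to the real naming scheme:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

object FindLatestFolder {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("latest-folder").getOrCreate()

    val base = new Path("s3a://s3bucket/folder1/morefolders/")
    val fs   = base.getFileSystem(spark.sparkContext.hadoopConfiguration)

    // Each "folder" is really a key prefix; this listing is the part that
    // issues LIST requests against S3 and gets costly with many objects.
    val latest = fs.listStatus(base)
      .filter(_.isDirectory)
      .maxBy(status => status.getPath.getName.split("_").last.toLong)

    spark.read.parquet(latest.getPath.toString).show()
  }
}
```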


Here's one option. Consider writing a Lambda function that either runs on a schedule (say, if you know that your uploads always happen between 1pm and 4pm) or is triggered by an S3 object upload (so it runs for every object uploaded to folder1/morefolders/).

The Lambda would write the relevant part(s) of the S3 object prefix into a simple DynamoDB table. The client that needs to know the latest prefix would read it from DynamoDB.
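A minimal sketch of the reading side in Scala with the AWS SDK for Java v1, assuming a hypothetical DynamoDB table named latest_prefix whose single item's prefix attribute the Lambda keeps updated (the table, key, and attribute names here are illustrative, not part of the answer):

```scala
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, GetItemRequest}
import org.apache.spark.sql.SparkSession

object ReadLatestFromDynamo {
  def main(args: Array[String]): Unit = {
    val dynamo = AmazonDynamoDBClientBuilder.defaultClient()

    // Look up the single item the Lambda maintains with the newest prefix.
    val key = java.util.Collections.singletonMap("id", new AttributeValue("latest"))
    val request = new GetItemRequest()
      .withTableName("latest_prefix")
      .withKey(key)

    // e.g. "folder1/morefolders/<newest-folder>/" written by the Lambda.
    val prefix = dynamo.getItem(request).getItem.get("prefix").getS

    val spark = SparkSession.builder().appName("read-latest").getOrCreate()
    spark.read.parquet(s"s3://s3bucket/$prefix").show()
  }
}
```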

jarmod