
I tried using this post to look for the last modified file then awk for the folder it's contained in: Get last modified object from S3 using AWS CLI

But this isn't ideal for over 1000 folders and, per the documentation, shouldn't even work at that scale. I have 2000+ folder objects I need to search through. My desired folder will always begin with a D followed by an incrementing number. Ex: D1200

The results from the answer led me to creating this call which works:

aws s3 ls main.test.staging/General_Testing/Results/ --recursive | sort | tail -n 1 | awk '{print $4}'

but it takes over 40 seconds to search through thousands of files, and I then need to regex-parse the output to find the folder object rather than the last file modified within it. Also, if I try to do this to find my desired folder (which is the object directly under the Results object):

aws s3 ls main.test.staging/General_Testing/Results/ | sort | tail -1

Then my output will be D998 because the sort function will order folder names like this:

D119
D12
D13

That's because, compared as strings, D12 sorts after D119: the 2 in its second digit position is greater than D119's 1. With that ordering there's no way I can use that call to reliably retrieve the highest-numbered folder, and therefore the last one created. Something to note is that folder objects that contain files do not have a Last Modified value that one can query.
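For example, a quick local demo (just made-up names in the same pattern, no S3 involved) shows the plain string sort order:

printf 'D12\nD119\nD13\nD1200\n' | sort

which prints:

D119
D12
D1200
D13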

To be clear about my question: what call can I use to look through a large number of S3 objects to find the highest-numbered folder object? Preferably the answer is fast, works with 1000+ objects, and won't require a regex breakdown.

croakPedlar
  • Sorry, so what is your question or issue? It's not very clear what you are struggling with. Large number of files? Lack of `Last Modified`? Wrong sorting order? – Marcin Feb 03 '22 at 02:33
  • @Marcin please tell me how I can be clearer, but I'm looking for a call to find my desired folder name that doesn't take too long. The note is just so people don't give me "query for last modified" as an answer. Although, if they know AWS, maybe I should assume they already know that. – croakPedlar Feb 03 '22 at 07:53

1 Answer


I wonder whether you can use a list of CommonPrefixes to overcome your problem of having so many folders.

Try this command:

aws s3api list-objects-v2 --bucket main.test.staging --delimiter '/' --prefix 'General_Testing/Results/' --query CommonPrefixes --output text

(Note that it uses s3api rather than s3.)

It should provide a list of 'folders'. I don't know whether it has a limit on the number of 'folders' returned.

As for D119 sorting before D12, this is because it is sorting strings. The output is perfectly correct when sorting strings.

To sort by the number portion, you can likely use "version sorting". See: How to sort strings that contain a common prefix and suffix numerically from Bash?
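For example, here is a sketch that narrows the listing to the D folders, flattens the text output, and version-sorts it. This assumes GNU sort (for `-V`) and that every prefix has the form `General_Testing/Results/D<number>/`:

aws s3api list-objects-v2 --bucket main.test.staging --delimiter '/' --prefix 'General_Testing/Results/D' --query 'CommonPrefixes[].Prefix' --output text | tr '\t' '\n' | sort -V | tail -n 1

With `--output text` the prefixes come back tab-separated on a single line, hence the `tr` before sorting.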

John Rotenstein
  • I'm going to upvote because it's a good answer, but this is equivalent to the "working" call that I have in the question because it returns `General_Testing/Results/D###`. Therefore, it's an answer I still have to regex-parse to grab the last `D` folder object. – croakPedlar Feb 03 '22 at 21:36
  • You could use `--prefix 'General_Testing/Results/D'` and it will only return objects with a `D`. You would then need to extract the number portion of the CommonPrefix and sort as numbers. It might be easier to do in a language like Python rather than shell. – John Rotenstein Feb 03 '22 at 21:47
  • Going to mark as answer because what worked for me was building off your "version sorting" suggestion and doing `aws s3 ls main.test.staging/General_Testing/Results/ | sort -V | tail -1`, which worked remarkably well for over 2000 folders even though, by the documentation, I don't think it's supposed to. Either way, thank you for your help. – croakPedlar Feb 04 '22 at 21:14