I have a camera that adds new files to my AWS S3 bucket every hour, except when it doesn't. For rapid troubleshooting, I'd like to be able to find (either list or view) the most recent file in the S3 folder, or list all of the files since a particular date/time. FWIW, the file names are made up of UNIX epoch date stamps, so I could look for file names that contain a number bigger than, say, 161315000.

The only solution I have so far is making a listing of all of the files, piped to a text file, which I can then parse. This takes too long...I have tens of thousands of files.

I'd be happy to use AWS CLI, s3cmd, Boto... whatever works.

Chris Sherwood

2 Answers


Rather than using the filename ("Key"), you could simply use the LastModified date that S3 automatically attaches when an object is created.

To list the most-recent object based on this date, you could use:

aws s3api list-objects --bucket my-bucket --query 'sort_by(Contents, &LastModified)[-1].Key' --output text

To list objects since a given date (in UTC timezone, I suspect):

aws s3api list-objects --bucket my-bucket --query "Contents[?LastModified>='2021-01-29'].[Key]" --output text

If you wish to do it via Python, you will need to retrieve a list of ALL objects (the API returns at most 1000 per call), and you can then parse either the object Key or the LastModified date.
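A minimal boto3 sketch of that approach (the bucket name is a placeholder; `latest_key` is a hypothetical helper, not part of boto3):

```python
def latest_key(objects):
    """Return the Key of the most recently modified object from a list of
    dicts shaped like the 'Contents' entries of list_objects_v2, or None."""
    if not objects:
        return None
    return max(objects, key=lambda o: o["LastModified"])["Key"]

def list_all_objects(bucket, prefix=""):
    """Retrieve ALL objects under bucket/prefix. list_objects_v2 returns at
    most 1000 objects per call, so a paginator is needed for large buckets."""
    import boto3  # imported here so latest_key() is usable without AWS access
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    objects = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        objects.extend(page.get("Contents", []))
    return objects

# Example (placeholder bucket):
# print(latest_key(list_all_objects("my-bucket")))
```

Note this still transfers the full listing client-side, just like the CLI commands above; with 120,000 objects that is ~120 API calls.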

John Rotenstein
  • The commands above will also list all objects and will only filter them client-side afterwards, so they aren't really viable for S3 buckets with large amounts of objects. – Dunedan Feb 13 '21 at 05:15
  • That is correct. All filtering is done 'locally'. The AWS CLI is actually just a Python program, with access to the same API calls as any program using boto3. – John Rotenstein Feb 13 '21 at 06:14
  • Is there a way to do this locally (on S3) and then move the smaller results? Like: ls -la | tail > lastentries.txt ? My current listing of the S3 folder is 7 MB and >120,000 lines long. Of course, I could reduce that if the AWS ls command accepted --exclude arguments, but it doesn't, right? (wtf?) – Chris Sherwood Feb 14 '21 at 15:54
  • No, this is not possible. The `ListObjects()` API call returns 1000 objects at a time, so with 120,000 objects this would require 120 API calls. Alternative methods are to use [Amazon S3 Inventory](https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html) (as suggested by @Dunedan), or to maintain your own database of objects -- this would involve consulting the database for object lookups rather than directly querying the bucket. – John Rotenstein Feb 14 '21 at 22:43
  • Is there a way to query between two dates, such as LastModified>='2021-01-15' && LastModified<='2021-01-30' ? – anup Feb 15 '23 at 06:02
  • Also, how can I restrict it to a folder in the bucket. I tried --prefix, didn't work. – anup Feb 15 '23 at 06:31
  • @anup JMESPath does support an [AND expression](https://jmespath.org/specification.html#and-expressions) using `&&`. Try: `[?LastModified>='2021-01-15' && LastModified<='2021-01-30']` – John Rotenstein Feb 15 '23 at 08:53
  • @anup The `--prefix` should restrict it to a folder. If it is not working for you, please create a new question with full details of the command you are using and the resulting output. – John Rotenstein Feb 15 '23 at 09:02

That's something you can't do with S3 alone, as S3 isn't a file system but an object store. As such, it's optimized for storing large numbers of objects, not for rapid listing.

If you have control over the format of the object keys, you could prefix them with the current date (like 2021/02/11/161315000). That'd make it easy to find the latest object if you're only looking it up manually for debugging purposes.

If changing the format of the object keys isn't an option, you have to resort to more complex options.

There are S3 Inventory reports, which do provide a listing of all objects and their last-modified times, but those probably won't work for you either: the reports are only generated once per day and might not include recently added objects.

An alternative, which might fit better for your use case, would be to utilize S3 event notifications for newly created objects to trigger an AWS Lambda function. This AWS Lambda function could then store the S3 key of the last modified object somewhere (like logging it to Amazon CloudWatch where you could simply check the latest log records for the most recently created S3 object).
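A minimal sketch of such a Lambda handler (the log message format and return shape are illustrative choices, not prescribed by AWS):

```python
import json
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    """Handle S3 'ObjectCreated' event notifications: log the key of each
    newly created object. print() output lands in CloudWatch Logs, so the
    latest log entries always show the most recently uploaded objects."""
    keys = []
    for record in event.get("Records", []):
        # Keys in S3 event notifications are URL-encoded
        key = unquote_plus(record["s3"]["object"]["key"])
        keys.append(key)
        print(f"new object: {key}")  # written to CloudWatch Logs
    return {"statusCode": 200, "body": json.dumps(keys)}
```

You'd configure the bucket's event notification for `s3:ObjectCreated:*` to invoke this function, then check the function's CloudWatch log group for the most recent uploads.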

Dunedan
  • I always hear this statement: "S3 isn't a file system, but an object store." I am sure that is correct, and may be the explanation why it is so difficult to deal with it, but surely any storage system needs useful tools for examining its contents. It completely baffles me that this is not more straightforward. – Chris Sherwood Feb 14 '21 at 15:46
  • Amazon *does* have a file storage system that is similar to the hierarchical file system on most computers. See https://aws.amazon.com/efs/when-to-choose-efs/ – Anton Dec 28 '22 at 19:43