15

Does an S3 bucket carry any information about when it was last updated? How can I find the last time any of the objects in the bucket were updated?

Mark Amery
Cory

7 Answers

11

There is no native support for a bucket-level last-modified time. The way I do it is to use the AWS CLI, sort the output, take the last line, and print the first two fields.

$ aws s3 ls mybucket --recursive | sort | tail -n 1 | cut -d ' ' -f1,2
2016-03-18 22:46:48
helloV
8

Recommendation, tl;dr

The best compromise for a simple, performant command, based on the simplistic performance test below at the time of writing, is aws s3 ls --recursive (Option #2)


3 ways to get the last modified object

1. Using s3cmd

(See s3cmd Usage, or explore the man page after installing it using sudo pip install s3cmd)

s3cmd ls --recursive s3://the-bucket | sort | tail -n 1

2. Using AWS CLI's s3

aws s3 ls the-bucket --recursive --output text | sort | tail -n 1 | awk '{print $1"T"$2","$3","$4}'

(Note that awk in the above refers to GNU awk. See this if you need to install it, or any other GNU utilities, on macOS.)


3. Using AWS CLI's s3api

(with either list-objects or list-objects-v2)

aws s3api list-objects-v2 --bucket the-bucket | jq -r '.[] | max_by(.LastModified) | [.Key, .LastModified, .Size] | @csv'

Note that both of the s3api commands are paginated, and handling pagination via a continuation token is a fundamental improvement in v2 of list-objects.

If the bucket has more than 1,000 objects (use s3cmd du "s3://the-bucket" | awk '{print $2}' to get the object count), you'll need to handle pagination of the API and make multiple calls to get back all the results, since the sort order of the returned results is UTF-8 binary order of the keys and not 'Last Modified'. (The AWS CLI handles this pagination for you by default; a raw API client must pass the continuation token between calls itself.)
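
If you are scripting the pagination yourself rather than relying on the CLI, a minimal boto3 sketch could look like the following (the last_modified_object name is mine, for illustration; it assumes default credentials):

import boto3

def last_modified_object(bucket_name: str):
    """Walk every page of list_objects_v2 and keep the newest object."""
    paginator = boto3.client('s3').get_paginator('list_objects_v2')
    newest = None
    for page in paginator.paginate(Bucket=bucket_name):
        # Pages for an empty bucket have no 'Contents' key at all.
        for obj in page.get('Contents', []):
            if newest is None or obj['LastModified'] > newest['LastModified']:
                newest = obj
    return newest  # dict with Key, LastModified, Size, ...; None if bucket is empty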


Performance comparison

Here is a simple performance comparison of the above three methods, executed against the same bucket. For simplicity, the bucket had fewer than 1,000 objects. Here is the one-liner used to measure the execution times (option #3 is timed with both list-objects and list-objects-v2, so four commands run in total):

export bucket_name="the-bucket" && \
( \
time ( s3cmd     ls --recursive           "s3://${bucket_name}"             | awk '{print $1"T"$2","$3","$4}' | sort | tail -n 1                       ) & ; \
time ( aws s3    ls --recursive           "${bucket_name}"    --output text | awk '{print $1"T"$2","$3","$4}' | sort | tail -n 1                       ) & ; \
time ( aws s3api list-objects-v2 --bucket "${bucket_name}"                  | jq  -r '.[] | max_by(.LastModified) | [.LastModified, .Size, .Key]|@csv' ) & ; \
time ( aws s3api list-objects    --bucket "${bucket_name}"                  | jq  -r '.[] | max_by(.LastModified) | [.LastModified, .Size, .Key]|@csv' ) &
) >! output.log

(output.log will store the last modified objects listed by each command. Note that the one-liner above uses zsh syntax, e.g. the >! clobber-override redirect, and won't run unmodified under bash.)

The output of the above is as follows:

( s3cmd ls --recursive ...)      1.10s user 0.10s system 79% cpu 1.512 total
( aws s3 ls --recursive ...)     0.72s user 0.12s system 74% cpu 1.128 total
( aws s3api list-objects-v2 ...) 0.54s user 0.11s system 74% cpu 0.867 total
( aws s3api list-objects ...)    0.57s user 0.11s system 75% cpu 0.900 total

For the same number of objects returned, the aws s3api calls are appreciably more performant; however, there is the additional (scripting) complexity of dealing with the pagination of the API.

Useful link(s): See Leveraging s3 and s3api to understand the difference between aws s3 and aws s3api

Ashutosh Jindal
2

As others have commented, there's no magic bit of metadata that stores this information. You just have to loop over the objects.

Code to do that with boto3:

import boto3
from datetime import datetime

def bucket_last_modified(bucket_name: str) -> datetime:
    """
    Given an S3 bucket, returns the last time that any of its objects was
    modified, as a timezone-aware datetime.
    """
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket_name)
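    # bucket.objects.all() pages through the object listing automatically,
    # so buckets with more than 1,000 keys are handled transparently.
    # Note: max() below raises ValueError if the bucket is empty.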
    objects = list(bucket.objects.all())
    return max(obj.last_modified for obj in objects)
Mark Amery
1

My workaround is to write a bucket_metadata.json file to the bucket with a "last_updated" key and a unix timestamp:

{ "last_updated": 1634243586 } 

Then whenever you update the bucket, you generate another timestamp and re-write the file.
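
For illustration, a minimal boto3 sketch of that re-write step might look like this (the touch_bucket_metadata name is mine, not part of the original workaround):

import json
import time

import boto3

def touch_bucket_metadata(bucket_name: str) -> None:
    # Overwrite bucket_metadata.json with the current unix timestamp.
    boto3.client('s3').put_object(
        Bucket=bucket_name,
        Key='bucket_metadata.json',
        Body=json.dumps({'last_updated': int(time.time())}).encode('utf-8'),
        ContentType='application/json',
    )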

seattledev
0

Leveraging the aggregation features of the aws s3api command, you can easily get some key metrics via:

aws s3api list-objects --bucket "bucket_name" --output json --query "[sum(Contents[].Size), length(Contents[]), max(Contents[].LastModified)]"

If the bucket is empty, the aggregations fail due to null values, and you will receive an error message: In function sum(), invalid type for value: None, expected one of: ['array-number'], received: "null"

If the bucket is too large, your command might get killed by the OS.
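
If either caveat bites, a boto3 paginator sketch (the bucket_stats name is mine) can compute the same three metrics one page at a time, so it neither errors on an empty bucket nor holds the entire listing in memory:

import boto3

def bucket_stats(bucket_name: str):
    # Returns (total_size, object_count, newest_last_modified);
    # newest_last_modified is None for an empty bucket.
    paginator = boto3.client('s3').get_paginator('list_objects_v2')
    total_size, count, newest = 0, 0, None
    for page in paginator.paginate(Bucket=bucket_name):
        for obj in page.get('Contents', []):
            total_size += obj['Size']
            count += 1
            if newest is None or obj['LastModified'] > newest:
                newest = obj['LastModified']
    return total_size, count, newest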

Vajk Hermecz
0

I have a bash script and a Python script that can do the job, but I find them quite slow when the S3 bucket has millions of objects, so it would be great if anyone could improve the scripts. Bash:

#!/bin/bash

bucket_name_list=$(aws s3api list-buckets --query "Buckets[].Name" --output text)
for bucket_name in $bucket_name_list
do
    last_modified_time=$(aws s3 ls "$bucket_name" --recursive --output text | sort | tail -n 1 | awk '{print $1"T"$2","$3","$4}')
    echo "${bucket_name}: ------->  ${last_modified_time}"
done

Python:

import boto3


aws_session = boto3.session.Session(profile_name='default')
s3_resource = aws_session.resource('s3')

def bucket_last_modified() -> None:
    # Print the newest LastModified timestamp among each bucket's objects.
    for s3_bucket in s3_resource.buckets.all():
        objects = list(s3_bucket.objects.all())
        if objects:
            last_modified_time = max(obj.last_modified for obj in objects)
            print('Bucket Name: ' + s3_bucket.name + ' last modified time: ' + str(last_modified_time))
        else:
            print('Bucket Name: ' + s3_bucket.name + ' is empty')

if __name__ == '__main__':
    bucket_last_modified()
Yvette Lau
-1

The Amazon S3 API spec for GET Bucket Object Versions (available at: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTBucketGETVersion.html) says that a LastModified property is returned - but I'm not sure if it gets updated on change for each object ...

  • This isn't the answer to the question that was asked. `LastModified`: "Date and time the *object* was last modified." This property is returned for *each individual object version*. It is not a single value for the bucket itself. – Michael - sqlbot Mar 19 '16 at 00:23
  • Yes, you are right - so I guess the only way is to recursively scan the whole subtree ... might be expensive – Krzysztof Kielak Mar 19 '16 at 19:33