I'm trying to download a large number of small files (50,000+) from AWS S3, and the AWS CLI sync command consistently outperforms my boto3 solution. I've started reading the AWS CLI source for ways to improve my boto3 code, but beyond replicating its use of TransferManager and TransferConfig I can't see what accounts for such a performance difference.

For example, the sync command:

aws s3 sync s3://my-bucket/folder/subfolder folder/subfolder

For 50,000 small files, this takes around 111 seconds, and that's even with the sync wrapped in a subprocess run. My comparable Python code takes nearly double that time (210 seconds):

import time

import boto3
import botocore.config
from s3transfer.manager import TransferManager
from s3transfer.manager import TransferConfig

# bucket from the sync command above
bucket = 'my-bucket'

# can get list of keys anyhow, list_objects_v2, etc
files = []

botocore_config = botocore.config.Config(max_pool_connections=10)
s3_client = boto3.client('s3', config=botocore_config)

transfer_config = TransferConfig(
   max_request_concurrency=10,
   max_request_queue_size = 1000,
   multipart_threshold = 8 * (1024 ** 2),
   multipart_chunksize = 8 * (1024 ** 2),
   max_bandwidth = None
)

transfer_config.max_in_memory_upload_chunks = 6
transfer_config.max_in_memory_download_chunks = 6

s3t = TransferManager(s3_client, transfer_config)

start = time.time()

for file in files:
    s3t.download(bucket=bucket, key=file, fileobj=file, subscribers=None)

s3t.shutdown()

end = time.time()
benchmark = end - start
print(f's3transfer {benchmark}')
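For reference, the CLI side of the comparison can be timed in much the same way. This is only a rough sketch of the kind of subprocess wrapper I mean: `time_command` is a hypothetical helper name, and the commented-out call assumes the AWS CLI is on the PATH with valid credentials.

```python
import subprocess
import time


def time_command(cmd):
    """Run cmd to completion and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    # check=False: a non-zero exit is reported to the caller, not raised.
    completed = subprocess.run(cmd, check=False)
    return time.perf_counter() - start, completed.returncode


# Assumed usage (needs the AWS CLI installed and configured):
# elapsed, code = time_command(
#     ["aws", "s3", "sync", "s3://my-bucket/folder/subfolder", "folder/subfolder"]
# )
# print(f"aws s3 sync {elapsed:.1f}s (exit code {code})")
```

The measured time includes process startup for the CLI, so if anything it slightly overstates how long the sync itself takes.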

My question is: what am I missing to make this download faster? The AWS CLI downloads the files much faster despite relying on the same low-level botocore module to make its requests.

  • Is the destination folder in sync always empty? Otherwise, you aren't downloading all 50k files, just checking for deltas. Also, what happens if you just use `s3 = boto3.client('s3')` and `s3.download_file('BUCKET_NAME', 'OBJECT_NAME', 'FILE_NAME')` and skip s3transfer altogether? I've not used s3transfer, but have used multiprocess to speed up downloads of many small files. Maybe there's just more overhead using s3transfer? – Jonathan Leon Oct 19 '21 at 00:21
  • what is `TransferManager`? – Marcin Oct 19 '21 at 02:12
  • I make sure to clean the destination folder prior to runs. The AWS CLI uses s3transfer to actually run `sync` commands, and other boto3 download examples I've seen indicate that boto3 sessions aren't thread safe, but the clients are. Additionally, `TransferManager` is a class declared by boto3 (and a version is also declared as part of s3transfer) to supposedly help increase throughput – FlamboyantBacon Oct 19 '21 at 13:37
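
The alternative raised in the first comment (calling `download_file` directly instead of going through s3transfer's `TransferManager`) could be sketched roughly as below. This is an untested-against-S3 illustration, not a measured fix: `download_all` is a hypothetical helper, it uses a thread pool rather than multiple processes, and it assumes each key should be written under `dest_dir` using the key as a relative path. It relies on the point from the comments that a single boto3 client can be shared across threads.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def download_all(s3_client, bucket, keys, dest_dir, max_workers=10):
    """Download each key via s3_client.download_file on a thread pool.

    Returns the local paths written, in the same order as keys.
    """
    def fetch(key):
        # Assumption: mirror the key's path underneath dest_dir.
        local_path = os.path.join(dest_dir, key)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3_client.download_file(bucket, key, local_path)
        return local_path

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for every download and re-raises the first failure.
        return list(pool.map(fetch, keys))


# Assumed usage, with max_pool_connections matched to max_workers:
# s3_client = boto3.client('s3', config=botocore.config.Config(max_pool_connections=10))
# download_all(s3_client, 'my-bucket', files, 'downloads')
```

Timing this with the same `time.time()` bracket as the code in the question would show whether s3transfer itself adds measurable overhead for small objects.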
