I'm trying to download a large number of small files (50,000+) from AWS S3, and I consistently find that the AWS CLI sync command outperforms my boto3-based solution. I've started looking through the AWS CLI source for ways my boto3 code could improve, but beyond replicating the TransferManager and TransferConfig, I don't see precisely what causes such a performance difference.
For example, the sync command:
aws s3 sync s3://my-bucket/folder/subfolder folder/subfolder
For 50,000 small files this takes around 111 seconds, and that's even after I've wrapped the sync in a subprocess call.
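Roughly like this (a minimal sketch of the wrapper I use for timing; the bucket and paths are the same placeholders as above):

import subprocess
import time

start = time.time()
# shell out to the CLI; bucket and local paths are placeholders
subprocess.run(
    ['aws', 's3', 'sync', 's3://my-bucket/folder/subfolder', 'folder/subfolder'],
    check=True,
)
print(f'aws cli sync {time.time() - start}')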
My similar code in Python takes nearly double that time (210 seconds):
import time

import boto3
import botocore
from s3transfer.manager import TransferManager
from s3transfer.manager import TransferConfig

bucket = 'my-bucket'
# can get the list of keys however you like -- list_objects_v2, etc.
# (see the paginator sketch after this block)
files = []

botocore_config = botocore.config.Config(max_pool_connections=10)
s3_client = boto3.client('s3', config=botocore_config)

transfer_config = TransferConfig(
    max_request_concurrency=10,
    max_request_queue_size=1000,
    multipart_threshold=8 * (1024 ** 2),
    multipart_chunksize=8 * (1024 ** 2),
    max_bandwidth=None,
)
transfer_config.max_in_memory_upload_chunks = 6
transfer_config.max_in_memory_download_chunks = 6

s3t = TransferManager(s3_client, transfer_config)

start = time.time()
for file in files:
    # download each key to a local path matching the key
    s3t.download(bucket=bucket, key=file, fileobj=file, subscribers=None)
s3t.shutdown()  # blocks until all queued transfers finish
end = time.time()

benchmark = end - start
print(f's3transfer {benchmark}')
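For reference, the files list is built with a standard list_objects_v2 paginator, something like this (the bucket and prefix are placeholders):

paginator = s3_client.get_paginator('list_objects_v2')
files = []
for page in paginator.paginate(Bucket='my-bucket', Prefix='folder/subfolder/'):
    for obj in page.get('Contents', []):
        files.append(obj['Key'])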
My question is: what am I missing to make this download more performant? The AWS CLI downloads the files much faster even though it relies on the same low-level botocore module to make its requests.