
Since aws s3 cp has an --expected-size argument to ensure files/data larger than 5 GB are uploaded successfully, how can the equivalent be set in the Python version, boto3's upload_fileobj?

I'm trying to upload a database backup as a data stream to S3 without saving it to disk, but it fails partway through with InvalidArgument: Part number must be an integer between 1 and 10000, inclusive.

I assume it's because the data stream is non-seekable, so you have to set the expected data size explicitly.

AWS CLI example:

innobackupex --stream=xbstream --compress /backup \
    | aws s3 cp - s3://backups/backup2018112 --expected-size=1099511627776

Boto3 example:

import subprocess
import boto3

innobackupexProc = subprocess.Popen([
    'innobackupex',
    '--stream=xbstream',
    '--compress',
    '/backup'
], stdout=subprocess.PIPE)

s3 = boto3.client('s3')
with innobackupexProc.stdout as dataStream:
    s3.upload_fileobj(dataStream, 'backups', 'backup2018112')
ritmas
  • Did you ever solve this? I'm streaming files anywhere up to 100GB in size to S3, and don't know the total length in advance. Some way to request a larger part-size would be great. – Mark K Cowan May 17 '22 at 14:55

1 Answer


The error is due to upload_fileobj using a default part size of 8 MiB. The file in your CLI example is 1,099,511,627,776 bytes, which, with the default part size (8,388,608 bytes), results in 131,072 parts, far beyond the maximum of 10,000 parts for an Amazon S3 multipart upload.

The maximum part size is 5 GiB, so, as long as your file is less than S3's maximum object size of 5 TiB, you can just divide your total file size by 10,000 (rounding up) to get a part size that will work. In your example, this would be 109,951,163 bytes - about 105 MiB.
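To make that arithmetic concrete, here is a small sketch. It assumes the 1 TiB figure taken from the --expected-size value in your CLI example and the 8 MiB default chunk size mentioned above:

# Sketch of the arithmetic above; the 1 TiB size is taken from the
# --expected-size value in the CLI example.
expected_size = 1099511627776          # 1 TiB
default_chunksize = 8 * 1024 * 1024    # upload_fileobj's default part size: 8 MiB

# Ceiling division: number of parts with the default chunk size
parts_with_default = -(-expected_size // default_chunksize)
print(parts_with_default)              # 131072 - far more than the 10,000 allowed

# Part size needed to fit the whole file into at most 10,000 parts
part_size = expected_size // 10000 + 1
print(part_size)                       # 109951163 bytes, roughly 105 MiB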

You can then set the part size for the multipart upload via upload_fileobj's Config parameter:

import subprocess
import boto3
from boto3.s3.transfer import TransferConfig

# Amazon S3's maximum number of parts for multipart upload
MAX_PARTS = 10000

innobackupexProc = subprocess.Popen([
    'innobackupex',
    '--stream=xbstream',
    '--compress',
    '/backup'
], stdout=subprocess.PIPE)

# expected_size holds the expected number of bytes; here it is the
# same value passed to --expected-size in the CLI example.
expected_size = 1099511627776

# Integer division plus one, rather than converting to floats and
# using math.ceil - it doesn't matter if the result is one byte
# larger than strictly necessary. Note that // (not /) is needed so
# the result is an int under Python 3.
part_size = (expected_size // MAX_PARTS) + 1

config = TransferConfig(multipart_chunksize=part_size)

s3 = boto3.client('s3')
with innobackupexProc.stdout as dataStream:
    s3.upload_fileobj(dataStream, 'backups', 'backup2018112', Config=config)
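
As an optional sanity check before kicking off a long upload, a short sketch like this (reusing expected_size, part_size, and MAX_PARTS from the code above, together with the 5 GiB per-part and 5 TiB per-object limits quoted earlier) confirms the chosen part size keeps the upload within S3's limits:

# Sketch: verify part_size against the S3 multipart limits quoted above.
MAX_PART_SIZE = 5 * 1024 ** 3      # 5 GiB per part
MAX_OBJECT_SIZE = 5 * 1024 ** 4    # 5 TiB per object

assert expected_size <= MAX_OBJECT_SIZE, "object too large for S3"
assert part_size <= MAX_PART_SIZE, "part size exceeds 5 GiB"
assert -(-expected_size // part_size) <= MAX_PARTS, "too many parts"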
metadaddy