
We are using Java 8 and the AWS SDK to programmatically upload files to AWS S3. For uploading large files (>100 MB), we read that the preferred method is multipart upload. We tried it, but it does not seem to speed things up: the upload time remains almost the same as without multipart upload. Worse, we even encountered out-of-memory errors saying the heap space is not sufficient.

Questions:

  1. Is multipart upload really supposed to speed up the upload? If not, why use it?
  2. How come multipart upload eats up memory faster than a regular upload? Does it upload all the parts concurrently?

See below for the code we used:

private static void uploadFileToS3UsingBase64(String bucketName, String region, String accessKey, String secretKey,
        String fileBase64String, String s3ObjectKeyName) {
    
    byte[] bI = org.apache.commons.codec.binary.Base64.decodeBase64((fileBase64String.substring(fileBase64String.indexOf(",")+1)).getBytes());
    InputStream fis = new ByteArrayInputStream(bI);
    
    long start = System.currentTimeMillis();
    AmazonS3 s3Client = null;
    TransferManager tm = null;

    try {

        s3Client = AmazonS3ClientBuilder.standard().withRegion(region)
                .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(accessKey, secretKey)))
                .build();
        
        tm = TransferManagerBuilder.standard()
                  .withS3Client(s3Client)
                  .withMultipartUploadThreshold((long) (50* 1024 * 1025))
                  .build();

        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setHeader(Headers.STORAGE_CLASS, StorageClass.Standard);
        PutObjectRequest putObjectRequest = new PutObjectRequest(bucketName, s3ObjectKeyName,
                fis, metadata).withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams());
        
        Upload upload = tm.upload(putObjectRequest);

        // Optionally, wait for the upload to finish before continuing.
        upload.waitForCompletion();

        long end = System.currentTimeMillis();
        long duration = (end - start)/1000;
        
        // Log status
        System.out.println("Successul upload in S3 multipart. Duration = " + duration);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (s3Client != null)
            s3Client.shutdown();
        if (tm != null)
            tm.shutdownNow();
    }

}
  • By the way, why do you use fileBase64String if you have files? Why not a `File` object or a `FileInputStream`? If the data is in a database, why is it not a blob read as a stream? That should help with the memory issues at least, and maybe with performance too. – eis Jul 06 '21 at 10:13
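
As an illustration of the comment above: if the data is available as a file on disk, TransferManager can stream and split the file itself, without holding a decoded byte[] in the heap. A minimal sketch (the path, bucket name and object key below are placeholders, and tm is the TransferManager built as in the question):

// Sketch: let TransferManager read the file from disk instead of a
// Base64-decoded in-memory byte[]. Path, bucket and key are placeholders;
// exception handling is omitted, as in the question's try/catch.
File file = new File("/tmp/payload.bin");
Upload upload = tm.upload("my-bucket", "my/object/key", file);
upload.waitForCompletion();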

2 Answers


Using multipart will only speed up the upload if you upload multiple parts at the same time.

In your code you're setting withMultipartUploadThreshold. If your upload size is larger than that threshold, you should observe concurrent uploads of separate parts; if it is not, only one upload connection is used. You say you have a >100 MB file, and in your code you have 50 * 1024 * 1025 = 52 480 000 bytes as the multipart upload threshold, so concurrent uploading of that file's parts should have been happening.
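
If you want to make the switch-over point and the part size explicit, a minimal sketch (the 10 MB part size is only an illustration, not something from your code; s3Client is the client you already build):

// Sketch: configure the multipart threshold and minimum part size explicitly.
// Parts are then uploaded by TransferManager's internal thread pool
// (10 threads by default).
TransferManager tm = TransferManagerBuilder.standard()
        .withS3Client(s3Client)
        .withMultipartUploadThreshold(50L * 1024 * 1024)  // switch to multipart above ~50 MB
        .withMinimumUploadPartSize(10L * 1024 * 1024)     // illustrative 10 MB parts
        .build();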

However, if your upload throughput is already capped by your network speed, parallel parts will not increase it. This might be why you're not observing any speedup.

There are other reasons to use multipart upload too: it is recommended for fault-tolerance reasons, and it supports a larger maximum object size than a single-operation upload (5 TB vs. 5 GB).

For more details, see the documentation:

Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.

Using multipart upload provides the following advantages:

  • Improved throughput - You can upload parts in parallel to improve throughput.

  • Quick recovery from any network issues - Smaller part size minimizes the impact of restarting a failed upload due to a network error.

  • Pause and resume object uploads - You can upload object parts over time. After you initiate a multipart upload, there is no expiry; you must explicitly complete or stop the multipart upload.

  • Begin an upload before you know the final object size - You can upload an object as you are creating it.

We recommend that you use multipart upload in the following ways:

  • If you're uploading large objects over a stable high-bandwidth network, use multipart upload to maximize the use of your available bandwidth by uploading object parts in parallel for multi-threaded performance.

  • If you're uploading over a spotty network, use multipart upload to increase resiliency to network errors by avoiding upload restarts. When using multipart upload, you need to retry uploading only parts that are interrupted during the upload. You don't need to restart uploading your object from the beginning.

eis
  • On using withMultipartUploadThreshold, how do I know the size of each part? Let's say I uploaded a 1 GB file: how many parts will there be, and will all parts be uploaded concurrently? – Alain Del Rosario Jul 06 '21 at 00:29
  • @AlainDelRosario "withMultipartUploadThreshold(Long multipartUploadThreshold). Sets the size threshold, in bytes". 1 GB is 1 073 741 824 bytes, so with 50*1024*1025 = 52 480 000 you should get 1 073 741 824 / 52 480 000 = 20.46 -> 21 parts of 52 480 000 bytes each (except the last part), which should have been sent concurrently. This assumes you're sending something like a file that can be split into parts automatically. – eis Jul 06 '21 at 08:22
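
Reproducing the arithmetic from the comment above in code (this simply assumes, as the comment does, that each part is as large as the threshold):

// Sketch: part-count arithmetic from the comment above, assuming each part
// equals the 50*1024*1025-byte threshold.
long objectSize = 1024L * 1024 * 1024;               // 1 GB = 1 073 741 824 bytes
long partSize = 50L * 1024 * 1025;                   // 52 480 000 bytes
long parts = (objectSize + partSize - 1) / partSize; // ceiling division
System.out.println(parts);                           // prints 21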

The answer from eis is fine. Still, you should take some action:

  • String.getBytes(StandardCharsets.US_ASCII) or ISO_8859_1 avoids a more costly encoding such as UTF-8. If the platform encoding were UTF-16LE, the data would even be corrupted (0x00 bytes).
  • The standard Java Base64 class has decoders/encoders that should work, and it can operate on a String directly. However, check that the input is handled correctly (line endings).
  • try-with-resources also closes the resource in case of exceptions or early returns.
  • The ByteArrayInputStream was not closed; doing so would have been better style (and may make garbage collection easier).
  • You could set the ExecutorFactory to a thread-pool factory that limits the number of threads globally (see the sketch after the code below).

So

// Use java.util.Base64 and decode only the payload after the data-URI comma,
// directly from the String (no intermediate getBytes() call needed).
byte[] bI = Base64.getDecoder().decode(
        fileBase64String.substring(fileBase64String.indexOf(',') + 1));
// try-with-resources closes the stream even when an exception is thrown
try (InputStream fis = new ByteArrayInputStream(bI)) {
    ...
}
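
For the last point, a minimal sketch of capping the number of upload threads via the ExecutorFactory (the pool size of 4 is an arbitrary illustration; the SDK's default pool uses 10 threads, and s3Client is the client from the question):

// Sketch: bound TransferManager's upload parallelism with a custom thread pool.
// Uses java.util.concurrent.Executors.
TransferManager tm = TransferManagerBuilder.standard()
        .withS3Client(s3Client)
        .withMultipartUploadThreshold(50L * 1024 * 1024)
        .withExecutorFactory(() -> Executors.newFixedThreadPool(4))
        .build();
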
Joop Eggen