4

Need to upload a large file to AWS S3 with multipart upload, streaming instead of using Lambda's /tmp. The file is uploaded, but not completely.

In my case the size of each file inside the zip cannot be predicted; a single file may be up to 1 GiB. So I use a ZipInputStream to read the archive from S3, and I want to upload the entries back to S3. Since I am running on Lambda, I cannot save the files to Lambda's /tmp because of their size, so I tried to read and upload directly to S3 with multipart upload, without touching /tmp. But the files are not written completely; I suspect each file is overwritten every time. Please review my code and help.

public void zipAndUpload() {
    byte[] buffer = new byte[1024];
    try {
        File folder = new File(outputFolder);
        if (!folder.exists()) {
            folder.mkdir();
        }

        AmazonS3 s3Client = AmazonS3ClientBuilder.defaultClient();
        S3Object object = s3Client.getObject("mybucket.s3.com", "MyFilePath/MyZip.zip");

        TransferManager tm = TransferManagerBuilder.standard()
                .withS3Client(s3Client)
                .build();

        ZipInputStream zis = new ZipInputStream(object.getObjectContent());
        ZipEntry ze = zis.getNextEntry();

        while (ze != null) {
            String fileName = ze.getName();
            System.out.println("ZE " + ze + " : " + fileName);

            File newFile = new File(outputFolder + File.separator + fileName);
            if (ze.isDirectory()) {
                System.out.println("DIRECTORY" + newFile.mkdirs());
            } else {
                filePaths.add(newFile);
                int len;
                while ((len = zis.read(buffer)) > 0) {

                    ObjectMetadata meta = new ObjectMetadata();
                    meta.setContentLength(len);
                    InputStream targetStream = new ByteArrayInputStream(buffer);

                    PutObjectRequest request = new PutObjectRequest("mybucket.s3.com", fileName, targetStream, meta);
                    request.setGeneralProgressListener(new ProgressListener() {
                        public void progressChanged(ProgressEvent progressEvent) {
                            System.out.println("Transferred bytes: " + progressEvent.getBytesTransferred());
                        }
                    });
                    Upload upload = tm.upload(request);
                }
            }
            ze = zis.getNextEntry();
        }

        zis.closeEntry();
        zis.close();
        System.out.println("Done");
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}
Shajesh

2 Answers

5

The problem is your inner while loop. You read 1024 bytes from the ZipInputStream and upload just those bytes as a new object, so instead of streaming the whole entry into S3 you overwrite the target key again and again.

The solution is a bit more involved, because you don't have one stream per file but one stream for the whole zip container. That means you can't simply do the following, because the stream would be closed by the AWS SDK after the first upload is done:

// Not possible
PutObjectRequest request = new PutObjectRequest(targetBucket, name, zipInputStream, meta);

Instead, for each ZipEntry you have to write the ZipInputStream into a PipedOutputStream and hand the matching PipedInputStream to the upload. Below is a working example.

import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;

import java.io.*;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class Pipes {
    public static void main(String[] args) throws IOException {

        Regions clientRegion = Regions.DEFAULT;
        String sourceBucket = "<sourceBucket>";
        String key = "<sourceArchive.zip>";
        String targetBucket = "<targetBucket>";

        PipedOutputStream out = null;
        PipedInputStream in = null;
        S3Object s3Object = null;
        ZipInputStream zipInputStream = null;

        try {
            AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                    .withRegion(clientRegion)
                    .withCredentials(new ProfileCredentialsProvider())
                    .build();

            TransferManager transferManager = TransferManagerBuilder.standard()
                    .withS3Client(s3Client)
                    .build();

            System.out.println("Downloading an object");
            s3Object = s3Client.getObject(new GetObjectRequest(sourceBucket, key));
            zipInputStream = new ZipInputStream(s3Object.getObjectContent());

            ZipEntry zipEntry;
            while (null != (zipEntry = zipInputStream.getNextEntry())) {

                long size = zipEntry.getSize();
                String name = zipEntry.getName();
                if (zipEntry.isDirectory()) {
                    System.out.println("Skipping directory " + name);
                    continue;
                }

                System.out.printf("Processing ZipEntry %s : %d bytes\n", name, size);

                // take the copy of the stream and re-write it to an InputStream
                out = new PipedOutputStream();
                in = new PipedInputStream(out);

                ObjectMetadata metadata = new ObjectMetadata();
                metadata.setContentLength(size);

                PutObjectRequest request = new PutObjectRequest(targetBucket, name, in, metadata);

                transferManager.upload(request);
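                // upload() returns immediately; the TransferManager reads from 'in' on a
                // background thread while copy() below pushes the entry's bytes into 'out'.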

                long actualSize = copy(zipInputStream, out, 1024);
                if (actualSize != size) {
                    throw new RuntimeException("Filesize of ZipEntry " + name + " is wrong");
                }

                out.flush();
                out.close();
            }
        } finally {
            if (out != null) {
                out.close();
            }
            if (in != null) {
                in.close();
            }
            if (s3Object != null) {
                s3Object.close();
            }
            if (zipInputStream != null) {
                zipInputStream.close();
            }
            System.exit(0);
        }
    }

    private static long copy(final InputStream input, final OutputStream output, final int buffersize) throws IOException {
        if (buffersize < 1) {
            throw new IllegalArgumentException("buffersize must be bigger than 0");
        }
        final byte[] buffer = new byte[buffersize];
        int n = 0;
        long count=0;
        while (-1 != (n = input.read(buffer))) {
            output.write(buffer, 0, n);
            count += n;
        }
        return count;
    }
}
Michel Feldheim
    I like you are using Lombok, but others might be confused by this. You should add a note that you are using Lombok or even better, rewrite your code so others can use it without installing third party tools. Elegant answer otherwise! – Christian Oct 23 '19 at 13:33
  • Thanks. I have a similar use case and it works perfectly for about 50% of it. Issues: 1. long size = zipEntry.getSize(); comes back as -1 for me, so I created the metadata like this: `ObjectMetadata metadata = new ObjectMetadata(); metadata.setContentType("application/octet-stream");` 2. I am able to unzip and upload only images; it fails for text files. Any pointers to fix this code for text files? – Srb Jan 05 '21 at 01:23
  • I have created new question to have more visibility : https://stackoverflow.com/questions/65587552/s3-stream-upload-failing-for-txt-file-but-working-for-images – Srb Jan 06 '21 at 03:41
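For the getSize() == -1 case raised in the comments above, one possible workaround (a sketch only, not part of the answer; class, method and parameter names are made up) is to avoid Content-Length altogether and drive the SDK's low-level multipart API yourself, buffering one part at a time so memory stays bounded even for large entries:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.*;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class UnknownLengthUploader {

    // S3 requires every part except the last to be at least 5 MB
    private static final int PART_SIZE = 5 * 1024 * 1024;

    /** Uploads a stream of unknown length; assumes the stream yields at least one byte. */
    public static void streamToS3(AmazonS3 s3, InputStream in, String bucket, String key) throws IOException {
        String uploadId = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
        List<PartETag> partETags = new ArrayList<>();
        byte[] part = new byte[PART_SIZE];
        int partNumber = 1;
        try {
            int filled;
            // fill one part buffer (or hit end of the entry), then upload it with a known size
            while ((filled = readFully(in, part)) > 0) {
                UploadPartRequest req = new UploadPartRequest()
                        .withBucketName(bucket)
                        .withKey(key)
                        .withUploadId(uploadId)
                        .withPartNumber(partNumber++)
                        .withInputStream(new ByteArrayInputStream(part, 0, filled))
                        .withPartSize(filled);
                partETags.add(s3.uploadPart(req).getPartETag());
            }
            s3.completeMultipartUpload(
                    new CompleteMultipartUploadRequest(bucket, key, uploadId, partETags));
        } catch (IOException | RuntimeException e) {
            // don't leave half-finished multipart uploads (which still incur storage) lying around
            s3.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, key, uploadId));
            throw e;
        }
    }

    // read until the buffer is full or the stream ends; returns the number of bytes read
    private static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0, n;
        while (total < buf.length && (n = in.read(buf, total, buf.length - total)) != -1) {
            total += n;
        }
        return total;
    }
}

In the answer's loop you would call this per entry, e.g. streamToS3(s3Client, zipInputStream, targetBucket, name), instead of building the ObjectMetadata/PutObjectRequest pair; ZipInputStream.read() returns -1 at the end of the current entry, so each call uploads exactly one entry.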
0

I faced a similar problem and solved it using the Java S3 SDK. As you say, the key here is that since the files are large you want to "stream" the content, without keeping all the data in memory or writing it to disk.

I've made a library for this purpose; it is available on Maven Central. Here is the GitHub link: nejckorasa/s3-stream-unzip

nejckorasa