
I am currently in charge of transferring a site from its current server to EC2. That part of the project is done and fine; the part I am struggling with is the images. The site currently has almost 400K images, sorted into different folders within a main userimg folder, and the client wants all of these images to be stored on S3. The main problem I have is how to transfer almost 400,000 images from the server to S3. I have been using http://s3tools.org/s3cmd, which is brilliant, but transferring the userimg folder with s3cmd is going to take almost 3 days solid, and if the connection breaks or a similar problem occurs, I am going to have some images on S3 and some not, with no way to continue the process...

Can anyone suggest a solution? Has anyone come up against a problem like this before?

David

5 Answers


I would suggest writing (or getting someone to write) a simple Java utility that:

  1. Reads the structure of your client directories (if needed).
  2. For every image, creates a corresponding key on S3 (according to the file structure read in step 1) and starts a multipart upload in parallel using the AWS SDK or the JetS3t API.

I did it for our client. It is less than 200 lines of Java code and it is very reliable. Below is the part that does a multipart upload; the part that reads the file structure is trivial (a rough sketch of it follows the two methods below).

/**
 * Uploads file to Amazon S3. Creates the specified bucket if it does not exist.
 * The upload is done in chunks of CHUNK_SIZE size (multi-part upload).
 * Attempts to handle upload exceptions gracefully up to MAX_RETRY times per single chunk.
 * 
 * @param accessKey     - Amazon account access key
 * @param secretKey     - Amazon account secret key
 * @param directoryName - directory path where the file resides
 * @param keyName       - the name of the file to upload
 * @param bucketName    - the name of the bucket to upload to
 * @throws Exception    - in case that something goes wrong
 */
public void uploadFileToS3(String accessKey
        ,String secretKey
        ,String directoryName
        ,String keyName // the local file name; also used as the S3 object key
        ,String bucketName ) throws Exception {

    // Create a credentials object and service to access S3 account
    AWSCredentials myCredentials =
        new BasicAWSCredentials(accessKey, secretKey);

    String filePath = directoryName
            + System.getProperty("file.separator")
            + keyName;

    log.info("uploadFileToS3 is about to upload file [" + filePath + "]");

    AmazonS3 s3Client = new AmazonS3Client(myCredentials);        
    // Create a list of UploadPartResponse objects. You get one of these
    // for each part upload.
    List<PartETag> partETags = new ArrayList<PartETag>();

    // make sure that the bucket exists
    createBucketIfNotExists(bucketName, accessKey, secretKey);

    // delete the file from bucket if it already exists there
    s3Client.deleteObject(bucketName, keyName);

    // Initialize.
    InitiateMultipartUploadRequest initRequest = new InitiateMultipartUploadRequest(bucketName, keyName);
    InitiateMultipartUploadResult initResponse = s3Client.initiateMultipartUpload(initRequest);

    File file = new File(filePath);

    long contentLength = file.length();
    long partSize = CHUNK_SIZE; // Set part size to 5 MB.
    int numOfParts = 1;
    if (contentLength > CHUNK_SIZE) {
        if (contentLength % CHUNK_SIZE != 0) {
            numOfParts = (int)((contentLength/partSize)+1.0);
        }
        else {
            numOfParts = (int)((contentLength/partSize));
        }
    }

    try {
        // Step 2: Upload parts.
        long filePosition = 0;
        for (int i = 1; filePosition < contentLength; i++) {
            // Last part can be less than 5 MB. Adjust part size.
            partSize = Math.min(partSize, (contentLength - filePosition));

            log.info("Start uploading part[" + i + "] of [" + numOfParts + "]");

            // Create request to upload a part.
            UploadPartRequest uploadRequest = new UploadPartRequest()
                    .withBucketName(bucketName).withKey(keyName)
                    .withUploadId(initResponse.getUploadId()).withPartNumber(i)
                    .withFileOffset(filePosition)
                    .withFile(file)
                    .withPartSize(partSize);

            // repeat the upload until it succeeds or the retry limit is reached
            boolean anotherPass;
            int retryCount = 0;
            do {
                anotherPass = false;  // assume everything is ok
                try {
                    log.info("Uploading part[" + i + "]");
                    // Upload part and add response to our list.
                    partETags.add(s3Client.uploadPart(uploadRequest).getPartETag());
                    log.info("Finished uploading part[" + i + "] of [" + numOfParts + "]");
                } catch (Exception e) {
                    log.error("Failed uploading part[" + i + "] due to exception. Will retry... Exception: ", e);
                    anotherPass = true; // repeat
                    retryCount++;
                }
            } while (anotherPass && retryCount < CloudUtilsService.MAX_RETRY);

            // give up on the whole upload if this part never made it;
            // the catch block below will then abort the multipart upload
            if (anotherPass) {
                throw new Exception("Part[" + i + "] failed after " + retryCount + " attempts");
            }

            filePosition += partSize;
            log.info("filePosition=[" + filePosition + "]");

        }
        log.info("Finished uploading file");

        // Complete.
        CompleteMultipartUploadRequest compRequest = new CompleteMultipartUploadRequest(
                bucketName,
                keyName,
                initResponse.getUploadId(),
                partETags);

        s3Client.completeMultipartUpload(compRequest);

        log.info("multipart upload completed.upload id=[" + initResponse.getUploadId() + "]");
    } catch (Exception e) {
        s3Client.abortMultipartUpload(new AbortMultipartUploadRequest(
                bucketName, keyName, initResponse.getUploadId()));

        log.error("Failed to upload due to Exception:", e);

        throw e;
    }
}


/**
 * Creates new bucket with the names specified if it does not exist.
 * 
 * @param bucketName    - the name of the bucket to retrieve or create
 * @param accessKey     - Amazon account access key
 * @param secretKey     - Amazon account secret key
 * @throws S3ServiceException - if something goes wrong
 */
public void createBucketIfNotExists(String bucketName, String accessKey, String secretKey) throws S3ServiceException {
    try {
        // Create a credentials object and service to access S3 account
        org.jets3t.service.security.AWSCredentials myCredentials =
            new org.jets3t.service.security.AWSCredentials(accessKey, secretKey);
        S3Service service = new RestS3Service(myCredentials);

        // Get the bucket, creating it first if it does not exist yet
        S3Bucket zeBucket = service.getOrCreateBucket(bucketName);
        log.info("the bucket [" + zeBucket.getName() + "] exists (it was created if necessary)");
    } catch (S3ServiceException e) {
        log.error("Failed to get or create bucket[" + bucketName + "] due to exception:", e);
        throw e;
    }
}
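
For completeness, here is a rough sketch of the directory-walking and parallel-submission part from steps 1 and 2. It is not from the original utility: the pool size, the key layout (path relative to the root folder, with '/' separators) and the timeout are assumptions. It simply collects every file under the root folder and hands each one to uploadFileToS3() on a thread pool.

/**
 * Sketch only: walks the local image tree and uploads every file in parallel,
 * reusing uploadFileToS3() above. Needs java.io.File, java.util.* and
 * java.util.concurrent.* imports in the same class.
 */
public void uploadDirectoryToS3(final String accessKey,
        final String secretKey,
        final String rootDirectory,
        final String bucketName) throws InterruptedException {

    final File root = new File(rootDirectory);
    List<File> files = new ArrayList<File>();
    collectFiles(root, files);

    // assumed pool size - tune it to your bandwidth
    ExecutorService pool = Executors.newFixedThreadPool(10);

    for (final File file : files) {
        // the key mirrors the path below the root, e.g. "2011/03/pic.jpg"
        final String key = root.toURI().relativize(file.toURI()).getPath();
        pool.submit(new Runnable() {
            public void run() {
                try {
                    // uploadFileToS3 rebuilds the local path as rootDirectory + separator + key
                    uploadFileToS3(accessKey, secretKey, rootDirectory, key, bucketName);
                } catch (Exception e) {
                    log.error("Giving up on [" + file + "]", e);
                }
            }
        });
    }

    pool.shutdown();
    pool.awaitTermination(3, TimeUnit.DAYS); // generous upper bound for a big batch
}

/** Recursively collects all regular files under dir. */
private void collectFiles(File dir, List<File> out) {
    File[] children = dir.listFiles();
    if (children == null) {
        return;
    }
    for (File child : children) {
        if (child.isDirectory()) {
            collectFiles(child, out);
        } else {
            out.add(child);
        }
    }
}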
aviad

Consider Amazon S3 Bucket Explorer.

  1. It allows you to upload files in parallel, so that should speed up the process.
  2. The program has a job queue, so that if one of the uploads fails it will retry the upload automatically.
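
As a rough illustration of that queue-with-retry pattern (this is not Bucket Explorer's code): several worker threads can drain a shared queue and put a failed upload back on it. The job class, the MAX_ATTEMPTS constant and the use of the AWS SDK's putObject() are assumptions for the sketch.

/** Sketch only: a minimal upload job for the retry queue. */
class UploadJob {
    final File file;
    final String key;
    int attempts = 0;

    UploadJob(File file, String key) {
        this.file = file;
        this.key = key;
    }
}

/**
 * Sketch only: one worker draining a shared queue. A failed upload is put back
 * on the queue until it has been attempted MAX_ATTEMPTS times.
 */
void runWorker(BlockingQueue<UploadJob> queue, AmazonS3 s3Client, String bucketName) {
    UploadJob job;
    while ((job = queue.poll()) != null) {
        try {
            s3Client.putObject(bucketName, job.key, job.file);
        } catch (AmazonClientException e) {
            job.attempts++;
            if (job.attempts < MAX_ATTEMPTS) {
                queue.offer(job); // retry later instead of losing the file
            } else {
                log.error("Giving up on [" + job.key + "] after " + job.attempts + " attempts", e);
            }
        }
    }
}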
Kevin Lo

Sounds like a job for Rsync. I've never used it in combination with S3, but S3Sync seems like what you need.
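
For reference, the core of the rsync-style approach is "only upload what is not there yet". Here is a minimal sketch of that check against the AWS SDK for Java (the SDK is an assumption here; rsync, and presumably S3Sync, also compare sizes or checksums, which this sketch does not):

/**
 * Sketch only: collects every key already present in the bucket so that an
 * interrupted batch can be resumed by skipping those files.
 * Needs java.util.* and the AWS SDK S3 classes.
 */
public Set<String> listExistingKeys(AmazonS3 s3Client, String bucketName) {
    Set<String> keys = new HashSet<String>();
    ObjectListing listing = s3Client.listObjects(bucketName);
    while (true) {
        for (S3ObjectSummary summary : listing.getObjectSummaries()) {
            keys.add(summary.getKey());
        }
        if (!listing.isTruncated()) {
            break;
        }
        // listings come back in pages of up to 1000 keys
        listing = s3Client.listNextBatchOfObjects(listing);
    }
    return keys;
}

// Usage sketch: before re-running the upload, skip anything that already made it:
//   Set<String> done = listExistingKeys(s3Client, bucketName);
//   if (!done.contains(key)) { /* upload the file */ }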

Wim
  • What is the benefit of using Rsync over, say, s3cmd? Is it faster or safer? Would it be better if I split the folders into groups, say take 10 folders and upload those, and once done move on to another 10? That would take a lot longer, I guess... – David Mar 13 '11 at 13:30
  • I don't know about s3cmd, but you said that "if the connection breaks or a similar problem occurs, I am going to have some images on S3 and some not, with no way to continue the process". Rsync compares what is already at the destination with what is not and only transfers the difference - so it can pick up where it left off after an aborted connection. – Wim Mar 13 '11 at 13:36
  • Will this work with S3 as well? Can it interface with S3 to check what has already been uploaded? – David Mar 13 '11 at 13:39
  • I would guess that S3sync does do that, but I haven't tried it myself yet. – Wim Mar 13 '11 at 14:07
  • The process of transferring them using s3cmd has been underway for a while now; I might just leave it and see how it turns out. There isn't any reason why it shouldn't work if everything keeps going the way it is now - it's just a matter of time to transfer them all... – David Mar 13 '11 at 14:38

If you don't want to actually upload all of the files (or indeed, manage it), you could use AWS Import/Export, which basically entails just shipping Amazon a hard disk.

nickgrim
  • Yes, I have seen this before, but I don't think it is going to work in this situation - I don't think this idea is going to wash with the client... – David Mar 13 '11 at 13:29

You could use superflexiblefilesynchronizer. It is a commercial product, but the Linux version is free.

It can compare and sync the folders, and multiple files can be transferred in parallel. It's fast. The interface is perhaps not the simplest, but that's mainly because it has a million configuration options.

Note: I am not affiliated in any way with this product but I have used it.

Geoff Appleford