
So, imagine that I have a Scala Vert.x Web REST API that receives file uploads via HTTP multipart requests. However, it doesn't receive the incoming file data as a single InputStream. Instead, each file is received as a series of byte buffers handed over via a few callback functions.

The callbacks basically look like this:

  // the callback that receives byte buffers (chunks) of the file being uploaded
  //  it is called multiple times until the full file has been received
  upload.handler { buffer =>
    // send chunk to backend
  }

  // the callback that gets called after the full file has been uploaded
  //  (i.e. after all chunks have been received)
  upload.endHandler { _ =>
    // do something after the file has been uploaded
  }

  // callback called if an exception is raised while receiving the file
  upload.exceptionHandler { e =>
    // do something to handle the exception
  }

Now, I'd like to use these callbacks to save the file into a MinIO bucket. (MinIO, if you're unfamiliar, is basically self-hosted S3, and its API is pretty much the same as the S3 Java API.)

Since I don't have a file handle, I need to use putObject() to put an InputStream into MinIO.

The inefficient work-around I'm currently using with the MinIO Java API looks like this:

// this is all inside the context of handling a HTTP request
val out = new PipedOutputStream()
val in = new PipedInputStream()
var size = 0
in.connect(out)

upload.handler { buffer =>
    out.write(buffer.getBytes)
    size += buffer.length()
}

upload.endHandler { _ =>
    minioClient.putObject(
        PutObjectArgs.builder()
            .bucket("my-bucket")
            .object("my-filename")
            .stream(in, size, 50000000)
            .build())
}

Obviously, this isn't optimal. Since putObject() only starts reading from the pipe after the full file has been received, the entire file ends up being buffered in memory.

I don't want to save the File to disk on the server before putting it into object storage. I'd like to put it straight into my object storage.

How could I accomplish this using the S3 API and a series of byte buffers given to me via the upload.handler callback?

EDIT

I should add that I am using MinIO because I cannot use a commercially-hosted cloud solution, like S3. However, as mentioned on MinIO's website, I can use Amazon's S3 Java SDK while using MinIO as my storage solution.
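For reference, an AWS SDK (v1) S3 client can be pointed at a MinIO endpoint roughly like this (the endpoint URL and credentials below are placeholders, not my actual configuration):

  import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
  import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
  import com.amazonaws.services.s3.AmazonS3ClientBuilder

  // MinIO serves buckets under the URL path rather than as subdomains,
  // so path-style access needs to be enabled
  val s3Client = AmazonS3ClientBuilder.standard()
    .withEndpointConfiguration(new EndpointConfiguration("http://localhost:9000", "us-east-1"))
    .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials("my-username", "my-password")))
    .withPathStyleAccessEnabled(true)
    .build()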

I attempted to follow this guide on Amazon's website for uploading objects to S3 in chunks.

The solution I attempted looks like this:

      // imports needed for this snippet (AWS SDK v1):
      //   java.io.ByteArrayInputStream, java.util,
      //   com.amazonaws.services.s3.model.{InitiateMultipartUploadRequest,
      //     UploadPartRequest, CompleteMultipartUploadRequest, PartETag}
      context.request.uploadHandler { upload =>
        println(s"Filename: ${upload.filename()}")

        val partETags = new util.ArrayList[PartETag]
        val initRequest = new InitiateMultipartUploadRequest("docs", "my-filekey")
        val initResponse = s3Client.initiateMultipartUpload(initRequest)

        upload.handler { buffer =>
          println("uploading part", buffer.length())
          try {
            val request = new UploadPartRequest()
              .withBucketName("docs")
              .withKey("my-filekey")
              .withPartSize(buffer.length())
              .withUploadId(initResponse.getUploadId)
              .withInputStream(new ByteArrayInputStream(buffer.getBytes()))

            val uploadResult = s3Client.uploadPart(request)
            partETags.add(uploadResult.getPartETag)
          } catch {
            case e: Exception => println(s"Exception raised: $e")
          }
        }

        // this gets called for EACH uploaded file sequentially
        upload.endHandler { _ =>
          // upload successful
          println("done uploading")
          try {
            val compRequest = new CompleteMultipartUploadRequest("docs", "my-filekey", initResponse.getUploadId, partETags)
            s3Client.completeMultipartUpload(compRequest)
          } catch {
            case e: Exception => println(s"Exception raised: $e")
          }
          context.response.setStatusCode(200).end("Uploaded")
        }
        upload.exceptionHandler { e =>
          // handle the exception
          println("exception thrown", e)
        }
      }

This works for small files (my test file was 11 bytes), but not for large ones.

In the case of large files, the processing inside upload.handler gets progressively slower as the upload continues. Also, upload.endHandler is never called, and the file somehow continues uploading after 100% of it has been received.

However, as soon as I comment out the s3Client.uploadPart(request) call inside upload.handler and the s3Client.completeMultipartUpload call inside upload.endHandler (basically throwing the file away instead of saving it to object storage), the upload progresses as normal and terminates correctly.

foxtrotuniform6969
  • I have added an attempt I made to use the AWS S3 Java API to put the file object into MinIO. – foxtrotuniform6969 Dec 05 '20 at 20:52
  • https://zengularity.github.io/benji/s3/usage.html works in a streaming way with any S3-compatible service (AWS, Ceph, MinIO), using Akka Stream – cchantep Dec 05 '20 at 21:20
  • OK. Any examples on using it *without* Akka? – foxtrotuniform6969 Dec 05 '20 at 21:33
  • Yeah, exactly. I didn’t ask for a library recommendation. Why would I want to use Akka to solve this one problem if I have nothing to gain from the rest of it? – foxtrotuniform6969 Dec 06 '20 at 03:45
  • Akka, FS2 ... asking why not using a streaming lib to stream seems at least weird to me ... – cchantep Dec 06 '20 at 14:36
  • I'm already using Vert.X (as you may have been able to tell, based on the callbacks), and I would hesitate to use any large additional library unless it's truly needed. So far, I've been able to handle WebSockets, Auth, and everything else that I need. I would not want to involve a large library to resolve a problem with only one HTTP endpoint/route – foxtrotuniform6969 Dec 06 '20 at 14:42
  • Do you think that I should re-post the question with more of a focus on Vert.X? Would that help? – foxtrotuniform6969 Dec 06 '20 at 14:44
  • How would a library like FS2 even solve this differently? – foxtrotuniform6969 Dec 06 '20 at 15:29

1 Answer


I figured out what I was doing wrong (when using the S3 client). I was not accumulating bytes inside my upload.handler: I need to accumulate bytes until the buffer is big enough to upload as a part, rather than uploading each time I receive a few bytes.

Since neither Amazon's S3 client nor the MinIO client did what I wanted, I decided to dig into how putObject() is actually implemented and roll my own. This is what I came up with.

This implementation is specific to Vert.x, but it can easily be generalized to work with a plain java.io InputStream: read the stream in a while loop instead of reacting to callbacks (a Vert.x ReadStream can also be bridged to an InputStream with a PipedOutputStream/PipedInputStream pair).
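As a rough sketch of that while-loop variant (illustrative only: it assumes a blocking InputStream, and the uploadPart/completeMultipartUpload wiring from the class below is left as comments):

  import java.io.{ByteArrayOutputStream, InputStream}

  // sketch: accumulate bytes from a blocking InputStream until we have at
  // least one full part, upload it, and repeat; the last part may be smaller
  def uploadFromInputStream(in: InputStream): Unit = {
    val minPartSize = 5 * 1024 * 1024
    val partBuffer = new ByteArrayOutputStream()
    val chunk = new Array[Byte](8192)
    var partNumber = 1
    var read = in.read(chunk)
    while (read != -1) {
      partBuffer.write(chunk, 0, read)
      if (partBuffer.size() >= minPartSize) {
        // upload partBuffer.toByteArray as part `partNumber` via uploadPart(...)
        partNumber += 1
        partBuffer.reset()
      }
      read = in.read(chunk)
    }
    // upload whatever is left in partBuffer as the final (possibly short)
    // part, then call completeMultipartUpload(...)
  }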

This implementation is also specific to MinIO, but it can easily be adapted to use the S3 client since, for the most part, the two APIs are the same.

In this example, Vert.x's Buffer is basically a wrapper around a byte array, and I'm not doing anything special with it here. I also replaced it with a plain byte array to make sure the approach would still work, and it did.

package server

import com.google.common.collect.HashMultimap
import io.minio.MinioClient
import io.minio.messages.Part
import io.vertx.core.buffer.Buffer
import io.vertx.core.streams.ReadStream

import scala.collection.mutable.ListBuffer

class CustomMinioClient(client: MinioClient) extends MinioClient(client) {
  def putReadStream(bucket: String = "my-bucket",
                    objectName: String,
                    region: String = "us-east-1",
                    data: ReadStream[Buffer],
                    objectSize: Long,
                    contentType: String = "application/octet-stream"
                   ) = {
    val headers: HashMultimap[String, String] = HashMultimap.create()
    headers.put("Content-Type", contentType)
    var uploadId: String = null

    try {
      val parts = new ListBuffer[Part]()
      val createResponse = createMultipartUpload(bucket, region, objectName, headers, null)
      uploadId = createResponse.result.uploadId()

      var partNumber = 1
      var uploadedSize = 0L // Long, to avoid Int overflow on files larger than 2 GiB

      // a buffer to accumulate bytes from the incoming stream until we have enough to make an `uploadPart` request
      var partBuffer = Buffer.buffer()

      // S3's minimum part size is 5 MiB, except for the last part.
      // You should probably implement your own logic for determining how big
      // to make each part based on the total object size, to avoid unnecessary
      // calls to S3 to upload small parts.
      val minPartSize = 5 * 1024 * 1024
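      // one possible approach (S3-compatible APIs generally allow at most
      // 10,000 parts per upload) is to derive the part size from the total:
      //   val partSize = math.max(minPartSize.toLong, objectSize / 10000 + 1)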

      data.handler { buffer =>

        partBuffer.appendBuffer(buffer)

        val isMinPartSize = partBuffer.length >= minPartSize
        val isLastPart = uploadedSize + partBuffer.length == objectSize

        if (isMinPartSize || isLastPart) {

          val partResponse = uploadPart(
            bucket,
            region,
            objectName,
            partBuffer.getBytes,
            partBuffer.length,
            uploadId,
            partNumber,
            null,
            null
          )

          parts.addOne(new Part(partNumber, partResponse.etag))
          uploadedSize += partBuffer.length
          partNumber += 1

          // empty the part buffer since we have already uploaded it
          partBuffer = Buffer.buffer()
        }
      }


      data.endHandler { _ =>
        completeMultipartUpload(bucket, region, objectName, uploadId, parts.toArray, null, null)
      }

      data.exceptionHandler { exception =>
        // should also probably abort the upload here
        println("Handler caught exception in custom putObject: " + exception)
      }
    } catch {
      // and abort it here as well...
      case e: Exception =>
        println("Exception thrown in custom `putObject`: " + e)
        abortMultipartUpload(
          bucket,
          region,
          objectName,
          uploadId,
          null,
          null
        )
    }
  }
}

This can all be used pretty easily.

First, set up the client:

  private val _minioClient = MinioClient.builder()
    .endpoint("http://localhost:9000")
    .credentials("my-username", "my-password")
    .build()

  private val myClient = new CustomMinioClient(_minioClient)

Then, where you receive the upload request:

      context.request.uploadHandler { upload =>
        myClient.putReadStream(objectName = upload.filename(), data = upload, objectSize = myFileSize)
        context.response().setStatusCode(200).end("done")
      }

The only catch with this implementation is that you need to know the file sizes in advance for the request.

However, this can easily be solved the way I did it, especially if you're using a web UI (a sketch follows the list below):

  • Before attempting to upload the files, send a request to the server containing a map of file name to file size.
  • That pre-request should generate a unique ID for the upload.
  • The server saves the group of filename -> filesize entries, using the upload ID as an index.
  • The server sends the upload ID back to the client.
  • The client sends the multipart upload request using the upload ID.
  • The server pulls the list of files and their sizes and uses it to call .putReadStream().
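
A minimal sketch of that pre-request handler (the route, JSON shape, and in-memory registry are illustrative assumptions, and `router` is an existing Vert.x Web Router):

  import java.util.UUID
  import scala.collection.concurrent.TrieMap
  import scala.jdk.CollectionConverters._

  // hypothetical in-memory registry: uploadId -> (filename -> size)
  val pendingUploads = TrieMap[String, Map[String, Long]]()

  router.post("/uploads/init").handler { context =>
    context.request().bodyHandler { body =>
      // assumed request body: { "files": { "report.pdf": 123456, ... } }
      val files = body.toJsonObject.getJsonObject("files")
      val sizes = files.fieldNames().asScala
        .map(name => name -> files.getLong(name).longValue())
        .toMap
      val uploadId = UUID.randomUUID().toString
      pendingUploads.put(uploadId, sizes)
      context.response().end(uploadId)
    }
  }

  // later, in the upload route, look the size up by upload ID and filename:
  //   val myFileSize = pendingUploads(uploadId)(upload.filename())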
foxtrotuniform6969