1

I have a class that takes a local file, transforms it, and stores it in GCS:

import java.nio.channels.Channels
import java.nio.file.{ Files, Path }
import java.util.zip.{ GZIPOutputStream, ZipInputStream }

import com.google.cloud.storage.{ BlobInfo, Storage }
import com.google.common.io.ByteStreams
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
import org.apache.commons.io.IOUtils
import resource._


class GcsService(gcsStorage: Storage) {

  def storeFileInGcs(localPath: Path, destination: FileDestination): Unit = {
    val blobInfo = BlobInfo.newBuilder(destination.bucket, destination.path).build

    if (destination.unzipGzip) {
      for (input ← managed(new ZipInputStream(Files.newInputStream(localPath)));
           output ← managed(new GZIPOutputStream(Channels.newOutputStream(gcsStorage.writer(blobInfo))))) {
        ByteStreams.copy(input, output)
      }
    } else if (destination.decompressBzip2) {
      for (input <- managed(new BZip2CompressorInputStream(Files.newInputStream(localPath)));
           output <- managed(Channels.newOutputStream(gcsStorage.writer(blobInfo)))) {
        ByteStreams.copy(input, output)
      }
    } else {
      for (input <- managed(Files.newInputStream(localPath));
           output <- managed(Channels.newOutputStream(gcsStorage.writer(blobInfo)))) {
        IOUtils.copy(input, output)
      }
    }
  }

}

case class FileDestination(unzipGzip: Boolean, decompressBzip2: Boolean, bucket: String, path: String)

I am trying to remove some code duplication, in particular the creation of the fileInputStream and gcsOutputStream. But I cannot simply extract those variables at the top of the method, because it would create resources outside of the scala-arm managed block:

import java.io.{ InputStream, OutputStream }
import java.nio.channels.Channels
import java.nio.file.{ Files, Path }
import java.util.zip.{ GZIPOutputStream, ZipInputStream }

import com.google.cloud.storage.{ BlobInfo, Storage }
import com.google.common.io.ByteStreams
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream
import org.apache.commons.io.IOUtils
import resource._


class GcsService(gcsStorage: Storage) {

  def storeFileInGcs(localPath: Path, destination: FileDestination): Unit = {
    val blobInfo = BlobInfo.newBuilder(destination.bucket, destination.path).build

    // FIXME: creates a resource outside of the ARM block
    val fileInputStream = Files.newInputStream(localPath)
    val gcsOutputStream = Channels.newOutputStream(gcsStorage.writer(blobInfo))

    if (destination.unzipGzip) {
      unzipGzip(fileInputStream, gcsOutputStream)
    } else if (destination.decompressBzip2) {
      decompressBzip2(fileInputStream, gcsOutputStream)
    } else {
      copy(fileInputStream, gcsOutputStream)
    }
  }

  private def unzipGzip(inputStream: InputStream, outputStream: OutputStream): Unit = {
    for (input ← managed(new ZipInputStream(inputStream));
         output ← managed(new GZIPOutputStream(outputStream))) {
      ByteStreams.copy(input, output)
    }
  }

  private def decompressBzip2(inputStream: InputStream, outputStream: OutputStream): Unit = {
    for (input <- managed(new BZip2CompressorInputStream(inputStream));
         output <- managed(outputStream)) {
      ByteStreams.copy(input, output)
    }
  }

  private def copy(inputStream: InputStream, outputStream: OutputStream): Unit = {
    for (input <- managed(inputStream);
         output <- managed(outputStream)) {
      IOUtils.copy(input, output)
    }
  }
}

case class FileDestination(unzipGzip: Boolean, decompressBzip2: Boolean, bucket: String, path: String)

As you can see, the code is a lot clearer and more testable, but resources are not handled correctly since they are not "managed". As an example, if an exception is thrown when creating gcsOutputStream, fileInputStream won't be closed.

I could probably solve this using Google Guava sources and sinks, but I am wondering if there is a better way of doing this in Scala, without introducing Guava. Ideally using the standard library, or a scala-arm feature, or maybe even in Cats?

  • Should I define fileInputStream and gcsOutputStream as functions that take nothing and return the stream? It seems the code will be more verbose with () => InputStream and () => OutputStream everywhere?
  • Should I use multiple scala-arm "managed" for comprehensions (one to define fileInputStream and gcsOutputStream, and another one inside each sub-function)? If I do that, isn't it a problem that the "inner" inputstream will be closed twice?
  • Is there a clean and "scalaish" approach to doing this that I am not seeing?
Krzysztof Atłasik
  • 21,985
  • 6
  • 54
  • 76
Etienne Neveu
  • 12,604
  • 9
  • 36
  • 59
  • 3
    Are you on Scala 2.13? Have you considered using [scala.util.Using](https://www.scala-lang.org/files/archive/api/current/scala/util/Using$.html) for some (all?) of your resource management? – jwvh Jan 22 '20 at 22:52
  • Very interesting, I didn't know about that new feature! Sadly, we are still on Scala 2.12, and it would be a big effort to migrate to Scala 2.13 due to technical debt, so probably in a few months... I'll migrate our `scala-arm` code to `Using` when we do. Is there a clean way to solve what I am trying to do with `Using` that is not available with `scala-arm`? If so, that might be an interesting answer to this question, and I guess I could use Guava sources/sinks until we migrate. – Etienne Neveu Jan 22 '20 at 23:01
  • 2
    I would recommend using an _effect system_ which provides full resource management. Like [**cats-effect** `Resource`](https://typelevel.org/cats-effect/api/cats/effect/Resource.html) or [**zio** `Managed`](https://zio.dev/docs/datatypes/datatypes_managed). – Luis Miguel Mejía Suárez Jan 22 '20 at 23:11
  • 2
    Or you can have a look at a Scala DSL for GCS that offer streaming and resource mngt like [Benji](https://zengularity.github.io/benji/) (I'm contributor of) – cchantep Jan 22 '20 at 23:30
  • @LuisMiguelMejíaSuárez : indeed, using an `effect system` might be the best approach to solve this cleanly. I guess it would mean moving from our current "java style" to a more "functional programming style". I'm going to look into cats-effect / zio. I would be really interested to see an answer that shows how the code would look with that approach. @cchantep : thanks a lot for suggesting Benji, I didn't know about it and it looks very interesting. Sadly, I can't add too many dependencies like Play / Akka, we are actually trying to reduce the number of dependencies in this project. – Etienne Neveu Jan 23 '20 at 10:54

1 Answers1

1

You could refactor it like this:

First, declare managed resources:

val fileInputStream: ManagedResource[InputStream] = managed(Files.newInputStream(localPath))
val gcsOutputStream: ManagedResource[OutputStream] = managed(Channels.newOutputStream(gcsStorage.writer(blobInfo)))

It doesn't open these resources, it's just declaration, that you want these resources to be managed.

Then you could use map to wrap them in desired decorators (like ZipInputStream):

if (destination.unzipGzip) {
  for (input ← fileInputStream.map(s => new ZipInputStream(s));
       output ← gcsOutputStream.map(s => new GZIPOutputStream(s))) {
    ByteStreams.copy(input, output)
  }
} else if (destination.decompressBzip2) {
  for (input <- fileInputStream.map(s => new BZip2CompressorInputStream(s));
       output <- gcsOutputStream) {
    ByteStreams.copy(input, output)
  }
} else {
  for (input <- fileInputStream;
       output <- gcsOutputStream) {
    IOUtils.copy(input, output)
  }
}

Of course ManagedResource[A] is just value, so you can even pass it to a method as parameter:

private def unzipGzip(inputStream: Managed[InputStream], outputStream: Managed[OutputStream]): Unit = {
  for (input ← inputStream.map(s => new ZipInputStream(s));
       output ← outputStream.map(s => new GZIPOutputStream(s))) {
    ByteStreams.copy(input, output)
  }
}
Krzysztof Atłasik
  • 21,985
  • 6
  • 54
  • 76
  • 1
    That is an interesting approach, but the "wrapping" `ZipInputStream` / `GZIPOutputStream` / `BZip2CompressorInputStream` sadly won't be closed, only the inner `fileInputStream`/ `gcsOutputStream `. That's because the scala-arm `map()` method does not "manage" the result of the map operation (which may be anything, not always a wrapping resource). It wouldn't be a problem if the wrapping resources were stateless, but sadly that's not the case (e.g. we need to close the `ZipInputStream`'s inflater to close the associated native ZLIB resources). – Etienne Neveu Jan 23 '20 at 10:42
  • So the behavior would be similar to having a single for-comprehension at the top that "manages" the `fileInputStream` / `gcsOutputStream`, and then wrapping those streams in sub-methods without using additional `managed` for-comprehensions. – Etienne Neveu Jan 23 '20 at 10:44
  • @EtienneNeveu Maybe it could be done with flatMap, I will check it and update my answer. – Krzysztof Atłasik Jan 23 '20 at 11:14