I have an S3 structure that's the result of a Spark job that writes partitioned CSV files like below.

bucketA
  output
    cleaned-data1
      part000....csv
      part001....csv
      part002....csv
    cleaned-data2
      .....

What I need is an Akka HTTP endpoint that takes the output name and downloads all of its parts as a single zip file: https://..../download/cleaned-data1.

When this endpoint is called, ideally I want to:

  1. Open a zip stream from the server to the client browser

  2. Open the part files and stream their bytes into the zip stream, directly to the client, without buffering on the server, to avoid memory issues

The total size of all parts can get up to 30GB uncompressed.

Is there a way to do this through Akka Stream, Akka HTTP or Play? Can I utilize the Alpakka library?

Edit, based on Ramon's answer:

  def bucketNameToFileContents(bucket : String) : Source[ByteString, _] =
    bucketNameToKeySource(bucket)
      .map(key => S3.download(bucket, key))
      .map(x => x.map(y => y.fold(Source.empty[ByteString])(_._1)))
      .flatMapConcat(identity)
      .flatMapConcat(identity)
suriyanto

1 Answer

The first step is to create an akka stream Source of the bucket contents:

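import akka.stream.alpakka.s3.scaladsl.S3
import akka.stream.scaladsl.Source

type Key = String

// Emits the key of every object in the bucket (None = no prefix filter).
def bucketNameToKeySource(bucket : String) : Source[Key, _] =
  S3.listBucket(bucket, None)
    .map(_.key)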
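
In the layout from the question, bucketA is the actual S3 bucket, while output/cleaned-data1 is a key prefix inside it (S3 bucket names cannot contain "/"). A minimal sketch of listing only one output directory, assuming Alpakka S3 1.0 or later; the object name S3Keys and the helpers folderPrefix and partKeys are hypothetical names introduced here for illustration:

```scala
import akka.NotUsed
import akka.stream.alpakka.s3.scaladsl.S3
import akka.stream.scaladsl.Source

object S3Keys {
  // Hypothetical helper: builds the key prefix for one Spark output directory,
  // e.g. folderPrefix("output", "cleaned-data1") gives "output/cleaned-data1/".
  def folderPrefix(parent: String, child: String): String =
    s"$parent/$child/"

  // Lists only the CSV part files stored under that prefix.
  def partKeys(bucket: String, parent: String, child: String): Source[String, NotUsed] =
    S3.listBucket(bucket, Some(folderPrefix(parent, child)))
      .map(_.key)
      .filter(_.endsWith(".csv")) // skips markers such as Spark's _SUCCESS file
}
```

With this, S3Keys.partKeys("bucketA", "output", "cleaned-data1") would play the role of the key source above, restricted to one output directory.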

This can now be combined with the S3 download capabilities and flatMapConcat:

def bucketNameToFileContents(bucket : String) : Source[ByteString, _] =
  bucketNameToKeySource(bucket)
    // S3.download emits an Option of (data source, metadata) per key
    .flatMapConcat(key => S3.download(bucket, key))
    .flatMapConcat {
      case Some((data, _)) => data
      case None            => Source.empty[ByteString]
    }

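This function can now be incorporated into your Route. The question asks to "open a zip stream from the server to the client", so encodeResponse is used, which compresses the response body (gzip or deflate) when the client's Accept-Encoding header allows it: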

def bucketNameToRoute(parentBucketName : String) : Route = 
  encodeResponse {
    path ("download" / Segment) { childBucketName =>

      val bucketName = parentBucketName + "/" + childBucketName

      val byteStrSource = bucketNameToFileContents(bucketName)

      complete(HttpEntity(ContentTypes.`application/octet-stream`, byteStrSource))
    } 
  }
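
Note that encodeResponse applies HTTP content coding to a single body, and concatenating the parts yields one long CSV stream rather than an archive with one entry per part. If a real .zip file is wanted, one option is the Archive.zip() flow from the Alpakka File module. The following is a sketch under assumptions, not tested against S3: it assumes Alpakka File 2.0 or later and the Alpakka S3 1.x download API, and the names ZipDownload, Entry, zipEntries and s3Entries are hypothetical.

```scala
import akka.NotUsed
import akka.stream.alpakka.file.ArchiveMetadata
import akka.stream.alpakka.file.scaladsl.Archive
import akka.stream.alpakka.s3.scaladsl.S3
import akka.stream.scaladsl.Source
import akka.util.ByteString

object ZipDownload {
  // (file name inside the zip, file bytes)
  type Entry = (String, Source[ByteString, Any])

  // Turns a stream of named byte sources into the bytes of a single zip archive.
  def zipEntries(entries: Source[Entry, NotUsed]): Source[ByteString, NotUsed] =
    entries
      .map { case (name, data) => (ArchiveMetadata(name), data) }
      .via(Archive.zip())

  // One entry per S3 key, named after the last path segment of the key.
  def s3Entries(bucket: String, keys: Source[String, NotUsed]): Source[Entry, NotUsed] =
    keys.map { key =>
      val data: Source[ByteString, Any] = S3.download(bucket, key).flatMapConcat {
        case Some((bytes, _)) => bytes
        case None             => Source.empty[ByteString] // no object found for this key
      }
      (key.split('/').last, data)
    }
}
```

The resulting Source[ByteString, NotUsed] could then be passed to HttpEntity with the application/zip media type inside complete, without encodeResponse, since the archive is already compressed. Each part is pulled only as the client consumes the response, so the server should not need to hold whole files in memory.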
Ramón J Romero y Vigil
  • Thanks, Ramon! I found a compile error on the third line of the second snippet: `.map(_.getOrElse(Source.empty[ByteString])`, as the inner `_` inside map is of type `Source[Option[X]]`. Am I missing an implicit import? – suriyanto Jul 19 '19 at 14:42
  • @suriyanto You are welcome. I didn't try to compile/run the code snippets. They were for demonstration purposes so there are likely some minor errors... – Ramón J Romero y Vigil Jul 20 '19 at 13:22
  • Ramon, I added my changed version of `bucketNameToFileContents` to my original post. Instead of exposing this stream through Akka HTTP, how would one make it available through Play? This would be part of an existing Play-based service. – suriyanto Jul 21 '19 at 15:17
  • @suriyanto I don't know, I've never used `play` – Ramón J Romero y Vigil Jul 22 '19 at 10:19
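
On the Play question raised in the comments: Play's HttpEntity.Streamed accepts an Akka Streams Source[ByteString, _] directly, so the same byte stream could in principle be returned from a Play action. A minimal sketch, assuming an Akka-based Play version (2.6 to 2.8); the names PlayDownload and zipResult are hypothetical and not part of Play:

```scala
import akka.stream.scaladsl.Source
import akka.util.ByteString
import play.api.http.HttpEntity
import play.api.mvc.{ResponseHeader, Result}

object PlayDownload {
  // Hypothetical helper: wraps an already-built byte stream in a streamed Play result.
  def zipResult(fileName: String, bytes: Source[ByteString, _]): Result =
    Result(
      header = ResponseHeader(
        200,
        Map("Content-Disposition" -> s"attachment; filename=$fileName")
      ),
      // The total size is not known up front, so no content length is set.
      body = HttpEntity.Streamed(bytes, contentLength = None, contentType = Some("application/zip"))
    )
}
```

Inside a controller this might be used as Action { PlayDownload.zipResult("cleaned-data1.zip", source) }, where source is whatever Source[ByteString, _] the service builds for the requested output.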