
I'm trying to consume a bunch of files from S3 in a streaming manner using Akka Streams:

S3.listBucket("<bucket>", Some("<common_prefix>"))
  .flatMapConcat { r => S3.download("<bucket>", r.key) }
  .mapConcat(_.toList)
  .flatMapConcat(_._1)
  .via(Compression.gunzip())
  .via(Framing.delimiter(ByteString("\n"), Int.MaxValue))
  .map(_.utf8String)
  .runForeach { x => println(x) }

Without increasing akka.http.host-connection-pool.response-entity-subscription-timeout I get

java.util.concurrent.TimeoutException: Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it.

This happens for the second file, just after the last line of the first file has been printed, when the first line of the second file is about to be read.

I understand the nature of that exception. What I don't understand is why the request for the second file is already in progress while the first file is still being processed. I guess there's some buffering involved.

Any ideas how to get rid of that exception without having to increase akka.http.host-connection-pool.response-entity-subscription-timeout?

shagoon

  • You could check whether you have the same issue with [Benji S3](https://zengularity.github.io/benji/) (Scala DSL for S3/GCP/... I'm a contributor of). – cchantep Jun 05 '20 at 14:14
  • Try this? https://doc.akka.io/docs/akka-http/current/implications-of-streaming-http-entity.html#integrating-with-akka-streams – yiksanchan Jun 05 '20 at 15:44
  • Thanks for your comments/suggestions. @YikSanChan: I think (hope) the documentation is misleading. For large files, you just can't consume the stream within a second. I think what's really meant is that you must not pause longer than the configured timeout between pulling elements from that stream. Using `runReduce` on those sources effectively buffers the data in memory, which is not what I want, and which is not streaming. @cchantep: I guess Benji S3 also buffers the whole file in memory? – shagoon Jun 05 '20 at 16:31
  • I added `.log("before download")` and `.log("after download")` just around the first `flatMapConcat`. The request for the second file is sent immediately after the first one responds. I think that's wrong. – shagoon Jun 05 '20 at 16:33

1 Answer


Instead of merging the processing of all downloaded files into one stream with flatMapConcat, you could materialize a stream per object inside the outer stream and fully process it there before emitting anything downstream. That way the download of the next object doesn't begin until the previous one has been fully processed.

Generally you want to avoid having too many stream materializations to reduce overhead, but I suspect that would be negligible for an app performing network I/O like this.

Let me know if something like this works: (warning: untested)

S3.listBucket("<bucket>", Some("<common_prefix>"))
  .mapAsync(1) { result =>
    S3.download("<bucket>", result.key)
      .mapConcat(_.toList)
      .flatMapConcat(_._1)
      .via(Compression.gunzip())
      .via(Framing.delimiter(ByteString("\n"), Int.MaxValue))
      .map(_.utf8String)
      .runWith(Sink.seq)
  }
  .mapConcat(identity)
  .runForeach { x => println(x) }
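
The mapAsync(1) sequencing can be illustrated with plain Scala Futures, independent of Akka and S3. This is only a sketch: processObject and the keys "a"/"b" are hypothetical stand-ins for downloading and fully processing one object. The point is that chaining with flatMap, which is what mapAsync(1) amounts to, starts each download only after the previous one has been fully consumed:

```scala
import scala.collection.mutable.ListBuffer
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val log = ListBuffer.empty[String]

// Hypothetical stand-in for downloading and fully processing one S3 object.
def processObject(key: String): Future[Seq[String]] = Future {
  log += s"start $key"
  val lines = Seq(s"$key-line1", s"$key-line2")
  log += s"done $key"
  lines
}

// mapAsync(1) semantics: create the next Future only after the previous
// one completes, so objects are downloaded and processed strictly in order.
val keys = Seq("a", "b")
val all: Future[Seq[String]] =
  keys.foldLeft(Future.successful(Seq.empty[String])) { (acc, key) =>
    acc.flatMap(done => processObject(key).map(done ++ _))
  }

val result: Seq[String] = Await.result(all, 5.seconds)
println(log.mkString(", "))  // start a, done a, start b, done b
```

With parallelism 1, the second "download" cannot even be issued before the first one finishes, which is exactly what prevents the second response entity from sitting unsubscribed past the timeout.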
Sean Glover