
Answering this question, Odomontois showed how you can implement a lazy group-by operator that groups a pre-sorted stream by a key without having to hold the whole stream in memory. Is there any way to do something like this with Akka's streams (i.e. Source objects)? Alternatively, is there any way to pull a regular Stream object out of an Akka Source so I can use Odomontois's chopBy?

Here's a failed attempt at this that doesn't compile:

  import akka.stream.scaladsl.Source

  implicit class SourceChopOps[T, NU](s: Source[T, NU]) {
    def chopBy[U](f: T => U) = {
      s.prefixAndTail(1)
        .map(pt => (pt._1.head, pt._2))
        .map {
          case (prefix, tail) =>
            // what to do with the pulled-off head???
            tail.takeWhile(e => f(e) == f(prefix)) ++
              tail.dropWhile(e => f(e) == f(prefix)).chopBy(f) // fails here
        }
    }
  }
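A possible direction using only Akka's built-in operators (an untested sketch, not Odomontois's chopBy; it relies on statefulMapConcat, splitWhen, fold, and concatSubstreams, all standard Source/Flow operators in recent Akka Streams versions): tag each element with whether its key differs from the previous element's key, start a new substream at each such boundary, and fold each substream into a Seq, so only one group's worth of elements is buffered at a time.

import akka.NotUsed
import akka.stream.scaladsl.Flow
import scala.collection.immutable

// Chop a pre-sorted stream into one Seq per run of equal keys.
def chopBy[T, K](f: T => K): Flow[T, immutable.Seq[T], NotUsed] =
  Flow[T]
    .statefulMapConcat { () =>
      // One mutable cell per materialization: the previous element's key.
      var lastKey: Option[K] = None
      elem => {
        val key = f(elem)
        val startsNewGroup = !lastKey.contains(key)
        lastKey = Some(key)
        (startsNewGroup, elem) :: Nil
      }
    }
    .splitWhen(_._1)                 // new substream whenever the key changes
    .map(_._2)                       // drop the boundary tag
    .fold(Vector.empty[T])(_ :+ _)   // buffer exactly one group's elements
    .concatSubstreams                // emit the groups one after another

// e.g. Source(List(1, 1, 2, 2, 2, 3)).via(chopBy(identity))
// should emit Vector(1, 1), Vector(2, 2, 2), Vector(3).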
  • Have you checked the official docs? http://doc.akka.io/docs/akka/2.4.9/scala/stream/stream-cookbook.html#implementing-reduce-by-key – fGo Sep 01 '16 at 18:15
  • Thanks for the info @fGo. Does the Akka groupBy somehow get around the need to hold most intermediate data in memory? Does it need to hold onto the data for every substream before returning them, or does it avoid this with some flow-control tricks? Avoiding that was the primary impetus behind chopBy: it only needs to hold data in memory for a single key at a time. – Choppy The Lumberjack Sep 01 '16 at 21:41

1 Answer


groupBy in Akka Streams will keep the keys you are grouping by in memory, but the streams are always "lazy" in the sense that they are back-pressured, so the pipeline runs in bounded memory: if downstream does not accept new elements, no new elements will be produced upstream.

So for example:

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Source

implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()

case class Record(id: Int)

Source.fromIterator(() =>
    Iterator
      .fill(1000)(Iterator(1, 2).map { n => println("creating"); Record(n) })
      .flatten)
  .groupBy(maxSubstreams = 2, _.id)
  .map { r => println("consuming"); r }
  .fold(0)((acc, _) => acc + 1)   // count the elements in each substream
  .mergeSubstreams
  .runForeach(println)

This will show that Record instances are produced only as fast as they can be consumed by each of the two substreams, rather than all of them being created upfront.
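To see the back-pressure even more directly, here is a small sketch (not from the original answer) that slows the consumer down with the standard throttle operator; the "creating" prints then only run ahead of the "consumed" prints by the stage's small internal buffer (16 elements by default), no matter how large the source is:

import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ThrottleMode}
import akka.stream.scaladsl.Source
import scala.concurrent.duration._

implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()

// An effectively infinite source: without back-pressure this would print
// "creating" as fast as it can; with the 1-per-second throttle downstream,
// production pauses once the internal buffer is full.
Source.fromIterator(() => Iterator.from(1).map { n => println(s"creating $n"); n })
  .throttle(1, 1.second, 1, ThrottleMode.Shaping)
  .runForeach(n => println(s"consumed $n"))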

johanandren