
I have an Iterator[Record] that is ordered by record.id, like this:

record.id=1
record.id=1
...
record.id=1
record.id=2
record.id=2
...
record.id=2

Records with a specific ID can occur a large number of times, so I want to write a function that takes this iterator as input and lazily returns an Iterator[Iterator[Record]].

I was able to come up with the following, but it fails with a StackOverflowError after 500K records or so:

def groupByIter[T, B](iterO: Iterator[T])(func: T => B): Iterator[Iterator[T]] = new Iterator[Iterator[T]] {
  var iter = iterO
  def hasNext = iter.hasNext

  def next() = {
    val first = iter.next()
    val firstValue = func(first)
    val (i1, i2) = iter.span(el => func(el) == firstValue)
    iter = i2
    Iterator(first) ++ i1
  }
}

What am I doing wrong?

jeffreyveon
  • Grouping assumes you iterate over the whole collection and group all the values. I don't think it is possible to do this lazily – Nyavro Nov 27 '15 at 10:16

2 Answers


The trouble here is that each Iterator.span call creates another stacked closure for the trailing iterator, and without any trampolining it is very easy to overflow the stack.
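
For illustration, here is a minimal sketch of that pattern (my own reproduction, not from the question; the sizes are arbitrary): every iteration re-spans the trailing iterator, so each group adds one more wrapper around the underlying iterator, and hasNext/next eventually have to traverse a chain deep enough to throw.

var it: Iterator[Int] = Iterator.fill(1000000)(Iterator(1, 2)).flatten
while (it.hasNext) {
  val head = it.next()
  val (_, rest) = it.span(_ == head) // one more wrapper layer per group
  it = rest                          // a deep enough chain throws StackOverflowError
}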

Actually, I don't think there is an implementation that does not memoize the elements of the prefix iterator, since the trailing iterator could be accessed before the prefix is drained.

Even the .span implementation uses a Queue to memoize elements in its Leading definition.
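
A small check of that buffering behaviour (my own illustration, assuming the documented span contract): if the trailing iterator is consumed first, the leading elements are kept in the buffer and can still be read afterwards.

val (lead, trail) = Iterator(1, 1, 2, 3).span(_ == 1)
println(trail.toList) // List(2, 3) - draining the trail first buffers the leading 1s
println(lead.toList)  // List(1, 1) - served from the internal buffer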

So the easiest implementation I can imagine is the following, via Stream:

implicit class StreamChopOps[T](xs: Stream[T]) {
  // Splits the stream into maximal runs of consecutive elements that share the same f-value.
  def chopBy[U](f: T => U): Stream[Stream[T]] = xs match {
    case x #:: _ =>
      def eq(e: T) = f(e) == f(x)
      // head of the result: the current run; tail: lazily chop the rest
      xs.takeWhile(eq) #:: xs.dropWhile(eq).chopBy(f)
    case _ => Stream.empty
  }
}

It may not be the most performant option, since it memoizes a lot, but with proper iteration the GC should handle the problem of the excess intermediate streams.

You can use it as `myIterator.toStream.chopBy(f)`.

A simple check validates that the following code runs without a StackOverflowError:

Iterator.fill(10000000)(Iterator(1,1,2)).flatten //1,1,2,1,1,2,...
  .toStream.chopBy(identity)                     //(1,1),(2),(1,1),(2),...
  .map(xs => xs.sum * xs.size).sum               //60000000
Odomontois
  • There is a small typo. You have to apply the `f` in `chopBy` like `xs.takeWhile(e => f(e) == f(x)) #:: xs.dropWhile(e => f(e) == f(x)).chopBy(f)`. – jeffreyveon Dec 10 '15 at 04:44
  • @jeffreyveon definitely! Thank you – Odomontois Dec 10 '15 at 05:43
  • I tried to use the proposed method with foreach at the end of the call chain, but got a memory overflow. It looks like it first iterates the whole stream while chopping it as needed, and only after all the chops are ready calls foreach on each of them. So the stream behavior is actually lost, since the whole stream is loaded into memory at once. – sashaostr Feb 09 '17 at 12:28
  • @alexanderostrikov It definitely cannot handle very long sequences of similar elements, since it waits for each one to end before sending it downstream. – Odomontois Feb 09 '17 at 12:51
  • @Odomontois Trying to reproduce the problem in simple code that can be posted here, I found something that I can't explain based on my current Scala knowledge. Your code actually seems to work as expected (sorry for the first comment) when I call it like this: `Iterator.range(0,1000000000,1).toStream .chopBy(x=> x % 20 == 0) .map(s => s.toList) .foreach(println)` But when I call it like this: `val chopped= Iterator.range(0,1000000000,1).toStream .chopBy(x=> x % 20 == 0) .map(s => s.toList) chopped.foreach(println)` I get **OutOfMemoryError** – sashaostr Feb 09 '17 at 13:20
  • @alexanderostrikov This is a well-known problem. The whole memory efficiency is based on the fact that stream elements become unreferenced just after they are used in `fold`, `foreach`, etc. and are garbage collected very quickly. But when the Stream instance is assigned to a `val`, they all stay referenced and the JVM can't reclaim the memory via GC – Odomontois Feb 09 '17 at 18:14
  • You can just try to replace `val` with `def` (see the sketch after this comment thread) – Odomontois Feb 10 '17 at 06:22
  • The problem is that my actual code doesn't really have this kind of `val`; I introduced it while creating a toy example to reproduce the problem. – sashaostr Feb 12 '17 at 08:46
  • My real code is all chained from StreamResult, which is an object that returns a stream of sorted rows from the db (using a cursor inside it). On that object I'm chaining the chopBy and then map.map.sliding.foreach. I'm not using a val ref in the middle; all methods are just chained on the initial stream. – sashaostr Feb 12 '17 at 09:07
  • I have val declarations inside map blocks, but it seems to me that they don't create the problem you mentioned. – sashaostr Feb 12 '17 at 09:09
  • @alexanderostrikov `val`s inside `map` are just local variables of the mapping function and are released as soon as each iteration ends; this is entirely different from an application-lifetime `val` in an `object` – Odomontois Feb 12 '17 at 09:12
  • @Odomontois Yup, that's what I think too. – sashaostr Feb 12 '17 at 09:59
  • Now I tried chaining just chopBy on top of the stream from the db and then just foreach(println) - getting OutOfMemory. When I chain foreach(println) directly on the stream from the db, it works well. So every part works well on its own, but combined they give OutOfMemory :( – sashaostr Feb 12 '17 at 10:01
  • @alexanderostrikov Could you please post the code? I suppose there are very large parts involved, i.e. very large subsequences where the discriminator keeps the same value. – Odomontois Mar 20 '17 at 09:40
  • @Odomontois Hi, unfortunately I don't have it anymore - I ended up implementing chopBy as an extension of Iterator and wrapping the db cursor with it. I can post that if someone needs it. Unfortunately the code from my comments that I had problems with wasn't committed to source control. – sashaostr Mar 23 '17 at 09:37
  • It's also worth mentioning that when working with a chopBy-like operation on top of a stream/iterator, OutOfMemory can still occur, and not because of implementation problems in the underlying parts, but simply because a single "chop" can be bigger than the allocated memory. Of course it's obvious, but it's not always on top of your head when you're trying to understand what the hell is going on with your code LOL – sashaostr Mar 23 '17 at 09:50
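
To illustrate the `val` vs `def` point from the comments above (editor's sketch, not from the original thread; it assumes the StreamChopOps implicit class above is in scope and the range size is arbitrary): chaining everything in one expression, or going through a `def`, lets each chopped group become unreachable after it is processed, whereas holding the Stream head in a `val` keeps every group referenced until the traversal finishes.

def chopped = Iterator.range(0, 10000000).toStream.chopBy(_ % 20 == 0)
chopped.foreach(xs => println(xs.size)) // groups can be garbage collected as they are consumed

// In contrast, the following keeps the whole Stream reachable via `held`
// and can exhaust the heap on a long enough input:
// val held = Iterator.range(0, 10000000).toStream.chopBy(_ % 20 == 0)
// held.foreach(xs => println(xs.size))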

Inspired by the chopBy implemented by @Odomontois, here is a chopBy I implemented for Iterator. Of course each chunk should still fit in the allocated memory. It doesn't look very elegant, but it seems to work :)

import scala.annotation.tailrec

implicit class IteratorChopOps[A](toChopIter: Iterator[A]) {

  def chopBy[U](f: A => U) = new Iterator[Traversable[A]] {
    // holds an element that was read ahead and belongs to the next chunk
    var next_el: Option[A] = None

    @tailrec
    private def accum(acc: List[A]): List[A] = {
      next_el = None
      val new_acc = hasNext match {
        case true =>
          val next = toChopIter.next()
          acc match {
            case Nil =>
              acc :+ next
            case _ MatchTail t if (f(t) == f(next)) =>
              acc :+ next
            case _ =>
              // the key changed: stash the element for the next chunk and stop accumulating
              next_el = Some(next)
              acc
          }
        case false =>
          next_el = None
          return acc
      }

      next_el match {
        case Some(_) => new_acc
        case None    => accum(new_acc)
      }
    }

    def hasNext = toChopIter.hasNext || next_el.isDefined

    def next: Traversable[A] = accum(next_el.toList)
  }
}

And here is the extractor for matching a collection's init and last element:

object MatchTail {
  def unapply[A] (l: Traversable[A]) = Some( (l.init, l.last) )
}
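
A quick usage sketch (my own example, assuming the implicit class and the MatchTail extractor above are in scope):

Iterator(1, 1, 1, 2, 2, 3)
  .chopBy(identity)
  .foreach(chunk => println(chunk.toList))
// List(1, 1, 1)
// List(2, 2)
// List(3)
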
sashaostr