Scala Filter and Collect is slow

Question

I am new to Scala development and am trying to filter unnecessary lines out of an iterator using filter and collect, but the operation is far too slow.

import scala.io.Source

val src = Source.fromFile("/home/Documents/1987.csv") // 1.2 Million
val iter = src.getLines().map(_.split(":"))
val iter250 = iter.take(250000) // Only interested in the first 250,000

val intrestedIndices = Range(1, 100000, 3).toSeq // This could be any order

// An Iterator can be traversed only once, so rebuild slicedData before running each case below
val slicedData = iter250.zipWithIndex

// Takes 3 minutes
val firstCase = slicedData.collect { case (x, i) if intrestedIndices.contains(i) => x }.size 

// Takes 3 minutes
val secondCase = slicedData.filter(x => intrestedIndices.contains(x._2)).size 

// Takes 1 second
val thirdCase = slicedData.collect { case (x, i) if i % 3 == 0 => x }.size

It appears that the intrestedIndices.contains(_) call is what slows the program down in the first and second cases. Is there an alternative way to speed this up?

Your thirdCase is the right choice. Why do you want to use contains(_)? — Nyavro, Aug 09 '16 at 10:18
In the first two cases, every `contains` call scans `intrestedIndices` in linear time. Use `Set` instead of `Seq` to improve performance — Sergii Lagutin, Aug 09 '16 at 10:19
@SergeyLagutin so a one-character change :) `val intrestedIndices = range(1, 100000, 3).toSe`**`t`** — The Archetypal Paul, Aug 09 '16 at 10:59
@SergeyLagutin thanks, that did the trick for what I was trying to do — Lawan subba, Aug 09 '16 at 16:21

score 1 · Accepted Answer · answered Aug 09 '16 at 16:23

This answer helped solve the problem.

You iterate over all intrestedIndices in the first two cases in linear time. Use Set instead of Seq to improve performance – Sergey Lagutin
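
The suggestion can be sketched with a small in-memory example. This is only an illustration, not the asker's actual program: the object name IndexLookup, the helper countSelected, and the generated rows are assumptions standing in for the CSV file. The idea is to convert the indices to a Set once, so that each membership check is a hash lookup rather than an element-by-element scan of a general Seq.

```scala
object IndexLookup {
  // Hypothetical helper: counts the rows whose position is in `wanted`.
  // `wanted` is a Set, so `wanted(i)` is a hash lookup rather than a linear scan.
  def countSelected[T](rows: Iterator[T], wanted: Set[Int]): Int =
    rows.zipWithIndex.collect { case (row, i) if wanted(i) => row }.size

  def main(args: Array[String]): Unit = {
    // Stand-in data (assumption): 250,000 generated rows instead of the CSV file.
    val rows = Iterator.tabulate(250000)(i => s"row-$i")
    // Build the Set once, outside the pass over the rows.
    val wanted = Range(1, 100000, 3).toSet
    println(countSelected(rows, wanted))
  }
}
```

Because a Set does not care about order, this also covers the case where the indices arrive in arbitrary order.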

Lawan subba


score 0 · Answer 2 · answered Aug 09 '16 at 11:31

For the record, here's a method to filter with an (ordered) Seq of indices, not necessarily equidistant, without scanning the indices at each step:

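// Assumes `indices` is sorted in ascending order and contains no duplicates.
def filterInteresting[T](it: Iterator[T], indices: Seq[Int]): Iterator[T] =
  // Convert to a List so the `::` pattern matches for any Seq (e.g. a Range or Vector).
  it.zipWithIndex.scanLeft((indices.toList, Option.empty[T])) {
    case ((remaining, _), (elem, index)) => remaining match {
      // The next wanted index has been reached: emit the element and advance.
      case h :: t if h == index => (t, Some(elem))
      // Otherwise emit nothing and keep waiting for the same index.
      case l => (l, None)
    }
  }.flatMap(_._2)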
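
The same single-pass idea can also be sketched with a buffered iterator over the indices instead of scanLeft. This is an alternative illustration, not the answer's code: the names SortedIndexFilter and pick are hypothetical, and the sketch assumes the indices are ascending and free of duplicates.

```scala
object SortedIndexFilter {
  // Hypothetical helper: keeps the elements whose positions appear in `sortedIndices`.
  // Assumes `sortedIndices` is ascending and contains no duplicates.
  def pick[T](it: Iterator[T], sortedIndices: Seq[Int]): Iterator[T] = {
    // `buffered` lets us peek at the next wanted index without consuming it.
    val wanted = sortedIndices.iterator.buffered
    it.zipWithIndex.filter { case (_, i) =>
      val hit = wanted.hasNext && wanted.head == i
      if (hit) wanted.next() // move on to the next wanted index
      hit
    }.map(_._1)
  }
}
```

If the indices may arrive in any order, sorting them first (for example with indices.distinct.sorted) satisfies the assumption. Note that the returned iterator is lazy and stateful, so it should be consumed only once.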
