I am just starting out with Scala and am trying to filter unwanted lines out of an iterator using filter and collect, but the operation is too slow.

import scala.io.Source

val src = Source.fromFile("/home/Documents/1987.csv") // 1.2 Million
val iter = src.getLines().map(_.split(":"))
val iter250 = iter.take(250000) // Only interested in the first 250,000

val intrestedIndices = Range(1, 100000, 3).toSeq // This could be in any order

val slicedData = iter250.zipWithIndex

// Takes 3 minutes
val firstCase = slicedData.collect { case (x, i) if intrestedIndices.contains(i) => x }.size 

// Takes 3 minutes
val secondCase = slicedData.filter(x => intrestedIndices.contains(x._2)).size 

// Takes 1 second
val thirdCase = slicedData.collect { case (x,i ) if i % 3 == 0 => x}.size   

It appears that the intrestedIndices.contains(_) call is what slows the program down in the first and second cases. Is there an alternative way to speed this up?

Lawan subba

2 Answers

This answer helped solve the problem.

In the first two cases, every contains call scans intrestedIndices in linear time. Use a Set instead of a Seq to improve performance. – Sergey Lagutin
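
A minimal sketch of that suggestion, assuming the indices fit in memory as a Set[Int]. The names SetLookupSketch and selectByIndex, and the generated rows standing in for the CSV lines, are illustrative and not from the question.

```scala
object SetLookupSketch {
  // Keep only the elements whose position is in `wanted`.
  // A hash-based Set answers `contains` without scanning all indices.
  def selectByIndex[T](it: Iterator[T], wanted: Set[Int]): Iterator[T] =
    it.zipWithIndex.collect { case (x, i) if wanted.contains(i) => x }

  def main(args: Array[String]): Unit = {
    // Same indices as in the question, converted to a Set once up front.
    val wanted = Range(1, 100000, 3).toSet
    // Hypothetical stand-in for the first 250,000 lines of the file.
    val rows = Iterator.range(0, 250000).map(n => s"row$n")
    println(selectByIndex(rows, wanted).size)
  }
}
```

An Iterator can be traversed only once, so each measurement needs a freshly built iterator; the sketch creates a new one inside main for that reason.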

Lawan subba

For the record, here's a method to filter with an (ordered) Seq of indices, not necessarily equidistant, without scanning the indices at each step:

// Assumes `indices` is sorted in ascending order.
def filterInteresting[T](it: Iterator[T], indices: Seq[Int]): Iterator[T] =
  it.zipWithIndex.scanLeft((indices, None: Option[T])) {
    case ((remaining, _), (elem, index)) => remaining match {
      // `+:` matches any non-empty Seq; `::` would only match a List
      case h +: t if h == index => (t, Some(elem))
      case l => (l, None)
    }
  }.map(_._2).flatten
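
A small usage sketch for the method above, restated here so the block compiles on its own. The object name FilterInterestingUsage and the sample letters are illustrative. The sketch matches with the `+:` extractor and ends with collect, which is an assumption made so that non-List sequences such as Vector also work.

```scala
object FilterInterestingUsage {
  // Walks the iterator once, carrying the not-yet-matched indices along.
  // Assumes `indices` is sorted in ascending order.
  def filterInteresting[T](it: Iterator[T], indices: Seq[Int]): Iterator[T] =
    it.zipWithIndex.scanLeft((indices, Option.empty[T])) {
      case ((remaining, _), (elem, index)) => remaining match {
        case h +: t if h == index => (t, Some(elem)) // hit: drop the matched index
        case other                => (other, None)   // miss: keep waiting
      }
    }.collect { case (_, Some(elem)) => elem }

  def main(args: Array[String]): Unit = {
    val letters = Iterator("a", "b", "c", "d", "e", "f")
    // Non-equidistant indices: positions 1, 2 and 5.
    println(filterInteresting(letters, Seq(1, 2, 5)).toList) // List(b, c, f)
  }
}
```

Only the head of the remaining indices is compared at each step, so the work per element does not grow with the number of indices.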
devkat