I'm working on a Spark application in which I have an RDD[Array[Array[Float]]] and I'd like to convert it into an RDD[Float]. I have the following code to do this:

val values = predictions.flatMap { x => x.map(c => c(0)) }

However, I don't know whether the order of elements changes after using flatMap. If it does, is there another solution that keeps the order of elements intact?

HHH

2 Answers

Yes, flatMap keeps the order intact. So do map, filter, etc.
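
One way to check this locally is the minimal sketch below. It assumes spark-core is on the classpath and uses a local master; the object name FlatMapOrderCheck and the sample data are made up for illustration. It applies the same transformation as in the question (first element of every inner array) to an RDD spread over two partitions and compares the collected result with the same transformation on a plain Scala Seq.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical check, not part of the question's application.
object FlatMapOrderCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("flatmap-order").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data standing in for `predictions`.
      val nested: Seq[Array[Array[Float]]] = Seq(
        Array(Array(1f, 2f), Array(3f, 4f)),
        Array(Array(5f, 6f)),
        Array(Array(7f, 8f), Array(9f, 10f))
      )
      // Spread the data over two partitions.
      val predictions = sc.parallelize(nested, numSlices = 2)

      // Same transformation as in the question: first element of each inner array.
      val values = predictions.flatMap(x => x.map(c => c(0)))

      // The same transformation on the local collection gives the reference order.
      val expected = nested.flatMap(x => x.map(c => c(0)))

      // collect() returns partitions in index order, so the two should match.
      assert(values.collect().toSeq == expected)
    } finally {
      sc.stop()
    }
  }
}
```

Note that this only covers narrow transformations such as flatMap, map and filter; the sketch says nothing about operations that shuffle data.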

Alexey Romanov
  • So let's say my original RDD looks like this: rdd([[1,2],[3,4],[5,6]]). After applying my transformation I will get Rdd(1,2,3,4,5,6). Is that correct? – HHH Jun 24 '16 at 13:43
  • Not really. But I don't know if the situation should be included. Consider this [code snippet](https://scastie.scala-lang.org/jrOA5IUQRiatUe63DVbXfw). Since the `flatMap` is applied on a `Set`, there's no order to preserve. – Max Wong Dec 17 '20 at 21:34
  • @YanqiHuang The question is about `flatMap` on `RDD`. – Alexey Romanov Dec 17 '20 at 23:54
  • @AlexeyRomanov Oh. My bad. That was a blunder. Thanks for pointing that out :) – Max Wong Dec 18 '20 at 15:50

I have looked into the Spark source code.

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}

and

private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))

  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}

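The order of the data should remain intact: each partition is processed with iter.flatMap, which walks its elements in order, and getPartitions reuses the parent's partitions. The resulting RDD does drop the parent's partitioner (preservesPartitioning defaults to false), so later operations may repartition the data, and this may or may not bother you (depending on what you are doing).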
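
To make the reasoning concrete, here is a small sketch in plain Scala, with no Spark involved. It models an RDD as an ordered Vector of partitions and applies iterator.flatMap inside each partition, which is roughly what the quoted source does. The names Partitions and flatMapPartitions are hypothetical and are not Spark APIs.

```scala
object PartitionFlatMapSketch {
  // Hypothetical stand-in for an RDD: an ordered sequence of partitions.
  type Partitions[T] = Vector[Vector[T]]

  // Apply f inside each partition via Iterator.flatMap.
  // The number and order of partitions is left untouched.
  def flatMapPartitions[T, U](parts: Partitions[T])(f: T => Seq[U]): Partitions[U] =
    parts.map(p => p.iterator.flatMap(f).toVector)

  def main(args: Array[String]): Unit = {
    // Made-up data: two partitions holding Array[Array[Float]] elements.
    val parts: Partitions[Array[Array[Float]]] = Vector(
      Vector(Array(Array(1f, 2f), Array(3f, 4f))),
      Vector(Array(Array(5f, 6f)), Array(Array(7f, 8f)))
    )

    // First element of every inner array, as in the question.
    val out = flatMapPartitions(parts)(x => x.map(c => c(0)).toSeq)

    // Partition boundaries are kept, and so is the order within each partition.
    assert(out.size == parts.size)
    assert(out.flatten == Vector(1f, 3f, 5f, 7f))
  }
}
```

Reading the partitions back in index order then yields the elements in their original relative order.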

sebszyller