Does flatMap keep the order intact?

Question

I'm working on a Spark application in which I have an RDD[Array[Array[Float]]] and I'd like to convert it into an RDD[Float]. I have the following code for this:

val values = predictions.flatMap { x => (for(y <- 0 to x.length - 1) yield x(y)).map(c => c(0)) }

However, I don't know whether the order of elements changes after using flatMap. If it does, is there another solution that keeps the order of elements intact?

@AlbertoBonsanto This is not true. Depending on order is typically not a good idea but flatMap doesn't shuffle the data. — zero323, Jun 24 '16 at 04:12
@AlbertoBonsanto RDDs do have a fixed order despite being distributed. — Alexey Romanov, Jun 24 '16 at 05:48
So let's say my original RDD looks like this: rdd([[1,2],[3,4],[5,6]]). After applying my transformation I will get Rdd(1,2,3,4,5,6). Is that correct? — HHH, Jun 24 '16 at 10:59
@zero323 thanks for the clarification! seems I got a wrong idea! — Alberto Bonsanto, Jun 24 '16 at 11:14

score 7 · Accepted Answer · answered Jun 24 '16 at 05:46

Yes, flatMap keeps the order intact. So do map, filter, etc.
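
As a quick way to check this locally, here is a minimal sketch. It assumes a `local[2]` master and sample data invented for illustration; the names `FlatMapOrderDemo`, `firstComponents` and `allComponents` are hypothetical and not from the question. `firstComponents` does the same thing as the question's code (first element of each inner array), while `allComponents` flattens everything.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FlatMapOrderDemo {
  // Same result as the question's code: the first element of every inner array.
  def firstComponents(predictions: RDD[Array[Array[Float]]]): RDD[Float] =
    predictions.flatMap(x => x.iterator.map(c => c(0)))

  // Full flattening: every element of every inner array.
  def allComponents(predictions: RDD[Array[Array[Float]]]): RDD[Float] =
    predictions.flatMap(x => x.iterator.flatMap(c => c.iterator))

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two worker threads, just for the demo.
    val conf = new SparkConf().setAppName("flatMap-order-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data, spread over two partitions.
      val predictions = sc.parallelize(Seq(
        Array(Array(1f, 2f), Array(3f, 4f)),
        Array(Array(5f, 6f)),
        Array(Array(7f, 8f))
      ), numSlices = 2)

      // collect() returns the partitions in order, each in its internal order.
      println(firstComponents(predictions).collect().mkString(", "))
      // prints: 1.0, 3.0, 5.0, 7.0
      println(allComponents(predictions).collect().mkString(", "))
      // prints: 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0
    } finally {
      sc.stop()
    }
  }
}
```

The order seen here comes from the input RDD. If an upstream step shuffles the data (for example a `repartition` or a `groupByKey`), the order is already undefined before `flatMap` runs.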

Alexey Romanov

So let's say my original RDD looks like this: rdd([[1,2],[3,4],[5,6]]). After applying my transformation I will get Rdd(1,2,3,4,5,6). Is that correct? – HHH Jun 24 '16 at 13:43
Not really. But I don't know if the situation should be included. Consider this [code snippet](https://scastie.scala-lang.org/jrOA5IUQRiatUe63DVbXfw). Since the `flatMap` is applied on a `Set`, there's no order to preserve. – Max Wong Dec 17 '20 at 21:34
@YanqiHuang The question is about `flatMap` on `RDD`. – Alexey Romanov Dec 17 '20 at 23:54
@AlexeyRomanov Oh. My bad. That was a blunder. Thanks for pointing that out :) – Max Wong Dec 18 '20 at 15:50

score 1 · Answer 2 · answered Jun 24 '16 at 10:27

I have looked into the Spark source code. Here is how flatMap is defined:

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}

and

private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))

  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}

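The order of data should remain intact: flatMap reuses the parent's partitions (getPartitions) and, within each partition, simply calls iter.flatMap on the parent's iterator, so no shuffle or repartitioning takes place. Note, however, that preservesPartitioning defaults to false, so the resulting RDD has no partitioner. A later key-based operation may therefore shuffle the data, and this may or may not bother you (depending on what you are doing).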
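
To make the per-partition behaviour concrete, here is a small model in plain Scala with no Spark involved. It is only a sketch under the assumption that an RDD can be pictured as an ordered list of partitions; `PartitionModel`, `Partitions`, `flatMapPartitions` and `collect` are hypothetical names, not Spark APIs.

```scala
object PartitionModel {
  // Hypothetical stand-in for an RDD: an ordered sequence of partitions.
  type Partitions[T] = Vector[Vector[T]]

  // Models MapPartitionsRDD.compute for flatMap: each partition's iterator
  // is passed through Iterator.flatMap, and partitions are never mixed.
  def flatMapPartitions[T, U](parts: Partitions[T])(f: T => Iterator[U]): Partitions[U] =
    parts.map(p => p.iterator.flatMap(f).toVector)

  // Models collect(): concatenate the partitions in partition order.
  def collect[T](parts: Partitions[T]): Vector[T] = parts.flatten

  def main(args: Array[String]): Unit = {
    // Made-up data: two partitions holding [1,2],[3,4] and [5,6].
    val input: Partitions[Array[Float]] = Vector(
      Vector(Array(1f, 2f), Array(3f, 4f)),
      Vector(Array(5f, 6f))
    )
    val output = flatMapPartitions(input)(arr => arr.iterator)

    println(output.size)     // prints: 2 (same number of partitions as the input)
    println(collect(output)) // prints: Vector(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
  }
}
```

Because `Iterator.flatMap` is lazy and sequential, elements come out in the order the input iterator yields them, which is why the order within each partition is kept.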
