
I am new to Apache Spark, and I know that the core data structure is the RDD. Now I am writing some apps that require element positional information. For example, after converting an ArrayList into a JavaRDD, for each integer in the RDD I need to know its (global) array subscript. Is it possible to do this?

As far as I know, there is a take(int) method for RDD, so I believe the positional information is still maintained in the RDD.

wayi

2 Answers


I believe that in most cases zipWithIndex() will do the trick, and it will preserve the order. Read the API comments again: my understanding is that they mean exactly that the order of elements in the RDD is kept.

scala> val r1 = sc.parallelize(List("a", "b", "c", "d", "e", "f", "g"), 3)
scala> val r2 = r1.zipWithIndex
scala> r2.foreach(println)
(c,2)
(d,3)
(e,4)
(f,5)
(g,6)
(a,0)
(b,1)

The above example confirms it. The RDD has 3 partitions, and a gets index 0, b gets index 1, and so on. (foreach prints from the executors, so the partitions appear in arbitrary order, but each element keeps the index of its original position.)
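To see the pairs back in their original order, a minimal sketch continuing the same shell session is to collect them to the driver, since collect() returns partitions in order and preserves the order within each partition:

scala> r2.collect().foreach(println)
(a,0)
(b,1)
(c,2)
(d,3)
(e,4)
(f,5)
(g,6)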

zhang zhan
  • Thanks for your answer! In most cases this method is not bad, since the elements of the input array/list may be relatively large objects. However, it can be a problem for primitive-type arrays, e.g., an integer array, because what seems to be the only solution is quite inefficient in terms of both computation and storage costs. Anyway, I am very satisfied with your answer. I hope that one day natively maintaining the index (without zipWithIndex) becomes possible for Spark's RDD. – wayi Sep 28 '14 at 14:41
  • Based on the design of Spark, I cannot imagine a good way to maintain the index of each element without sacrificing storage. – zhang zhan Sep 29 '14 at 03:28

Essentially, RDD's zipWithIndex() method seems to do this, though it won't necessarily preserve the original ordering of the data the RDD was created from. But at least you'll get a stable ordering.

val orig: RDD[String] = ...
val indexed: RDD[(String, Long)] = orig.zipWithIndex()
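As a hedged usage sketch (the key "b" and the index 3 are just placeholder values), the indexed pairs can then be queried in either direction:

// all indices at which "b" occurs (lookup() is available on pair RDDs)
val positionsOfB: Seq[Long] = indexed.lookup("b")
// the element stored at index 3
val elementAt3: Array[String] = indexed.filter(_._2 == 3L).map(_._1).collect()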

The reason you're unlikely to find something that preserves the order in the original data is buried in the API doc for zipWithIndex():

"Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type. This method needs to trigger a spark job when this RDD contains more than one partitions."

So it looks like the original order is discarded. If preserving the original order is important to you, you will need to add the index before you create the RDD, as in the sketch below.
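If the data starts out as a local collection on the driver, one minimal sketch (assuming a hypothetical in-memory data sequence) is to attach the index with Scala's own zipWithIndex before parallelizing, which preserves the original order by construction:

// hypothetical local data; Scala's zipWithIndex runs on the driver,
// so each element is paired with its original position before Spark sees it
val data: Seq[String] = Seq("a", "b", "c", "d")
val indexedFirst: RDD[(String, Int)] = sc.parallelize(data.zipWithIndex)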

Spiro Michaylov
  • Yes, adding the array subscript as an additional attribute before creating the RDD can solve this problem. However, there are two serious limitations: 1) Obviously, this additional index attribute will at least double the storage cost, and the cost can be even higher, e.g., for an integer/float array a long field is added just for the index. 2) Since the index has to be added before the data is loaded into Spark, that conversion step itself cannot be parallelized by Spark, so I have to involve other parallel techniques to add the index. – wayi Sep 26 '14 at 02:37