Is Spark zipWithIndex safe with parallel implementation?

Question

If I have a file and call RDD zipWithIndex on its rows, I get:

([row1, id1001, name, address], 0)
([row2, id1001, name, address], 1)
...
([row100000, id1001, name, address], 99999)

Will I get the same index order if I reload the file? Since it runs in parallel, could the rows be partitioned differently?

score 8 · Answer 1 · answered Aug 06 '15 at 03:24

RDDs can be sorted, so they do have an order. That order is what .zipWithIndex() uses to create the index.

Whether you get the same order each time depends on what the earlier calls in your program do. The docs mention that .groupBy() can destroy the order or produce different orderings. Other calls may do this as well.

I suppose you could always call .sortBy() before calling .zipWithIndex() if you needed to guarantee a specific ordering.

This is explained in the API docs for .zipWithIndex():

public RDD<scala.Tuple2<T,Object>> zipWithIndex()

Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type. This method needs to trigger a spark job when this RDD contains more than one partitions.

Note that some RDDs, such as those returned by groupBy(), do not guarantee order of elements in a partition. The index assigned to each element is therefore not guaranteed, and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
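
To make the rule in the docs concrete, here is a rough pure-Python model of it. This is not Spark code: zip_with_index and sorted_partitions are hypothetical helpers written for illustration, partitions are modeled as plain lists, and the remark about why a job is needed is my reading of the docs rather than something they state.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Model of the documented rule: partition index first, then position within the partition."""
    # Pass 1: count every partition. Presumably this is why Spark needs a job
    # when there is more than one partition (an assumption, not from the docs).
    sizes = [len(part) for part in partitions]
    # A partition starts at the total size of all partitions before it.
    offsets = [0] + list(accumulate(sizes))[:-1]
    # Pass 2: index = partition offset + position inside the partition.
    return [
        (item, offset + position)
        for part, offset in zip(partitions, offsets)
        for position, item in enumerate(part)
    ]


def sorted_partitions(partitions, num_partitions):
    """Hypothetical stand-in for a global sort: sort everything, then cut into contiguous ranges."""
    items = sorted(item for part in partitions for item in part)
    size = max(1, -(-len(items) // num_partitions))  # ceiling division, at least 1
    return [items[start:start + size] for start in range(0, len(items), size)]


# Same data in two "evaluations", but with a different order inside each partition,
# as could happen after something like groupBy().
run_a = [["row1", "row2"], ["row3", "row4", "row5"]]
run_b = [["row2", "row1"], ["row3", "row5", "row4"]]

indexed_a = zip_with_index(run_a)
indexed_b = zip_with_index(run_b)

# Sorting before indexing removes the dependence on the incoming order.
stable_a = zip_with_index(sorted_partitions(run_a, 2))
stable_b = zip_with_index(sorted_partitions(run_b, 2))

print(indexed_a == indexed_b)  # False: indices follow whatever order the partitions had
print(stable_a == stable_b)    # True: the sort fixes the order before indexing
```

In this model the unsorted runs give row1 index 0 in one case and index 1 in the other, while the sorted runs agree. That is the same reason the docs suggest sorting (or saving to a file) when the index assignments have to be reproducible.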

Using sortBy on an RDD collects it to the driver program, right? I'm afraid it might result in an OOME. The sort order I want is just the default ordering of rows in the file. — sophie, Aug 06 '15 at 03:48
@sophie sorting is done in the workers, not the driver. If, after reading the API docs, you aren't certain of what will happen, then you should test it by running it a few times and spot checking the elements at certain index numbers. You can do that without loading all the data into the driver, by using .filter() with an anonymous function that yields true when the row number matches some particular row, like row 43, and following that with a .take(1) to bring that one piece of data to the driver. — Paul, Aug 06 '15 at 03:52
