
What is the best way to implement ROW_NUMBER (a sequence generator) in a Spark program for billions of records (>25 billion)?

Sample code:

select patient_id,
       department_id,
       row_number() over (partition by department_id order by department_id asc) as Pat_serial_Nbr
from T_patient;

row_number() in the Spark program runs for more than 4 hours and then fails for 15 billion records.
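For reference, a minimal sketch of the same window in the DataFrame API, assuming the table is registered as T_patient and spark is the SparkSession; ordering by patient_id is an assumption here, since ordering by the partition key itself would make the numbering arbitrary:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Window mirroring the SQL above. All rows sharing one department_id are
// pulled into a single partition, which is what makes this approach
// struggle at tens of billions of rows.
val w = Window.partitionBy("department_id").orderBy("patient_id")

val result = spark.table("T_patient")
  .select(
    col("patient_id"),
    col("department_id"),
    row_number().over(w).as("Pat_serial_Nbr"))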

I have tried the RDD method zipWithIndex() on 15 billion records (execution took 40 minutes), and it returns the expected results.

public RDD<scala.Tuple2<T,Object>> zipWithIndex() 

Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala's zipWithIndex, but it uses Long instead of Int as the index type. This method needs to trigger a Spark job when the RDD contains more than one partition. Note that some RDDs, such as those returned by groupBy(), do not guarantee the order of elements in a partition. The index assigned to each element is therefore not guaranteed, and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee the same index assignments, you should sort the RDD with sortByKey() or save it to a file.

scala> List("X", "Y", "Z").zipWithIndex
res0: List[(String, Int)] = List((X,0), (Y,1), (Z,2))

The same works for a DataFrame: val rdd = df.rdd.zipWithIndex
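A minimal sketch of going from zipWithIndex back to a DataFrame, assuming df is an existing DataFrame and spark is the SparkSession (the "id" column name is arbitrary):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Zip every row with its global index, then rebuild a DataFrame whose
// schema carries an extra LongType "id" column (0 to N-1, in order).
val indexedRdd = df.rdd.zipWithIndex.map {
  case (row, idx) => Row.fromSeq(row.toSeq :+ idx)
}
val schemaWithId = StructType(df.schema.fields :+ StructField("id", LongType, nullable = false))
val dfWithId = spark.createDataFrame(indexedRdd, schemaWithId)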

Second option:

val df_id = df.withColumn("id", monotonically_increasing_id())

Reference: "Spark - Monotonically increasing id not working as expected in dataframe?"
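Note that monotonically_increasing_id() only guarantees unique, monotonically increasing values, not consecutive ones: the id encodes the partition number in its upper 31 bits, so values jump between partitions. A small illustration, assuming spark is the SparkSession:

import org.apache.spark.sql.functions.monotonically_increasing_id

// Six rows in three partitions: ids are unique and increasing, but each
// partition starts at partitionId << 33, so the values are not 0..5.
val demo = spark.range(0, 6).repartition(3)
  .withColumn("mono_id", monotonically_increasing_id())
demo.show()
// mono_id might read 0, 1, 8589934592, 8589934593, 17179869184, 17179869185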

Could you please suggest the optimal way to generate sequence numbers in Scala Spark?

  • Interesting question, if true. But what is the question exactly? I rather lost it in the prose. I have used zipWithIndex fine, though not at numbers this high; and whether 40 minutes is slow is also a value judgment, imho. – thebluephantom Jun 25 '18 at 06:17
  • @thebluephantom Thanks for your comment. Is there any other way to generate a sequence for billions of records in less time than zipWithIndex? – Arvind Kumar Jun 25 '18 at 09:24
  • Do you need numbers from 0 to N-1 (N being the size of the RDD), or do you just need uniqueness (any numbers between 0 and 2^64-1, as long as they are unique)? If uniqueness is enough, `zipWithUniqueId()` does not trigger any job and would be extremely fast; see the sketch after these comments. – Oli Jun 25 '18 at 09:40
  • I need a consecutive sequence from 0 to df.count. – Arvind Kumar Jun 25 '18 at 10:00
  • But that is quite fast, is it not? – thebluephantom Jun 25 '18 at 10:47
  • This looks like a good thread: https://bigdatadiscussion.wordpress.com/2017/06/27/spark-rdd-methods-zip-zipwithindex-and-zipwithuniqueid/. I think you need to reflect on what good or bad performance means here; we cannot always get things down to a singularity. – thebluephantom Jun 25 '18 at 12:49
  • Then I think Oli's suggestion is not going to work, but you need to consider how bad that approach really is - or not? – thebluephantom Jun 25 '18 at 14:17
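For completeness, a minimal sketch of the zipWithUniqueId() alternative Oli mentions, assuming df is the DataFrame from the question:

import org.apache.spark.sql.Row

// Items in the k-th of n partitions get the ids k, n+k, 2n+k, ...: unique
// but not consecutive, and no extra Spark job is needed to compute
// per-partition offsets, unlike zipWithIndex.
val uniquelyTagged = df.rdd.zipWithUniqueId.map {
  case (row, uid) => Row.fromSeq(row.toSeq :+ uid)
}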

0 Answers