
I'm performing a union operation on 3 RDDs. I'm aware that union doesn't preserve ordering, but in my case the behaviour is quite weird. Can someone explain what's wrong in my code?

I have a DataFrame of rows (myDF) and converted it to an RDD:

val myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":")).map(rec => (2, rec))

myRdd.collect
/*
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
*/

val rowCount = myRdd.count() // Count of Records in myRdd

val header = "name:country:date:nextdate:1" // random header

// Generating header RDD
val headerRdd = sparkContext.parallelize(Array(header), 1).map(rec => (1, rec))

// Generating trailer RDD
val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount), 1).map(rec => (3, rec))

// Performing union
val unionRdd = headerRdd.union(myRdd).union(trailerRdd).map(rec => rec._2)
unionRdd.saveAsTextFile("pathLocation")

Since union doesn't preserve ordering, it shouldn't give the result below.

Output

name:country:date:nextdate:1
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3

Without using any sorting such as `sortByKey(true, 1)`, how is it possible to get the above output?

But when I remove the `map` from headerRdd, myRdd and trailerRdd, the order is:

Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
name:country:date:nextdate:1
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3

What is the possible reason for the above behaviour?

Deepak

1 Answer

In Spark, the elements within a particular partition are unordered; however, the partitions themselves are ordered (check this).
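To see why the header reliably lands first, it helps to model `union` as partition concatenation: the unioned RDD's partitions are the first parent's partitions followed by the second parent's, in order, and `saveAsTextFile` writes part files in partition order. This is a minimal plain-Scala sketch (no Spark needed) where each inner `Vector` stands for one partition; the data rows and partition sizes are illustrative assumptions:

```scala
// headerRdd was built with parallelize(..., 1), so it has exactly 1 partition.
val headerPartitions  = Vector(Vector("name:country:date:nextdate:1"))
// myRdd's rows may span several partitions; assume one row per partition here.
val dataPartitions    = Vector(Vector("row1"), Vector("row2"), Vector("row3"))
// trailerRdd was also built with 1 partition.
val trailerPartitions = Vector(Vector("T:3"))

// union concatenates the parents' partition sequences in call order...
val unionPartitions = headerPartitions ++ dataPartitions ++ trailerPartitions

// ...so reading the part files back in filename order yields
// header first, data rows in the middle, trailer last.
val output = unionPartitions.flatten
```

This is why the header/trailer placement looks "sorted" even though no sort ran: it falls out of the partition layout, not the keys.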

  • It's not ordered when I remove the `map` from the RDDs: `headerRdd = sparkContext.parallelize(Array(header), 1)` `val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount),1)` `myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":"))` `val unionRdd = headerRdd.union(myRdd).union(trailerRdd).saveAsTextFile("pathLocation")` – Deepak Jun 09 '20 at 07:14
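If the header/body/trailer order must be deterministic rather than a side effect of the partition layout, one option is to keep the 1/2/3 keys from the question and sort on them explicitly before writing. A hedged sketch (Spark Scala, assuming the question's `headerRdd`, `myRdd`, and `trailerRdd` keyed RDDs are in scope):

```scala
// Sketch: enforce header -> data -> trailer order explicitly,
// instead of relying on how union lays out partitions.
// sortByKey takes a Boolean (ascending) and a partition count;
// numPartitions = 1 forces a single, fully ordered output part file.
val ordered = headerRdd
  .union(myRdd)
  .union(trailerRdd)
  .sortByKey(ascending = true, numPartitions = 1)
  .map(_._2) // drop the ordering key before writing

ordered.saveAsTextFile("pathLocation")
```

Note that `sortByKey` expects a `Boolean`, not the string `"true"`, and that coalescing to one partition trades parallelism for a guaranteed single-file ordering.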