Technique for joining with spark dataframe w/ custom partitioner works w/ python, but not scala?

Question

I recently read an article that described how to custom-partition a DataFrame [ https://dataninjago.com/2019/06/01/create-custom-partitioner-for-spark-dataframe/ ], in which the author illustrated the technique in Python. I use Scala, and the technique looked like a good way to address skew, so I tried something similar. What I found was that when one does the following:

- create 2 data frames, D1, D2
- convert D1, D2 to 2 pair RDDs R1, R2
    (where the key is the key you want to join on)
- repartition R1, R2 with a custom partitioner 'C'
    where 'C' has 2 partitions (p-0, p-1) and
    stuffs everything in p-1, except keys == 'a'
- join R1, R2 as R3
- OBSERVE that:
    - the partitioner for R3 is 'C' (same as for R1, R2)
    - when printing the contents of each partition of R3, all entries
      except the one keyed by 'a' are in p-1
- set D1' <- R1.toDF
- set D2' <- R2.toDF

We note the following results:

0) The join of D1' and D2' produces the expected results (good)
1) The partitioners for D1' and D2' are None -- not Some(C),
   as was the case with RDDs R1/R2 (bad)
2) The contents of the glom'd underlying RDDs of D1' and D2' did
   not have everything (except key 'a') piled up
   in partition 1 as expected (bad)
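
The RDD-level part of the steps above (partition R1 and R2 with 'C', then join them) can be modelled without Spark. The following is only a plain-Python sketch of the idea, not Spark code: the helper names (custom_partition, partition_by, join_copartitioned) are hypothetical, and lists of lists stand in for RDD partitions. It shows why a join of two co-partitioned inputs leaves every record in the bucket the partitioner chose.

```python
NUM_PARTITIONS = 2


def custom_partition(key):
    """Stand-in for partitioner 'C': key 'a' goes to p-0, everything else to p-1."""
    return 0 if str(key) == "a" else 1


def partition_by(pairs, num_partitions, partition_func):
    """Place each (key, value) pair in the bucket chosen by partition_func."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_func(key) % num_partitions].append((key, value))
    return partitions


def join_copartitioned(left, right):
    """Join bucket i of `left` with bucket i of `right`; no record changes bucket."""
    joined = []
    for left_part, right_part in zip(left, right):
        right_by_key = {}
        for key, value in right_part:
            right_by_key.setdefault(key, []).append(value)
        joined.append([(key, (lv, rv))
                       for key, lv in left_part
                       for rv in right_by_key.get(key, [])])
    return joined


employees = [("a", "anne"), ("d", "dave"), ("c", "claire"), ("r", "roy"),
             ("b", "bob"), ("z", "zelda"), ("m", "moe")]
departments = [("a", "ant dept"), ("d", "duck dept"), ("c", "cat dept"),
               ("r", "rabbit dept"), ("b", "badger dept"), ("z", "zebra dept"),
               ("m", "mouse dept")]

r1 = partition_by(employees, NUM_PARTITIONS, custom_partition)
r2 = partition_by(departments, NUM_PARTITIONS, custom_partition)
r3 = join_copartitioned(r1, r2)

for index, part in enumerate(r3):
    print(f"partition {index}: {part}")
```

In this model only the record keyed by 'a' lands in partition 0, which matches what was observed for R3.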

So, I came away with the following conclusion, which will work for me practically. But it really irks me that I could not get the behavior described in the article, which used Python:

When one needs to use custom partitioning with DataFrames in Scala, one must
drop into RDDs, do the join or whatever operation on the RDD, then convert back
to a DataFrame. You can't apply the custom partitioner, then convert back to a
DataFrame, do your operations, and expect the custom partitioning to survive.
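
A likely explanation for result 2, offered as an assumption and not as something established in this post: a DataFrame join plans its own shuffle and hash-partitions both sides on the join key into spark.sql.shuffle.partitions buckets, regardless of how the source RDDs were laid out. The plain-Python sketch below models only that effect. toy_hash_partition is a made-up stand-in (Spark SQL uses its own Murmur3-based hash), so the exact placement differs from what Spark would produce; the point is that the layout no longer follows the custom partitioner.

```python
SHUFFLE_PARTITIONS = 2  # same value as spark.sql.shuffle.partitions in the question


def custom_partition(key):
    """Stand-in for partitioner 'C': key 'a' goes to p-0, everything else to p-1."""
    return 0 if key == "a" else 1


def toy_hash_partition(key):
    """Toy stand-in for hash partitioning on the join key (NOT Spark's real hash)."""
    return sum(key.encode("utf-8")) % SHUFFLE_PARTITIONS


def layout(keys, partition_func, num_partitions):
    """Return, for each partition, the keys that partition_func sends there."""
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[partition_func(key)].append(key)
    return partitions


keys = ["a", "d", "c", "r", "b", "z", "m"]

rdd_layout = layout(keys, custom_partition, 2)
df_join_layout = layout(keys, toy_hash_partition, SHUFFLE_PARTITIONS)

print("custom partitioner:", rdd_layout)
print("hash on join key:  ", df_join_layout)
```

With the toy hash the keys are spread across both buckets, which is the kind of layout seen when glomming the joined DataFrame.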

Now... I am hoping I am wrong! Perhaps someone with more expertise in Spark internals can guide me here. I have written a little program (below) to illustrate the results. Thanks in advance if you can set me straight.

UPDATE

In addition to the Scala code that illustrates the problem, I also tried a simplified version of what the original article presented in Python. The snippet below creates a DataFrame, extracts its underlying RDD and repartitions it, then recovers the DataFrame and shows that the partitioner is lost.

Python snippet illustrating problem

from pyspark.sql.types import IntegerType

mylist = [1, 2, 3, 4]
df = spark.createDataFrame(mylist, IntegerType())

def travelGroupPartitioner(key):
    return 0

dfRDD = df.rdd.map(lambda x: (x[0], x))
dfRDD2 = dfRDD.partitionBy(8, travelGroupPartitioner)
# this line uses the approach of the original article and maps to only the value,
# but map doesn't guarantee preserving the partitioner, so I tried without the
# map below...
df2 = spark.createDataFrame(dfRDD2.map(lambda x: x[1]))
print(df2.rdd.partitioner)  # prints None

# create a dataframe from the partitioned RDD _without_ the map,
# and we _still_ lose the partitioner
df3 = spark.createDataFrame(dfRDD2)
print(df3.rdd.partitioner)  # prints None

Scala snippet illustrating problem

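import org.apache.spark.{Partitioner, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object Question extends App {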

  val conf =
    new SparkConf().setAppName("blah").
      setMaster("local").set("spark.sql.shuffle.partitions", "2")
  val sparkSession = SparkSession.builder .config(conf) .getOrCreate()
  val spark = sparkSession

  import spark.implicits._
  sparkSession.sparkContext.setLogLevel("ERROR")

  class CustomPartitioner(num: Int) extends Partitioner {
    def numPartitions: Int = num
    def getPartition(key: Any): Int = if (key.toString == "a") 0 else 1
  }

  case class Emp(name: String, deptId: String)
  case class Dept(deptId: String, name: String)

  val value: RDD[Emp] = spark.sparkContext.parallelize(
    Seq(
      Emp("anne", "a"),
      Emp("dave", "d"),
      Emp("claire", "c"),
      Emp("roy", "r"),
      Emp("bob", "b"),
      Emp("zelda", "z"),
      Emp("moe", "m")
    )
  )
  val employee: Dataset[Emp] = value.toDS()
  val department: Dataset[Dept] = spark.sparkContext.parallelize(
    Seq(
      Dept("a", "ant dept"),
      Dept("d", "duck dept"),
      Dept("c", "cat dept"),
      Dept("r", "rabbit dept"),
      Dept("b", "badger dept"),
      Dept("z", "zebra dept"),
      Dept("m", "mouse dept")
    )
  ).toDS()


  val dumbPartitioner: Partitioner = new CustomPartitioner(2)

  // Convert to-be-joined dataframes to custom repartition RDDs [ custom partitioner:  cp ]
  //
  val deptPairRdd: RDD[(String, Dept)] = department.rdd.map { dept => (dept.deptId, dept) }
  val empPairRdd: RDD[(String, Emp)] = employee.rdd.map { emp: Emp => (emp.deptId, emp) }

  val cpEmpRdd: RDD[(String, Emp)] = empPairRdd.partitionBy(dumbPartitioner)
  val cpDeptRdd: RDD[(String, Dept)] = deptPairRdd.partitionBy(dumbPartitioner)

  assert(cpEmpRdd.partitioner.get == dumbPartitioner)
  assert(cpDeptRdd.partitioner.get == dumbPartitioner)

  // Here we join using RDDs and ensure that the resultant RDD is partitioned so most things end up in partition 1
  val joined: RDD[(String, (Emp, Dept))] = cpEmpRdd.join(cpDeptRdd)
  val reso: Array[(Array[(String, (Emp, Dept))], Int)] = joined.glom().collect().zipWithIndex
  // the second tuple element is the partition index, not a size
  reso.foreach { case (contents, index) => println(s"partition index: $index. contents: ${contents.toList}") }

  System.out.println("partitioner of RDD created by joining 2 RDD's w/ custom partitioner: " + joined.partitioner)
  assert(joined.partitioner.contains(dumbPartitioner))

  // Recover DataFrames from the custom-partitioned RDDs (cp*), not from the unpartitioned pair RDDs
  val recoveredDeptDF: DataFrame = cpDeptRdd.toDF
  val recoveredEmpDF: DataFrame = cpEmpRdd.toDF

  System.out.println(
    "partitioner for DF recovered from custom partitioned RDD (not as expected!):" +
      recoveredDeptDF.rdd.partitioner)
  val joinedDf = recoveredEmpDF.join(recoveredDeptDF, "_1")
  println("printing results of joining the 2 dataframes we 'recovered' from the custom partitioned RDDS (looks good)")
  joinedDf.show()

  println("PRINTING partitions of joined DF; they do not match the glom'd results we got from the underlying RDDs")
  joinedDf.rdd.glom().collect().zipWithIndex.foreach {
    case (rows, index) =>
      // the second tuple element is the partition index, not a size
      println(s"partition index: $index. contents: ${rows.toList}")
  }

  assert(joinedDf.rdd.partitioner.contains(dumbPartitioner))  // this will fail ;^(
}

After several internal tweaks, I managed to add a custom partitioner to the Dataset API. It's far from production-ready, as there are a lot of cases I haven't covered. The existing partitioner implementations are hardcoded in many places, so a lot of code needs to be reimplemented. For example, `ShuffleExchangeExec` relies on an `ExchangeCoordinator`, which accepts nothing but `ShuffleExchangeExec`, and so on. I don't know if you are interested in such code; I think it is a current limitation of Spark that may be resolved in future releases. @Chris Bedford — Gelerion, Aug 25 '19 at 07:12
@Gelerion - would love to take a look even if it is a prototype just to see what you did. on github ? — Chris Bedford, Aug 26 '19 at 00:06
Here it is: https://github.com/Gelerion/custom-partitioner/blob/master/src/main/scala/com/gelerion/spark/scala/learning/Extra.scala#L58 Just a prototype. Tell me if something is missing; I cut it down from a bigger project. Currently, I am working on a bin-packing partitioner, so I plan to devote more time to the matter. — Gelerion, Aug 30 '19 at 06:38
@Gelerion - very cool. while you were learning about how to do this w/ catalyst did you find any particularly good resources that helped you understand the mechanics ? i will definitely look @ what you did ! — Chris Bedford, Aug 31 '19 at 13:41
I found the paper describing basic concepts of catalyst - https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf But mostly debugging and Jira reading. — Gelerion, Sep 04 '19 at 08:04

score 2 · Answer 1 · answered Dec 04 '19 at 12:36

Check out my new library, which adds a partitionBy method at the Dataset/DataFrame API level.

Taking your Emp and Dept objects as an example:

class DeptByIdPartitioner extends TypedPartitioner[Dept] {
  override def getPartitionIdx(value: Dept): Int = if (value.deptId.startsWith("a")) 0 else 1
  override def numPartitions: Int = 2
  override def partitionKeys: Option[Set[PartitionKey]] = Some(Set(("deptId", StringType)))
}

class EmpByDepIdPartitioner extends TypedPartitioner[Emp] {
  override def getPartitionIdx(value: Emp): Int = if (value.deptId.startsWith("a")) 0 else 1
  override def numPartitions: Int = 2
  override def partitionKeys: Option[Set[PartitionKey]] = Some(Set(("deptId", StringType)))
}

Note that we are extending TypedPartitioner.
It is compile-time safe: you won't be able to repartition a dataset of persons with an Emp partitioner.

val spark = SparkBuilder.getSpark()

import org.apache.spark.sql.exchange.implicits._  //<-- additional import
import spark.implicits._

val deptPartitioned = department.repartitionBy(new DeptByIdPartitioner)
val empPartitioned  = employee.repartitionBy(new EmpByDepIdPartitioner)

Let's check how our data is partitioned:

Dept dataset:
Partition N 0
    : List([a,ant dept])
Partition N 1
    : List([d,duck dept], [c,cat dept], [r,rabbit dept], [b,badger dept], [z,zebra dept], [m,mouse dept])

If we join datasets that were repartitioned by the same key, Catalyst will properly recognize this:

val joined = deptPartitioned.join(empPartitioned, "deptId")

println("Joined:")
val result: Array[(Int, Array[Row])] = joined.rdd.glom().collect().zipWithIndex.map(_.swap)
for (elem <- result) {
  println(s"Partition N ${elem._1}")
  println(s"\t: ${elem._2.toList}")
}

Partition N 0
    : List([a,ant dept,anne])
Partition N 1
    : List([b,badger dept,bob], [c,cat dept,claire], [d,duck dept,dave], [m,mouse dept,moe], [r,rabbit dept,roy], [z,zebra dept,zelda])

score 0 · Answer 2 · answered Aug 10 '19 at 13:41

What version of Spark are you using? If it's 2.x or above, it's recommended to use the DataFrame/Dataset API instead of RDDs.

That API is much easier to work with than RDDs, and it performs much better on later versions of Spark.

You may find the link below useful for how to join DFs: How to join two dataframes in Scala and select on few columns from the dataframes by their index?

Once you get your joined DataFrame, you can use the link below for partitioning by column values, which I assume you're trying to achieve: Partition a spark dataframe based on column value?

Version 2.4.3. And yes, I am on board the DataFrame train 100% for most purposes. In rare cases (like when you need tighter control over partitioning than the column-expression-based repartition APIs of DataFrame give you), dropping down to RDDs is worthwhile. In fact, that is what the Python-based article I linked to does. Thanks for the answer, though. — Chris Bedford, Aug 10 '19 at 19:00
