
I am trying to migrate one of my applications from RDDs to Datasets. The business logic depends heavily on uniqueness and sorting, which is why we were previously using SortedSet.

Scala's immutable SortedSet is backed by a TreeSet, which provides O(log n) lookup, insertion, and deletion.

Unfortunately, in the current version of Spark there is no built-in Encoder for this collection in the Dataset API, and the only option is Kryo serialization, which is undesirable for me in this case.
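For reference, the Kryo route (applied to the A/B classes shown below) would look roughly like this; a minimal sketch, assuming a SparkSession named spark is already in scope:

import org.apache.spark.sql.{Encoder, Encoders}
import scala.collection.immutable.SortedSet

// The whole object is serialized into a single binary column, so
// Catalyst sees an opaque blob and cannot optimize on B's fields.
implicit val bEncoder: Encoder[B] = Encoders.kryo[B]

val ds = spark.createDataset(Seq(B(SortedSet(A(1), A(2)))))

That is exactly what I want to avoid, since the binary representation defeats Catalyst in complex operations like joins.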

I want to find a way around this problem and keep using the built-in Encoders, trading higher space complexity for good time complexity.

Here is an example:

import scala.collection.immutable.SortedSet

case class A(value: Long) extends AnyVal {
  def +(delta: Long): A = A(value + delta)
  def -(delta: Long): A = A(value - delta)
}

object A {
  // SortedSet[A] needs an Ordering[A] in implicit scope
  implicit val ordering: Ordering[A] = Ordering.by(_.value)
}

case class B(values: SortedSet[A]) {
  def +(a: A): B = B(values + a)
  def -(a: A): B = B(values - a)

  def ++(that: B): B = B(values ++ that.values)
  def --(that: B): B = B(values -- that.values)

  def lastA: Option[A] = values.lastOption
}

This code fails at runtime because there is no built-in Encoder for SortedSet. Spark does allow us to keep an Array or Seq inside a Dataset, so the solution should build on those while still preventing duplicates and keeping elements sorted (with efficient insertion, deletion, etc.); one direction I have considered is sketched below.
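Concretely (a sketch only; BRow, toB, and fromB are hypothetical names I made up): keep an already sorted, duplicate-free Seq[Long] at the Dataset boundary, which the built-in product encoders can handle, and rebuild the SortedSet only inside the business logic:

import scala.collection.immutable.SortedSet

// Encoder-friendly mirror of B: a plain product with a Seq field,
// which Spark's built-in encoders support out of the box.
case class BRow(values: Seq[Long]) {
  // Rebuild the set-based model when set semantics are needed.
  def toB: B = B(SortedSet(values.map(A(_)): _*))
}

object BRow {
  // Flatten B back to a sorted, duplicate-free Seq for storage.
  def fromB(b: B): BRow = BRow(b.values.toSeq.map(_.value))
}

For example, with import spark.implicits._ in scope, ds.map(r => BRow.fromB(r.toB + A(42))) stays within the built-in encoders, at the cost of rebuilding the tree on every operation. I am not sure this is the best trade-off, hence the question.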

What approach would be the best fit?

burak kose
  • _kryo serialization which is undesirable for me in this case_ - would you mind expanding a bit on this part? Based on what you've said so far, there is nothing wrong with Kryo serialization. It has some downsides, but so do specialized encoders. The code you've shown so far strongly suggests you're going with `Dataset[T], T ∉ Row`, and if that's the case, the `Kryo` `Encoder` limitations should be perfectly acceptable. – zero323 May 01 '18 at 16:50
  • The problem is having a nested structure in the business models. If it is a `Dataset[B]`, it is okay to use Kryo. However, if B is a field in another model, which is itself nested in another, and so on very deeply, you cannot mix Kryo serialization with the built-in encoders that the Spark session provides. Eventually you end up using only Kryo serialization, and that mostly means missing Spark's Catalyst optimizations in complex operations like joins, because the data is represented as binary in the background. – burak kose May 01 '18 at 19:18
  • If you're going with a complex "strongly" typed structure, you lose most of the optimizations anyway. And the implementation of a built-in `Seq` `Encoder` is just a nightmare (check https://stackoverflow.com/q/47293454/6910411). If you really want to go this way, I would recommend a flat structure with the particular field using Kryo encoding. – zero323 May 01 '18 at 19:27

0 Answers