I am trying to migrate one of my applications from RDDs to Datasets. The business logic depends heavily on uniqueness and ordering, which is why we previously used SortedSet.
SortedSet in Scala is backed by TreeSet, which provides O(log n) lookup, insertion, and deletion.
Unfortunately, the current version of Spark provides no Encoder for this collection in the Dataset API, and the only option is Kryo serialization, which is undesirable for me in this case.
I want to find a way around this problem and use Encoders, trading higher space complexity for good time complexity.
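For reference, the Kryo route I am trying to avoid looks roughly like this (a sketch, assuming a case class B like the one in the example below and an existing SparkSession named spark):

```scala
import org.apache.spark.sql.{Encoder, Encoders}

// Opaque binary encoder: this compiles and runs, but Spark no longer
// sees the structure of B -- no columns, no pruning, no predicate pushdown.
implicit val bEncoder: Encoder[B] = Encoders.kryo[B]

val ds = spark.createDataset(Seq(B(SortedSet(A(1), A(2)))))
```

The whole object is stored as a single binary blob, which is why I would rather find an Encoder-friendly representation.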
Here is an example:
import scala.collection.immutable.SortedSet

case class A(value: Long) extends AnyVal {
  def +(delta: Long): A = A(value + delta)
  def -(delta: Long): A = A(value - delta)
}

object A {
  // Ordering is required to construct a SortedSet[A]
  implicit val ordering: Ordering[A] = Ordering.by[A, Long](_.value)
}
case class B(values: SortedSet[A]) {
  def +(a: A): B = B(values + a)
  def -(a: A): B = B(values - a)
  def ++(that: B): B = B(values ++ that.values)
  def --(that: B): B = B(values -- that.values)
  def last: Option[A] = values.lastOption
}
This code will fail at runtime because there is no Encoder for SortedSet. Spark does allow keeping an Array or Seq inside a Dataset, so the solution should build on those while preventing duplicates and preserving sorted order (on insertion, deletion, etc.).
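One direction I am considering (a sketch only; the names normalized and toSortedSet are mine) is to store a plain Seq[A], which Spark can encode, and re-establish the sorted-and-distinct invariant after every mutating operation, converting to a real SortedSet only at the boundary where set semantics are needed:

```scala
import scala.collection.immutable.SortedSet

case class A(value: Long) extends AnyVal

object A {
  // Ordering needed to sort Seq[A] and to build SortedSet[A]
  implicit val ordering: Ordering[A] = Ordering.by[A, Long](_.value)
}

// Encoder-friendly wrapper: Spark can encode Seq[A] with the built-in
// product/collection Encoders. The invariant (sorted, no duplicates) is
// restored after each operation, at O(n log n) per op instead of O(log n).
case class B(values: Seq[A]) {
  private def normalized(vs: Seq[A]): Seq[A] = vs.distinct.sorted

  def +(a: A): B  = B(normalized(values :+ a))
  def -(a: A): B  = B(values.filterNot(_ == a))          // removal preserves the invariant
  def ++(that: B): B = B(normalized(values ++ that.values))
  def --(that: B): B = B(values.filterNot(that.values.contains))
  def last: Option[A] = values.lastOption                 // max element when the invariant holds

  // Convert back to a SortedSet where true set semantics are required
  def toSortedSet: SortedSet[A] = SortedSet(values: _*)
}
```

The caveat is that the invariant is only maintained if B is always constructed through these operations; a raw B(Seq(...)) call bypasses normalization, so a smart constructor in the companion object may be worth adding.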
What approach would be the best fit?