
I like Spark Datasets as they give me analysis errors and syntax errors at compile time and also allow me to work with getters instead of hard-coded names/numbers. Most computations can be accomplished with the Dataset's high-level APIs. For example, it's much simpler to perform agg, select, sum, avg, map, filter, or groupBy operations by accessing a Dataset's typed objects than by using the data fields of RDD rows.
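
To make concrete what I mean, here is a toy example (the Person case class and data are just for illustration):

case class Person(name: String, age: Long)

// assuming the usual import sqlContext.implicits._ for toDS()
val people: Dataset[Person] = Seq(Person("Ann", 30), Person("Bob", 12)).toDS()

// typed access via the case class getters, checked at compile time
val adultNames = people.filter(_.age >= 18).map(_.name)

// versus positional access on RDD rows, e.g. row.getLong(1) / row.getString(0),
// where a wrong index or type only fails at runtime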

However, the join operation is missing from this list. I have read that I can do a join like this:

ds1.joinWith(ds2, ds1.toDF().col("key") === ds2.toDF().col("key"), "inner")

But that is not what I want, as I would prefer to do it via the case class interface, so something more like this:

ds1.joinWith(ds2, ds1.key === ds2.key, "inner")

The best alternative for now seems to be creating a companion object next to the case class and giving it functions that provide the right column name as a String. So I would use the first line of code, but with a function call instead of a hard-coded column name. That still doesn't feel elegant enough, though.
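
For reference, the workaround I have in mind would look roughly like this (the names are just for illustration):

case class KeyValue(key: Int, value: String)

object KeyValue {
  // single place that knows the actual column names
  val keyCol = "key"
  val valueCol = "value"
}

// the join then references the helper instead of a string literal
ds1.joinWith(ds2, ds1.toDF().col(KeyValue.keyCol) === ds2.toDF().col(KeyValue.keyCol), "inner")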

Can someone advise me on other options here? The goal is to have an abstraction over the actual column names and to work preferably via the getters of the case class.

I'm using Spark 1.6.1 and Scala 2.10.

– Sparky

2 Answers


Observation

Spark SQL can optimize a join only if the join condition is based on the equality operator. This means we can consider equijoins and non-equijoins separately.

Equijoin

An equijoin can be implemented in a type-safe manner by mapping both Datasets to (key, value) tuples, performing the join based on the keys, and reshaping the result:

import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Dataset

def safeEquiJoin[T, U, K](ds1: Dataset[T], ds2: Dataset[U])
    (f: T => K, g: U => K)
    (implicit e1: Encoder[(K, T)], e2: Encoder[(K, U)], e3: Encoder[(T, U)]) = {
  // key both Datasets with the user-supplied extractor functions
  val ds1_ = ds1.map(x => (f(x), x))
  val ds2_ = ds2.map(x => (g(x), x))
  // join on the key column (_1) and drop the keys from the result
  ds1_.joinWith(ds2_, ds1_("_1") === ds2_("_1")).map(x => (x._1._2, x._2._2))
}

Non-equijoin

A non-equijoin can be expressed using relational algebra operators as R ⋈θ S = σθ(R × S) and converted directly to code.

Spark 2.0

Enable cross joins and use joinWith with a trivially true predicate:

spark.conf.set("spark.sql.crossJoin.enabled", true)

import org.apache.spark.sql.functions.lit

def safeNonEquiJoin[T, U](ds1: Dataset[T], ds2: Dataset[U])
                         (p: (T, U) => Boolean) = {
  // cross join on a trivially true condition, then filter with the typed predicate
  ds1.joinWith(ds2, lit(true)).filter(p.tupled)
}

Spark 2.1

Use the crossJoin method:

def safeNonEquiJoin[T, U](ds1: Dataset[T], ds2: Dataset[U])
    (p: (T, U) => Boolean)
    (implicit e1: Encoder[Tuple1[T]], e2: Encoder[Tuple1[U]], e3: Encoder[(T, U)]) = {
  // crossJoin returns a DataFrame, so wrap each row in Tuple1 first,
  // recover the typed pair with as[(T, U)], and filter with the predicate
  ds1.map(Tuple1(_)).crossJoin(ds2.map(Tuple1(_))).as[(T, U)].filter(p.tupled)
}

Examples

case class LabeledPoint(label: String, x: Double, y: Double)
case class Category(id: Long, name: String)

val points1 = Seq(LabeledPoint("foo", 1.0, 2.0)).toDS
val points2 = Seq(
  LabeledPoint("bar", 3.0, 5.6), LabeledPoint("foo", -1.0, 3.0)
).toDS
val categories = Seq(Category(1, "foo"), Category(2, "bar")).toDS

safeEquiJoin(points1, categories)(_.label, _.name)
safeNonEquiJoin(points1, points2)(_.x > _.x)
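
If it helps, with the sample data above these calls should produce roughly the following contents (modulo ordering):

safeEquiJoin(points1, categories)(_.label, _.name).collect()
// Array((LabeledPoint(foo,1.0,2.0), Category(1,foo)))

safeNonEquiJoin(points1, points2)(_.x > _.x).collect()
// Array((LabeledPoint(foo,1.0,2.0), LabeledPoint(foo,-1.0,3.0)))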

Notes

  • It should be noted that these methods are qualitatively different from a direct joinWith application and require expensive DeserializeToObject / SerializeFromObject transformations (whereas a direct joinWith can use logical operations on the data).

    This is similar to the behavior described in Spark 2.0 Dataset vs DataFrame.
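
    If you want to verify this yourself, comparing the physical plans is a quick check (a sketch; the exact plan output depends on the Spark version):

    // the wrapper: expect SerializeFromObject / DeserializeToObject around the map stages
    safeEquiJoin(points1, categories)(_.label, _.name).explain()
    // a direct joinWith on the original Datasets, for comparison
    points1.joinWith(categories, points1("label") === categories("name")).explain()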

  • If you're not limited to the Spark SQL API, frameless provides interesting type-safe extensions for Datasets (as of today it supports only Spark 2.0):

    import frameless.TypedDataset
    
    val typedPoints1 = TypedDataset.create(points1)
    val typedPoints2 = TypedDataset.create(points2)
    
    typedPoints1.join(typedPoints2, typedPoints1('x), typedPoints2('x))
    
  • The Dataset API is not stable in 1.6, so I don't think it makes sense to use it there.

  • Of course this design and the descriptive names are not necessary. You can easily use a type class to add these methods implicitly to Dataset, and there is no conflict with the built-in signatures, so both can be called joinWith. For example:
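
    A minimal sketch of what that could look like (the wrapper and object names are arbitrary; it assumes safeEquiJoin and the imports from above are in scope):

    object DatasetSyntax {
      implicit class RichDataset[T](ds: Dataset[T]) {
        // typed equijoin exposed as an extension method on Dataset
        def joinWith[U, K](other: Dataset[U])(f: T => K, g: U => K)
            (implicit e1: Encoder[(K, T)], e2: Encoder[(K, U)], e3: Encoder[(T, U)]): Dataset[(T, U)] =
          safeEquiJoin(ds, other)(f, g)
      }
    }

    import DatasetSyntax._
    points1.joinWith(categories)(_.label, _.name)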

– zero323
  • The `safeEquiJoin` example just reinforces the absence of an out-of-the-box way to do a fully type-safe join, by encasing an invocation of `joinWith` that specifies tuple members in quotes (`"_1"`) in a nice wrapper whose implementation tends toward the crufty end of the spectrum. – nclark Dec 08 '17 at 19:18
  • @nclark While I agree with the overall sentiment, you have to keep in mind that the `Dataset` API is not type-safe at all. It is just an abstraction over an effectively untyped container and native memory access. At the point you call `as[T]`, it is as good as depending on `asInstanceOf` and matching fields by name. If you're looking for end-to-end type safety then the `RDD` API is still irreplaceable. There are more elegant implementations of a "type-safe" join, but as unsatisfying as it is, at the end of the day they'll do the same thing (match names) and hope for the best. – zero323 Dec 18 '17 at 17:38

Also, another, bigger problem with the non-type-safe Spark API is that when you join two Datasets, it gives you a DataFrame, and you then lose the types of your original two Datasets.

val a: Dataset[A]
val b: Dataset[B]

val joined: DataFrame = a.join(b)
// what would be great is:
val joined: Dataset[C] = a.join(b)(implicit func: (A, B) => C)
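
One way to approximate that signature with the current API is to go through joinWith, which returns Dataset[(A, B)] and therefore keeps both types, and then map the pairs into C. A sketch (typedJoin, keyA, keyB, and combine are made-up names; it still takes the key columns as Strings, so it does not solve the column-name abstraction from the question, but the result stays typed):

import org.apache.spark.sql.{Dataset, Encoder}

def typedJoin[A, B, C: Encoder](a: Dataset[A], b: Dataset[B], keyA: String, keyB: String)
                               (combine: (A, B) => C): Dataset[C] =
  a.joinWith(b, a(keyA) === b(keyB)).map(combine.tupled)

// usage: typedJoin(a, b, "id", "id") { (x, y) => /* build a C from x and y */ }
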
– linehrr
  • Not entirely crystal clear what the code snippet is demonstrating or pointing at, actually. – matanster Dec 23 '19 at 10:40
  • @matanster The outcome type of the join could then be checked at compile time: e.g. if the outcome is `out: Dataset[C]`, calling `out.map(_.fieldA)` when type `C` does not have `fieldA` will fail during compilation instead of exploding at runtime. – linehrr Dec 25 '19 at 01:20
  • @matanster It basically means `join` is an untyped transformation in `Dataset`. – jack Sep 25 '20 at 00:02