
The Use Case

My data is stored as dataframes, and I would like to check two dataframes that share the exact same schema for equality. Specifically, I want to check whether, for each id value, the records from the first and the second dataframe are identical. In other words, assume that each dataframe has one record per id; I wish to juxtapose, per id, the row from dataframe one against the row from dataframe two and surface their differences.

My assumption is that I need to materialize a new dataframe (e.g. via a join operation) in order to perform this at scale with Spark. Am I right in this assumption so far?

Here's the code in that vein so far:

  val postsFromDF1: Dataset[Post] = ... // dataframe read as a Dataset of Scala Objects
  val postsFromDF2: Dataset[Post] = ... // dataframe read as a Dataset of Scala Objects

  val joined: DataFrame = postsFromDF1.as("df1").join(postsFromDF2.as("df2"), usingColumn = "id")

Now I would like to list all differences between those id-matched objects that are not identical in their values (except, of course, the shared id field they were joined by). Because some of the values are themselves collections of objects, working with a tree of Scala objects seems more readable and intuitive to me than switching to work at the column-name level after this join. Comments so far? Is this a good way of working with Spark?

My Final Question

How can I get back a pair of object representations (one object per original dataframe) for each row of the join, while still enjoying Spark's parallelism when comparing the objects?

An object representation like this:

case class PostPair(post: Post, otherPost: Post, id: String)

What I Tried

I tried hammering out this experimental code, but it fails at runtime; probably the Encoders.product implicit is not descriptive enough.

  case class PostPair(post: Post, otherPost: Post, id: String)
  implicit val encoder = Encoders.product[PostPair]

  val joined: Dataset[PostPair] =
    postsFromDF1.as("df1")
      .join(postsFromDF2.as("df2"), usingColumn = "id")
      .as[PostPair]

Additional Information

Here's how I obtain a collection of case classes from each dataframe separately:

case class PostsParquetReader(spark: SparkSession) {
  /** default method applied when the object is called */
  def apply(path: String): Dataset[Post] = {
    val df = spark.read.parquet(path)
    toCaseClass(spark, df)
  }

  /** applies the secret sauce for coercing to a case class, via Spark's flatMap */
  private def toCaseClass(spark: SparkSession, idf: DataFrame): Dataset[Post] = {
    import spark.implicits._
    idf.as[Post].flatMap(record => Iterator[Post](record))
  }
}

I feel that using the same object-coercion approach after the join might just be cumbersome, or perhaps this object-coercion approach has drawbacks in terms of Spark parallelism / distributed execution to begin with.

On the other hand, coding the comparison and the display of the differences over object records, as if the data were plain Scala object trees, seems like the most readable and flexible approach, since it allows leveraging the standard Scala collections API.
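
For illustration, the kind of object-level comparison I have in mind would look roughly like the sketch below (the title and tags fields are made-up placeholders, not my real schema):

  // hypothetical Post shape, for illustration only
  case class Post(id: String, title: String, tags: Seq[String])

  // compares two id-matched posts field by field and describes their differences
  def describeDifferences(post: Post, otherPost: Post): Seq[String] = {
    val titleDiff =
      if (post.title != otherPost.title) Seq(s"title: ${post.title} vs ${otherPost.title}") else Nil
    // nested collections compare naturally with the plain Scala collections API
    val tagsDiff =
      if (post.tags != otherPost.tags) Seq(s"tags: ${post.tags} vs ${otherPost.tags}") else Nil
    titleDiff ++ tagsDiff
  }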

  • Does this answer your question? [Perform a typed join in Scala with Spark Datasets](https://stackoverflow.com/questions/40605167/perform-a-typed-join-in-scala-with-spark-datasets) – user10938362 Dec 23 '19 at 10:09
  • Thank you. On first and second read, what I mainly learn from there is that the Spark and object-oriented paradigms are at an impedance mismatch in how Spark is designed to be used. Would you concur? I will go through all that's described there more. – matanster Dec 23 '19 at 10:33
  • I would not regard any answer to that question as a direct (nor comprehensive) answer here. My question is also slightly broader, asking more explicitly whether working with objects plays well with Spark's architecture of concurrency and distribution, hopefully begging a _comprehensive_ answer to its specific use case. – matanster Dec 23 '19 at 10:53
  • What exact output/schema do you want for the result dataframe? I feel you can use the `except` function to reach the goal, but maybe you want to generate a specific new dataframe – Binzi Cao Dec 23 '19 at 11:08
  • @BinziCao I would like to use something like this as the result type: `case class PostPair(post: Post, otherPost: Post, id: String)` – matanster Dec 23 '19 at 11:12

3 Answers


The solution by @Binzi would work and needs some refinement.

@matanster Your approach is good; you can use the Dataset API instead of the DataFrame API. The Dataset API is backed by a Scala case class and makes complex operations easier. The scalability is the same, but performance is a little lower compared to DataFrame. I always prefer Dataset for complex operations.

You can create the Dataset directly from the raw data; you needn't write toCaseClass(spark, df) yourself. The case class schema must match your data schema.

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Post() // define all properties here, matching the parquet schema

    val spark: SparkSession = SparkSession.builder
      .appName("name")
      .master("local[2]")
      .getOrCreate()

    import spark.implicits._
    val postsFromDF1: Dataset[Post] = spark.read.parquet(path).as[Post]
    val postsFromDF2: Dataset[Post] = spark.read.parquet(path).as[Post]

    // joinWith requires an explicit join condition
    val joinedDs = postsFromDF1.joinWith(
      postsFromDF2,
      postsFromDF1.col("id") === postsFromDF2.col("id"))

joinedDs is a Dataset of (Post, Post) tuples. You can then apply your logic to each tuple and to the Post objects within it (as you figured).
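
For example, a minimal sketch of such logic, relying on case class equality (which compares all fields, including nested collections):

    // keep only the id-matched pairs whose two sides differ in any field
    val differingPairs = joinedDs.filter { case (post, otherPost) => post != otherPost }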

Performance

A Dataset needs to decode the whole row into an object before it can be operated on, so essentially it can't benefit from columnar storage. In a DataFrame, however, you can read just a few columns and operate on them; since Parquet is columnar, avoiding reading all the other columns saves a lot of time. Other than this I don't see any other performance difference. The scalability is exactly the same.
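
To make the column-pruning point concrete, here is an illustrative sketch (it assumes the path value, the Post case class and the id column from the examples above):

    // DataFrame route: Parquet column pruning applies, only the "id" column is scanned
    val idsOnly = spark.read.parquet(path).select("id")

    // Dataset route: each row is first decoded into a full Post object before the lambda runs
    val idsFromObjects = spark.read.parquet(path).as[Post].map(_.id)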

– Salim

This seems to be what you want:

  // assumes a SparkSession `spark` with spark.implicits._ in scope (as in the spark-shell)
  import org.apache.spark.sql.functions._

  case class Post(id: Int, name: String, age: Long)
  case class PostPair(post: Post, otherPost: Post, id: String)

  val tom = Post(1,"Tom",37)
  val sam = Post(2, "Sam",40)
  val sam2 = Post(2, "Sam",41)

  val postsFromDF1 = List(tom, sam).toDS
  val postsFromDF2 = List(tom, sam2).toDS

  val columns  = struct(postsFromDF1.columns.map(col(_)):_*)

  val result = postsFromDF1.except(postsFromDF2).
    select(
      columns.alias("post"),
      col("id")
    ).
    join(
      postsFromDF2.select(
        columns.alias("otherPost"),
        col("id")
      ),
      "id"
    ).as[PostPair]


  result.show()

+---+------------+------------+
| id|        post|   otherPost|
+---+------------+------------+
|  2|[2, Sam, 40]|[2, Sam, 41]|
+---+------------+------------+
– Binzi Cao
  • Would you expect any fundamental difference in scalability compared to [my answer](https://stackoverflow.com/a/59456203/1509695)? – matanster Dec 23 '19 at 13:46
  • @matanster based on your original requirements: 1. using `except` keeps only the records which do not match between the two, so there is no further `filter` needed. 2. The `filter` code is subject to any schema update, while `except` handles any schema update. 3. The dataframe generated via `except` would be smaller than the `joined` one. You can potentially cache it – Binzi Cao Dec 24 '19 at 00:08

I found joinWith, which preserves object semantics for the client code without requiring any additional user code. This Spark API function seamlessly enables the use of the original object types, at least for the case of a simple join. In my code example, it yields a Dataset[(Post, Post)].

val joined =
  postsFromDF1.joinWith(
    postsFromDF2,
    postsFromDF1.col("id") === postsFromDF2.col("id"))

The result is a collection of 2-tuples, and can be used accordingly, e.g.

joined.filter(pair => pair._1.someField != pair._2.someField)
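
If the PostPair representation from my question is wanted, the tuples can be mapped into it along these lines (a sketch, assuming Post exposes the id field used for the join, that its type matches PostPair.id, and that spark.implicits._ is in scope to derive the PostPair encoder):

// repackage each id-matched tuple into the PostPair shape from the question
val pairs = joined.map { case (post, otherPost) => PostPair(post, otherPost, post.id) }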

Any comments as to scalability and memory consumption vs. other solutions are welcome.

– matanster
  • I use Dataset a lot and I don't see any issue with scalability. The performance can be a little lower compared to DataFrame, but I will always go with Dataset – Salim Dec 23 '19 at 17:02