The Use Case
My data is written as dataframes, and I would like to check two dataframes that share the exact same schema for equality. Specifically, I want to check whether, for each id value, the records from the first and second dataframe are identical. In other words, assuming each dataframe has one record per id, I wish to juxtapose, per id, the differences between the row from dataframe one and the row from dataframe two.
My assumption is that I need to materialize a new dataframe (i.e. via a join operation) in order to perform this at scale with Spark. Am I right so far in this assumption?
Here's the code in that vein so far:
val postsFromDF1: Dataset[Post] = ... // dataframe read as a Dataset of Scala Objects
val postsFromDF2: Dataset[Post] = ... // dataframe read as a Dataset of Scala Objects
val joined: DataFrame = postsFromDF1.as("df1").join(postsFromDF2.as("df2"), usingColumn = "id")
Now I would like to list all differences between those id-matched objects that are not identical in their values (except, of course, the shared id field they were joined by). Because some of the values are themselves collections of objects, working with a tree of Scala objects seems more readable and intuitive to me than switching to work at the column-name level after this join. Comments so far? Is this a good way to be working with Spark?
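To make the object-level idea concrete, here is the kind of per-field comparison I picture writing. It is only a rough sketch, not code I have running; it relies solely on Post being a case class (a Product), so it does not assume any particular field names:

// Rough sketch: compares two id-matched Post records field by field, relying only
// on the Product interface every case class implements. Returns the positions
// (and values) of the fields that differ.
def diffFields(a: Post, b: Post): Seq[(Int, Any, Any)] =
  (0 until a.productArity)
    .map(i => (i, a.productElement(i), b.productElement(i)))
    .filter { case (_, left, right) => left != right }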
My Final Question
How can I get back a pair of objects (one object per original dataframe) for each row of the join, while still enjoying Spark's parallelism when comparing the objects?
An object representation like this:
case class PostPair(post: Post, otherPost: Post, id: String)
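Assuming such a Dataset[PostPair] (call it pairs) can be produced, this is roughly how I imagine consuming it, with the comparison itself running inside Spark tasks. It is illustrative only and reuses the hypothetical diffFields sketch above:

// Assumes: a Dataset[PostPair] named pairs, and import spark.implicits._ in scope
// for the (String, String) encoder. Keeps only ids whose two records differ.
val differences: Dataset[(String, String)] =
  pairs
    .filter(pair => pair.post != pair.otherPost)
    .map(pair => (pair.id, diffFields(pair.post, pair.otherPost).mkString("; ")))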
What I Tried
I tried hammering out this experimental code, but it fails at runtime; probably the Encoders.product implicit is not sufficiently descriptive.
case class PostPair(post: Post, otherPost: Post, id: String)
implicit val encoder = Encoders.product[PostPair]
val joined: Dataset[PostPair] =
  postsFromDF1.as("df1")
    .join(postsFromDF2.as("df2"), usingColumn = "id")
    .as[PostPair]
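A direction I have not fully explored yet: as far as I understand, Dataset.joinWith keeps each side of the join as a complete object (it returns a Dataset of pairs rather than a flattened DataFrame), which sounds closer to the PostPair shape I am after. A minimal sketch of what I have in mind, assuming Post exposes the id field used as the join key:

// joinWith yields a Dataset[(Post, Post)]; mapping into PostPair needs
// import spark.implicits._ in scope for the case-class encoder.
// Assumes Post has an id: String field matching the join key.
val paired: Dataset[PostPair] =
  postsFromDF1
    .joinWith(postsFromDF2, postsFromDF1("id") === postsFromDF2("id"))
    .map { case (post, otherPost) => PostPair(post, otherPost, post.id) }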
Additional Information
Here's how I obtain a collection of case classes from each dataframe separately:
case class PostsParquetReader(spark: SparkSession) {

  /** default method applied when the object is called */
  def apply(path: String): Dataset[Post] = {
    val df = spark.read.parquet(path)
    toCaseClass(spark, df)
  }

  /** applies the secret sauce for coercing to a case class, via Spark's flatMap */
  private def toCaseClass(spark: SparkSession, idf: DataFrame): Dataset[Post] = {
    import spark.implicits._
    idf.as[Post].flatMap(record => Iterator[Post](record))
  }
}
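For completeness, this is roughly how the two Datasets at the top are obtained (the paths are just placeholders):

// Placeholder paths; each call yields a Dataset[Post] backed by one parquet dataset.
val readPosts = PostsParquetReader(spark)
val postsFromDF1: Dataset[Post] = readPosts("/path/to/first/posts.parquet")
val postsFromDF2: Dataset[Post] = readPosts("/path/to/second/posts.parquet")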
I feel that reusing the same object-coercion approach after the join might just be cumbersome, or perhaps this approach has drawbacks in terms of Spark's parallelism / distributed execution to begin with.
On the other hand, coding the comparison and display of the differences over object records, as if the data were plain Scala object trees, seems like the most readable and flexible approach, as it lets me leverage the standard Scala collections API.