3

I want to compare two files if not matched extra records load into another file with the unmatched records. Compare each and every fields in both file and count of records also.

Community
  • 1
  • 1
Nathon
  • 165
  • 1
  • 4
  • 13

1 Answers1

7

Let's say you have two files:

scala> val a = spark.read.option("header", "true").csv("a.csv").alias("a"); a.show
+---+-----+
|key|value|
+---+-----+
|  a|    b|
|  b|    c|
+---+-----+

a: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> val b = spark.read.option("header", "true").csv("b.csv").alias("b"); b.show
+---+-----+
|key|value|
+---+-----+
|  b|    c|
|  c|    d|
+---+-----+

b: org.apache.spark.sql.DataFrame = [key: string, value: string]

It is unclear which sort of unmatched records you are looking for, but it is easy to find them by any definition with join:

scala> a.join(b, Seq("key")).show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
|  b|    c|    c|
+---+-----+-----+

scala> a.join(b, Seq("key"), "left_outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
|  a|    b| null|
|  b|    c|    c|
+---+-----+-----+

scala> a.join(b, Seq("key"), "right_outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
|  b|    c|    c|
|  c| null|    d|
+---+-----+-----+

scala> a.join(b, Seq("key"), "outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
|  c| null|    d|
|  b|    c|    c|
|  a|    b| null|
+---+-----+-----+

If you are looking for the records in b.csv that are not present in a.csv:

scala> val diff = a.join(b, Seq("key"), "right_outer").filter($"a.value" isNull).drop($"a.value")
scala> diff.show
+---+-----+
|key|value|
+---+-----+
|  c|    d|
+---+-----+

scala> diff.write.csv("diff.csv")
Daniel Darabos
  • 26,991
  • 10
  • 102
  • 114