I want to compare two files and, where records do not match, load the extra records into another file containing the unmatched records. I need to compare every field in both files, and the record counts as well.
- What is the structure of the files? – Alberto Bonsanto Sep 15 '16 at 19:23
- It's CSV format. – Nathon Sep 16 '16 at 04:27
- What is their schema? Which column are you comparing on? Are there any constraints? – Alberto Bonsanto Sep 16 '16 at 04:35
1 Answer
Let's say you have two files:
scala> val a = spark.read.option("header", "true").csv("a.csv").alias("a"); a.show
+---+-----+
|key|value|
+---+-----+
| a| b|
| b| c|
+---+-----+
a: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> val b = spark.read.option("header", "true").csv("b.csv").alias("b"); b.show
+---+-----+
|key|value|
+---+-----+
| b| c|
| c| d|
+---+-----+
b: org.apache.spark.sql.DataFrame = [key: string, value: string]
It is unclear which sort of unmatched records you are looking for, but whatever the definition, it is easy to find them with a join:
scala> a.join(b, Seq("key")).show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
| b| c| c|
+---+-----+-----+
scala> a.join(b, Seq("key"), "left_outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
| a| b| null|
| b| c| c|
+---+-----+-----+
scala> a.join(b, Seq("key"), "right_outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
| b| c| c|
| c| null| d|
+---+-----+-----+
scala> a.join(b, Seq("key"), "outer").show
+---+-----+-----+
|key|value|value|
+---+-----+-----+
| c| null| d|
| b| c| c|
| a| b| null|
+---+-----+-----+
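Since the question asks to compare every field rather than just a key, `DataFrame.except` computes the difference over whole rows. A minimal sketch, building the same two datasets inline instead of reading the CSVs (a local `SparkSession` is assumed; in `spark-shell` one already exists as `spark`):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("row-diff").getOrCreate()
import spark.implicits._

// Same data as a.csv and b.csv above.
val a = Seq(("a", "b"), ("b", "c")).toDF("key", "value")
val b = Seq(("b", "c"), ("c", "d")).toDF("key", "value")

// Rows of a that have no exact match in b (every column compared), and vice versa.
val onlyInA = a.except(b)   // row (a, b)
val onlyInB = b.except(a)   // row (c, d)
```

Note that `except` treats the entire row as the comparison unit, so a record that matches on `key` but differs in `value` counts as unmatched.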
If you are looking for the records in b.csv that are not present in a.csv:
scala> val diff = a.join(b, Seq("key"), "right_outer").filter($"a.value".isNull).drop($"a.value")
scala> diff.show
+---+-----+
|key|value|
+---+-----+
| c| d|
+---+-----+
scala> diff.write.csv("diff.csv")
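The question also asks for record counts, which `count` on each DataFrame provides. A sketch with the data recreated inline so the snippet stands alone; here `diff` is computed with `except` over whole rows, which for this data yields the same unmatched record as the right-outer join above:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("counts").getOrCreate()
import spark.implicits._

val a = Seq(("a", "b"), ("b", "c")).toDF("key", "value")
val b = Seq(("b", "c"), ("c", "d")).toDF("key", "value")
val diff = b.except(a)  // rows of b with no exact match in a

// Record counts for each file and for the unmatched set.
println(s"records in a: ${a.count}")
println(s"records in b: ${b.count}")
println(s"unmatched:    ${diff.count}")
```

`count` triggers a full scan of each DataFrame, so on large files it is worth caching a DataFrame that is both counted and written out.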

Daniel Darabos