
I have json files of the following structure:

{"names":[{"name":"John","lastName":"Doe"},
{"name":"John","lastName":"Marcus"},
{"name":"David","lastName":"Luis"}
]}

I want to read several such JSON files and deduplicate them based on the "name" field inside names. I tried

df.dropDuplicates(Array("names.name")) 

but it didn't do the magic.
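For context, the DataFrame is created along these lines (a minimal sketch; the session setup and path are assumptions, not my exact code):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dedupNames").getOrCreate()
// a glob pattern reads several json files into one DataFrame
val df = spark.read.json("/path/to/files/*.json")
df.printSchema()
// root
//  |-- names: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- lastName: string (nullable = true)
//  |    |    |-- name: string (nullable = true)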

Hagai
  • After you create a DataFrame from the JSON files, your question becomes a duplicate of https://stackoverflow.com/questions/30248221/removing-duplicates-from-rows-based-on-specific-columns-in-an-rdd-spark-datafram – Remy Kabel Jun 12 '17 at 17:55

3 Answers


This seems to be a regression introduced in Spark 2.0. If you bring the nested column up to the top level you can drop the duplicates: create a new column from the columns you want to dedup on, drop duplicates on that new column, and finally drop the helper column. The following works for composite keys as well.

import org.apache.spark.sql.functions.{col, concat_ws}

val columns = Seq("names.name")
df.withColumn("DEDUP_KEY", concat_ws(",", columns.map(col): _*))
  .dropDuplicates("DEDUP_KEY")
  .drop("DEDUP_KEY")
Alex Naspo
  • not sure why this will work, since it seems that the DEDUP_KEY column will contain the names separated by a comma, so .dropDuplicates("DEDUP_KEY") won't work correctly, will it? – Hagai Jun 12 '17 at 18:14
  • No, the comma delimiter is for when you have more than one key you want to dedup on (a composite key). In your case it will be an additional top-level column with the name so you can dedup. Did you give it a try? – Alex Naspo Jun 12 '17 at 18:27
  • yes I have. I looked at the results using .show() and it seems to create a DEDUP_KEY column with the names separated by a comma, so dropDuplicates doesn't work as expected. – Hagai Jun 12 '17 at 18:44
  • yes, I am sorry. You would first have to explode on names, then dedup. My apologies. – Alex Naspo Jun 12 '17 at 18:46
  • thanks. I replied with what seems to be the solution using explode. – Hagai Jun 12 '17 at 19:08

Just for future reference, the solution looks like:

import org.apache.spark.sql.functions.{col, explode}

val uniqueNames = allNames
  .withColumn("DEDUP_NAME_KEY", explode(col("names.name")))
  .cache()
  .dropDuplicates("DEDUP_NAME_KEY")
  .drop("DEDUP_NAME_KEY")
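If only the distinct names themselves are needed, rather than whole deduplicated rows, a shorter variant of the same explode idea (a sketch against the same allNames DataFrame) is:

import org.apache.spark.sql.functions.{col, explode}

allNames
  .select(explode(col("names.name")).as("name"))
  .distinct()
  .show()
// for the sample data this yields the two distinct names, John and David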
Hagai

As an update to the existing answers, a similar thing can be achieved without explode. We can simply take the value of each column and concatenate the values to generate the DEDUPE_KEY:

import org.apache.spark.sql.functions.{col, concat_ws}

val columns = Seq("names.name")
df.withColumn("DEDUPE_KEY", concat_ws("_", columns.map(att => col(att)): _*))
  .dropDuplicates("DEDUPE_KEY")
  .drop("DEDUPE_KEY")
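Note that for an array column such as names.name, concat_ws joins all of the array's elements into one key per row (a sketch, assuming the sample JSON was loaded into df):

df.withColumn("DEDUPE_KEY", concat_ws("_", col("names.name")))
  .select("DEDUPE_KEY")
  .show(false)
// +---------------+
// |DEDUPE_KEY     |
// +---------------+
// |John_John_David|
// +---------------+

So this deduplicates rows by their full list of names; to deduplicate on individual names, the explode-based answer above is still needed.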