
Which row will be kept when one uses the dropDuplicates function on a Spark DataFrame? It is not stated in the Spark documentation.

  1. Keep First (according to row order)
  2. Keep Last (according to row order)
  3. Random?

P.S. Assume a distributed YARN environment (not a local master).

edited by zero323
asked by Qmage

2 Answers


TL;DR Keep First (according to row order)

The dropDuplicates operator in Spark SQL creates a logical plan with a Deduplicate operator.

That Deduplicate operator is translated to a First logical operator by Spark SQL's Catalyst Optimizer, which answers your question nicely (!)

You can see the Deduplicate operator in the logical plan below.

// create a Dataset with duplicates
val dups = spark.range(9).map(_ % 3)

val q = dups.dropDuplicates

The following is the logical plan of the q dataset.

scala> println(q.queryExecution.logical.numberedTreeString)
00 Deduplicate [value#64L], false
01 +- SerializeFromObject [input[0, bigint, false] AS value#64L]
02    +- MapElements <function1>, class java.lang.Long, [StructField(value,LongType,true)], obj#63: bigint
03       +- DeserializeToObject staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, cast(id#58L as bigint), true), obj#62: java.lang.Long
04          +- Range (0, 9, step=1, splits=Some(8))

The Deduplicate operator is then translated to the First logical operator (which shows itself as an Aggregate operator after optimizations).

scala> println(q.queryExecution.optimizedPlan.numberedTreeString)
00 Aggregate [value#64L], [value#64L]
01 +- SerializeFromObject [input[0, bigint, false] AS value#64L]
02    +- MapElements <function1>, class java.lang.Long, [StructField(value,LongType,true)], obj#63: bigint
03       +- DeserializeToObject staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, id#58L, true), obj#62: java.lang.Long
04          +- Range (0, 9, step=1, splits=Some(8))

After spending some time reviewing the code of Apache Spark, I found that the dropDuplicates operator is equivalent to groupBy followed by the first function.

first(columnName: String, ignoreNulls: Boolean): Column
Aggregate function: returns the first value of a column in a group.

import org.apache.spark.sql.functions.first
val firsts = dups.groupBy("value").agg(first("value") as "value")
scala> println(firsts.queryExecution.logical.numberedTreeString)
00 'Aggregate [value#64L], [value#64L, first('value, false) AS value#139]
01 +- SerializeFromObject [input[0, bigint, false] AS value#64L]
02    +- MapElements <function1>, class java.lang.Long, [StructField(value,LongType,true)], obj#63: bigint
03       +- DeserializeToObject staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, cast(id#58L as bigint), true), obj#62: java.lang.Long
04          +- Range (0, 9, step=1, splits=Some(8))

scala> firsts.explain
== Physical Plan ==
*HashAggregate(keys=[value#64L], functions=[first(value#64L, false)])
+- Exchange hashpartitioning(value#64L, 200)
   +- *HashAggregate(keys=[value#64L], functions=[partial_first(value#64L, false)])
      +- *SerializeFromObject [input[0, bigint, false] AS value#64L]
         +- *MapElements <function1>, obj#63: bigint
            +- *DeserializeToObject staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, id#58L, true), obj#62: java.lang.Long
               +- *Range (0, 9, step=1, splits=8)

I also think that the dropDuplicates operator may be more performant.
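You can see the difference by comparing the physical plans of the two variants side by side: after optimization, dropDuplicates groups on the key alone with no first aggregate to evaluate, while the explicit groupBy variant carries the first function through both the partial and the final aggregation. A minimal sketch, reusing the dups Dataset from above:

import org.apache.spark.sql.functions.first

val viaDropDuplicates = dups.dropDuplicates
val viaGroupByFirst = dups.groupBy("value").agg(first("value") as "value")

// compare the two physical plans
viaDropDuplicates.explain
viaGroupByFirst.explain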

edited by Santoshi M
answered by Jacek Laskowski
  • seems like a potential performance improvement would be to have an option for an unordered/random dropDuplicates, i.e. one without performing the first – Qmage Jun 23 '17 at 07:02
  • @Qmage I don't know whether the first requires ordering; I doubt it. Thanks for spotting it, and thanks for accepting it as the answer! Appreciated. – Jacek Laskowski Jun 23 '17 at 12:49
  • @JacekLaskowski Would you have an idea of how to choose a random value instead of the first one? – belka Mar 07 '18 at 17:38
  • Why would you want it? – Jacek Laskowski Mar 07 '18 at 21:23
  • @JacekLaskowski for Machine Learning purposes, I don't want to have a bias in my dataset, so if I have rows with the same key, I want to be able to choose either a combination of the two (like mean values), a string concatenation (of labels, for example), or a random row (see the sketch after these comments). See here: https://stackoverflow.com/questions/49157052/homemade-aggregation-spark/49157616 – belka Mar 08 '18 at 07:53
  • `dropDuplicates` gives _"a new Dataset that contains only the unique rows from this Dataset. This is an alias for distinct."_ Why would that matter what rows are dropped if all are the same? I must be missing something. – Jacek Laskowski Mar 08 '18 at 11:09
  • Spark groupBy() and the first() aggregation do not preserve ordering. If this is the implementation of drop_duplicates, you shouldn't count on any ordering being preserved. – Joey Lesh Mar 28 '18 at 17:50
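For the comment above asking how to choose a random row per key instead of the first one: this is not something dropDuplicates does itself, but a minimal sketch is to number the rows of each key in a random order and keep the top one. The rand()-based window below is an illustration, reusing the dups Dataset from the answer above:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rand, row_number}

// number the rows of each key in a random order and keep one row per key
val byRandomOrder = Window.partitionBy("value").orderBy(rand())
val randomPerKey = dups.toDF("value")
  .withColumn("rn", row_number().over(byRandomOrder))
  .where(col("rn") === 1)
  .drop("rn")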

Unless you also use coalesce(1) before dropDuplicates, you may end up with an unexpected row order before the duplicates are removed. See this answer for thorough documentation with examples: https://stackoverflow.com/a/54738843/4166885

I had been wondering why I sometimes ended up with a DataFrame from which the "wrong" rows had been removed; coalesce(1) solved the problem.
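A minimal sketch of that workaround, in Scala to match the accepted answer, with df, "key", and "datestr" as assumed names for illustration:

// impose the row order you care about, then collapse to a single
// partition so that this order is what dropDuplicates encounters
val deduped = df
  .orderBy("datestr")
  .coalesce(1)
  .dropDuplicates("key")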

Edit: since coalesce(1) sometimes isn't an option, my favoured solution is this one from the post linked above:

from pyspark.sql import Window
from pyspark.sql.functions import rank, col, monotonically_increasing_id

# rank the rows of each key by date, with a monotonically increasing id
# as a deterministic tiebreaker, then keep only the top-ranked row per key
window = Window.partitionBy("col1").orderBy("datestr", "tiebreak")
(df_s
 .withColumn("tiebreak", monotonically_increasing_id())
 .withColumn("rank", rank().over(window))
 .filter(col("rank") == 1)
 .drop("rank", "tiebreak")
 .show()
)
answered by Juergen