I'm not sure why this happens, but when I apply dropDuplicates to a sorted DataFrame, the sort order is lost. Compare the following two tables.

The following table is the output of sorted_df.show(); the rows are in sorted order.

+----------+-----------+
|sorted_col|another_col|
+----------+-----------+
|         1|          1|
|         8|          5|
|        15|          1|
|        19|          9|
|        20|          7|
|        27|          9|
|        67|          8|
|        91|          9|
|        91|          7|
|        91|          1|
+----------+-----------+

The following table is the output of sorted_df.dropDuplicates().show(). The rows are no longer sorted, even though it is the same DataFrame.

+----------+-----------+
|sorted_col|another_col|
+----------+-----------+
|        27|          9|
|        67|          8|
|        15|          1|
|        91|          7|
|         1|          1|
|        91|          1|
|         8|          5|
|        91|          9|
|        20|          7|
|        19|          9|
+----------+-----------+

Can someone explain why this happens, and how I can keep the sort order after applying dropDuplicates?

Apache Spark version 3.1.2

xiexieni9527

1 Answer

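dropDuplicates involves a shuffle: rows are repartitioned so that duplicates end up in the same partition, and the order of rows after a shuffle is not guaranteed. The earlier sort is therefore lost. To get sorted output, apply the sort after dropDuplicates rather than before it.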

thebluephantom