
I use Spark 1.6 and am doing an inner join on two DataFrames as follows:

val filtergroup = metric
  .join(filtercndtns, Seq("aggrgn_filter_group_id"), "inner")
  .distinct()

But I keep getting duplicate values in the aggrgn_filter_group_id column. Can you please suggest a solution?

Naveen Yadav

1 Answer


Spark < 2.0

Consider selecting just the column(s) you want to drop duplicates on, running distinct on that projection, and then inner joining the result back on those column(s).

// don't use distinct yet
val filtergroup = metric
  .join(filtercndtns, Seq("aggrgn_filter_group_id"), "inner")

// take the unique aggrgn_filter_group_ids
val uniqueFilterGroups = filtergroup
  .select("aggrgn_filter_group_id")
  .distinct()

// Inner join to remove duplicates from the source dataset
filtergroup.join(uniqueFilterGroups, Seq("aggrgn_filter_group_id"), "inner")

The price is an extra select with distinct and a join, but it should give you the expected result.
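If a single representative row per group is enough for your use case, a pre-2.0 alternative is groupBy with first aggregates. A minimal sketch, assuming a hypothetical metric_value column standing in for the real payload of the joined result:

import org.apache.spark.sql.functions.first

// Keep one arbitrary row per aggrgn_filter_group_id (works on Spark 1.x).
// metric_value is a made-up column name; substitute your real columns.
val onePerGroup = filtergroup
  .groupBy("aggrgn_filter_group_id")
  .agg(first("metric_value").as("metric_value"))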

Spark >= 2.0

The following solution only works with Spark 2.0+, which introduced the dropDuplicates operators and allows dropping duplicates considering only a subset of columns.

Quoting the documentation:

distinct(): Dataset[T] Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates.

distinct and the parameterless dropDuplicates simply drop duplicate rows, comparing every column.

If you're interested in a specific column only, you should use one of the dropDuplicates variants that accepts column names, e.g.

val filtergroup = metric
  .join(filtercndtns, Seq("aggrgn_filter_group_id"), "inner")
  .dropDuplicates("aggrgn_filter_group_id")

When you specify a column or a set of columns, dropDuplicates returns a new Dataset with duplicate rows removed, considering only the subset of columns.
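To see the difference concretely, here is a minimal spark-shell sketch (Spark 2.x; the sample data and the payload column are made up):

val df = Seq(
  (1, "a"),
  (1, "b"),
  (2, "c")
).toDF("aggrgn_filter_group_id", "payload")

df.distinct().count()                               // 3 -- rows differ in payload
df.dropDuplicates("aggrgn_filter_group_id").count() // 2 -- one row per group id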

Jacek Laskowski