Spark < 2.0
Consider distinct on a dataset projected to the column(s) you want to drop duplicates on, followed by an inner join back on those column(s).
// don't use distinct yet
val filtergroup = metric
.join(filtercndtns, Seq("aggrgn_filter_group_id"), "inner")
// take unique aggrgn_filter_group_ids
val uniqueFilterGroups = filtergroup
.select("aggrgn_filter_group_id")
.distinct
// Inner join to remove duplicates from the source dataset
filtergroup.join(uniqueFilterGroups, Seq("aggrgn_filter_group_id"), "inner")
The price is an extra select with distinct and a join, but it should give you the expected result.
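As a minimal, self-contained sketch of that pattern (the toy data and names such as metric_value and cndtn are assumptions for illustration, not from the original; pre-2.0 code would use SQLContext rather than SparkSession):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DistinctJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Toy stand-ins for the metric and filtercndtns datasets
    val metric = Seq((1, "a"), (1, "b"), (2, "c"))
      .toDF("aggrgn_filter_group_id", "metric_value")
    val filtercndtns = Seq((1, "x"), (2, "y"))
      .toDF("aggrgn_filter_group_id", "cndtn")

    val filtergroup = metric
      .join(filtercndtns, Seq("aggrgn_filter_group_id"), "inner")

    // Project to the key column and take distinct values:
    // one row per aggrgn_filter_group_id
    val uniqueFilterGroups = filtergroup
      .select("aggrgn_filter_group_id")
      .distinct

    // Inner join the unique keys back against the source dataset
    val result = filtergroup
      .join(uniqueFilterGroups, Seq("aggrgn_filter_group_id"), "inner")
    result.show()

    sc.stop()
  }
}
```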
Spark >= 2.0
The following solution works only with Spark 2.0+, which came out with support for the dropDuplicates
operators and allows for dropping duplicates considering only a subset of columns.
Quoting the documentation:
distinct(): Dataset[T]
Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates.
Both distinct
and the parameterless dropDuplicates
drop duplicate rows by comparing every column.
If you're only interested in a specific column (or columns), use one of the dropDuplicates
variants that accepts column names, e.g.
val filtergroup = metric
.join(filtercndtns, Seq("aggrgn_filter_group_id"), "inner")
.dropDuplicates("aggrgn_filter_group_id")
When you specify a column or a set of columns, dropDuplicates
returns a new Dataset with duplicate rows removed, considering only that subset of columns.
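To see the difference between full-row and subset comparison, here is a hedged, self-contained sketch (the toy data and the metric_value column are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

object DropDuplicatesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (1, "b"), (2, "c"))
      .toDF("aggrgn_filter_group_id", "metric_value")

    // Full-row comparison: all three rows differ in metric_value,
    // so nothing is dropped
    println(df.distinct.count()) // 3

    // Subset comparison: one (arbitrary) row survives
    // per aggrgn_filter_group_id
    println(df.dropDuplicates("aggrgn_filter_group_id").count()) // 2

    spark.stop()
  }
}
```

Note that when several rows share the same key, dropDuplicates keeps an arbitrary one of them; if you need a specific row (e.g. the latest), combine it with an ordering step first.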