
I have this Spark DataFrame:

+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|AUTRE|     2|null|    08:58:00|    23:29:00|
|TDR|  QWA|     3|null|    08:57:00|    23:28:00|
|ALT| TEST|     4|null|    08:56:00|    23:27:00|
|ALT|  QWA|     6|null|    08:55:00|    23:26:00|
|ALT|  QWA|     2|null|    08:54:00|    23:25:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+

I want to get a new DataFrame containing only the rows that are not unique with respect to the three fields "ID", "ID2" and "Number".

That is, I want this DataFrame:

+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+

Or, alternatively, a DataFrame with all of the duplicate rows:

+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|  QWA|     6|null|    08:55:00|    23:26:00|
|ALT|  QWA|     2|null|    08:54:00|    23:25:00|
|ALT|  QWA|     2|null|    08:53:00|    23:24:00|
+---+-----+------+----+------------+------------+
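For reference, the sample data above can be rebuilt with something like the following sketch (not part of the original question; the hour columns are assumed to be plain strings, and an explicit schema is used because the all-null Name column would otherwise break type inference):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data shown above (types are an assumption)
df = spark.createDataFrame(
    [
        ('ALT', 'QWA',   6, None, '08:59:00', '23:30:00'),
        ('ALT', 'AUTRE', 2, None, '08:58:00', '23:29:00'),
        ('TDR', 'QWA',   3, None, '08:57:00', '23:28:00'),
        ('ALT', 'TEST',  4, None, '08:56:00', '23:27:00'),
        ('ALT', 'QWA',   6, None, '08:55:00', '23:26:00'),
        ('ALT', 'QWA',   2, None, '08:54:00', '23:25:00'),
        ('ALT', 'QWA',   2, None, '08:53:00', '23:24:00'),
    ],
    'ID string, ID2 string, Number int, Name string, '
    'Opening_Hour string, Closing_Hour string',
)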

3 Answers


One way to do this is by using a pyspark.sql.Window to add a column that counts the number of duplicates for each row's ("ID", "ID2", "Number") combination. Then select only the rows where the number of duplicates is greater than 1.

import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy('ID', 'ID2', 'Number')
df.select('*', f.count('ID').over(w).alias('dupeCount'))\
    .where('dupeCount > 1')\
    .drop('dupeCount')\
    .show()
#+---+---+------+----+------------+------------+
#| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
#+---+---+------+----+------------+------------+
#|ALT|QWA|     2|null|    08:54:00|    23:25:00|
#|ALT|QWA|     2|null|    08:53:00|    23:24:00|
#|ALT|QWA|     6|null|    08:59:00|    23:30:00|
#|ALT|QWA|     6|null|    08:55:00|    23:26:00|
#+---+---+------+----+------------+------------+

I used pyspark.sql.functions.count() to count the number of items in each group. This returns a DataFrame containing all of the duplicates (the second output you showed).
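The same filter can also be written with column expressions instead of a SQL string; a small equivalent sketch, reusing the w window defined above:

# Equivalent filter using column expressions instead of a SQL string
df.withColumn('dupeCount', f.count('ID').over(w))\
    .filter(f.col('dupeCount') > 1)\
    .drop('dupeCount')\
    .show()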

If you wanted to get only one row per ("ID", "ID2", "Number") combination, you could do so using another Window to order the rows.

For example, below I add another column with the row_number and select only the rows where the duplicate count is greater than 1 and the row number is equal to 1. This guarantees one row per group.

w2 = Window.partitionBy('ID', 'ID2', 'Number').orderBy('ID', 'ID2', 'Number')
df.select(
        '*',
        f.count('ID').over(w).alias('dupeCount'),
        f.row_number().over(w2).alias('rowNum')
    )\
    .where('(dupeCount > 1) AND (rowNum = 1)')\
    .drop('dupeCount', 'rowNum')\
    .show()
#+---+---+------+----+------------+------------+
#| ID|ID2|Number|Name|Opening_Hour|Closing_Hour|
#+---+---+------+----+------------+------------+
#|ALT|QWA|     2|null|    08:54:00|    23:25:00|
#|ALT|QWA|     6|null|    08:59:00|    23:30:00|
#+---+---+------+----+------------+------------+
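Note that w2 orders by the partition keys themselves, so every row in a group ties and the surviving row is effectively arbitrary (row_number() just needs some ordering). If you want a deterministic representative, order the second window by another column instead; a sketch that keeps the earliest Opening_Hour per group (the name w3 is just for illustration):

# Keep the row with the smallest Opening_Hour in each duplicated group
w3 = Window.partitionBy('ID', 'ID2', 'Number').orderBy('Opening_Hour')
df.select(
        '*',
        f.count('ID').over(w).alias('dupeCount'),
        f.row_number().over(w3).alias('rowNum')
    )\
    .where('(dupeCount > 1) AND (rowNum = 1)')\
    .drop('dupeCount', 'rowNum')\
    .show()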

Here is a way to do it without a Window.

A DataFrame with the duplicate rows, minus one occurrence of each combination (exceptAll, available since Spark 2.4, subtracts the rows kept by drop_duplicates and returns the remaining copies):

df.exceptAll(df.drop_duplicates(['ID', 'ID2', 'Number'])).show()
# +---+---+------+------------+------------+
# | ID|ID2|Number|Opening_Hour|Closing_Hour|
# +---+---+------+------------+------------+
# |ALT|QWA|     2|    08:53:00|    23:24:00|
# |ALT|QWA|     6|    08:55:00|    23:26:00|
# +---+---+------+------------+------------+

A DataFrame with all of the duplicate rows (using a left_anti join against the combinations that occur only once):

df.join(df.groupBy('ID', 'ID2', 'Number')\
          .count().where('count = 1').drop('count'),
        on=['ID', 'ID2', 'Number'],
        how='left_anti').show()
# +---+---+------+------------+------------+
# | ID|ID2|Number|Opening_Hour|Closing_Hour|
# +---+---+------+------------+------------+
# |ALT|QWA|     2|    08:54:00|    23:25:00|
# |ALT|QWA|     2|    08:53:00|    23:24:00|
# |ALT|QWA|     6|    08:59:00|    23:30:00|
# |ALT|QWA|     6|    08:55:00|    23:26:00|
# +---+---+------+------------+------------+
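If you instead want the first output from the question (one row per duplicated combination) without a Window, the same grouped count can be combined with a left_semi join and drop_duplicates; a sketch (the name keys_with_dupes is just for illustration):

# Combinations that occur more than once
keys_with_dupes = df.groupBy('ID', 'ID2', 'Number')\
                    .count().where('count > 1').drop('count')

# Keep only rows with duplicated keys, then one representative per key
df.join(keys_with_dupes, on=['ID', 'ID2', 'Number'], how='left_semi')\
  .drop_duplicates(['ID', 'ID2', 'Number'])\
  .show()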

To extend pault's really great answer: I often need to subset a DataFrame to only the entries that are repeated x times, and since I need this so often, I turned it into a function that I import along with lots of other helper functions at the beginning of my scripts:

import pyspark.sql.functions as f
from pyspark.sql import Window
def get_entries_with_frequency(df, cols, num):
  """
  This function will filter the dataframe df down to all the rows that
  have the same values in cols exactly num times. Example: if num=3 and
  cols="cartype", then the function will only return rows whose cartype occurs
  exactly 3 times in the dataset. If the column "cartype" contains the following:
  ["Mazda", "Seat", "Seat", "VW", "Mercedes", "VW", "VW", "Mercedes", "Seat"],
  then the function will only return rows containing "VW" or "Seat" 
  since these occur exactly 3 times.

  df: Pyspark dataframe
  cols: Either string column name or list of strings for multiple columns.
  num: int - The exact number of times a value (or combination of values,
       if cols is a list) has to appear in df.
  """
  if isinstance(cols, str):
    cols = [cols]
  w = Window.partitionBy(cols)
  return df.select('*', f.count(cols[0]).over(w).alias('dupeCount'))\
           .where("dupeCount = {}".format(num))\
           .drop('dupeCount')
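
For example, applied to the question's DataFrame, a call like the one below (num=2, since each duplicated combination appears exactly twice there) would be expected to return the four duplicate rows:

# Hypothetical usage with the question's DataFrame
get_entries_with_frequency(df, ['ID', 'ID2', 'Number'], 2).show()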
    Just think how awesome this answer would be if `get_entries_with_frequency` had docstrings for the parameters, so somebody besides @Thomas could use it. – pauljohn32 Feb 25 '22 at 05:46
  • @pauljohn32 it has been a while since I wrote this, and I find it kind of hard to explain what exactly it does, but I did my best ;-) – Thomas Feb 25 '22 at 14:14