
I want to keep the last record, not the first. However, the keep="last" option does not seem to exist in PySpark. For example, on the following:

from pyspark.sql import Row
df = sc.parallelize([ \
    Row(name='Alice', age=5, height=80), \
    Row(name='Alice', age=5, height=80), \
    Row(name='Alice', age=10, height=80)]).toDF()
df.dropDuplicates().show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
| 10|    80|Alice|
+---+------+-----+

And I run:

df.dropDuplicates(['name', 'height']).show()
+---+------+-----+
|age|height| name|
+---+------+-----+
|  5|    80|Alice|
+---+------+-----+

I would like the following:

+---+------+-----+
|age|height| name|
+---+------+-----+
| 10|    80|Alice|
+---+------+-----+

keep="last" does not appear to be an option in PySpark's dropDuplicates. How can I keep the last record instead?
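For comparison, this is the pandas behaviour I'm referring to (a minimal sketch with the same data as a pandas DataFrame):

import pandas as pd

pdf = pd.DataFrame({'name': ['Alice', 'Alice', 'Alice'],
                    'age': [5, 5, 10],
                    'height': [80, 80, 80]})
# pandas keeps the last occurrence within each duplicate group
pdf.drop_duplicates(subset=['name', 'height'], keep='last')
#     name  age  height
# 2  Alice   10      80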

OSUKevin

2 Answers


The common way to do this sort of task is to compute a rank over a suitable partitioning and ordering, and keep the rows with rank = 1:

from pyspark.sql import functions as F, Window

# rank rows within each (name, height) group by descending age, then keep the top-ranked row
df2 = df.withColumn(
    'rank',
    F.rank().over(Window.partitionBy('name', 'height').orderBy(F.desc('age')))
).filter('rank = 1').drop('rank')

df2.show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice| 10|    80|
+-----+---+------+

Another way is to use last, but since Spark DataFrames have no inherent row ordering, the result is nondeterministic:

import pyspark.sql.functions as F

# take the last observed value of every non-key column within each group
# (which value is "last" is not guaranteed without an explicit ordering)
df2 = df.groupBy('name', 'height').agg(
    *[F.last(c).alias(c) for c in df.columns if c not in ['name', 'height']]
)

df2.show()
+-----+------+---+
| name|height|age|
+-----+------+---+
|Alice|    80| 10|
+-----+------+---+
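
If "last" should mean the last row in the original load order, a minimal sketch is to tag each row with an id as soon as the DataFrame is created and keep the row with the largest id per group. This assumes monotonically_increasing_id reflects that load order, which holds when the data is created in order and not reshuffled first; the _row_id and _rn column names here are just placeholders:

from pyspark.sql import functions as F, Window

# tag each row with an increasing id right after the DataFrame is created,
# so "last" can refer to the load order (this is the assumption being made)
df_ordered = df.withColumn('_row_id', F.monotonically_increasing_id())

# within each (name, height) group, keep the row with the largest id
w = Window.partitionBy('name', 'height').orderBy(F.desc('_row_id'))
df2 = (df_ordered
       .withColumn('_rn', F.row_number().over(w))
       .filter('_rn = 1')
       .drop('_row_id', '_rn'))

Using row_number rather than rank guarantees exactly one row per group even when there are ties.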
mck
  • What if you just want the last one and it's a string? Or it isn't in any particular order? This was just an example and it happened to be age. Meaning I just want the last record kept regardless of any cardinal ordering. – OSUKevin Feb 04 '21 at 16:35
  • The "last" record is meaningless in Spark because Spark dataframes are unordered collections of rows. You have to order it by some column for the "last" to be well-defined. @OSUKevin – mck Feb 04 '21 at 16:36
  • @OSUKevin I added a way to use the `last` function, which might be what you wanted, but is nondeterministic. – mck Feb 04 '21 at 16:38
  • I think really it is the "name" that I need to focus on. I want to keep the last "Alice" regardless of age. Pretend that age and height are strings and not really any kind of rank. I just want the last row observation of Alice, if that makes sense. Can we create a system for "counting" Alices and then drop all but the last one? Pretend there are other names like Ben, James, etc. with multiple entries and I just want the last one. – OSUKevin Feb 04 '21 at 17:45
  • @OSUKevin then you can use something similar to the last code snippet, e.g. `df2 = df.groupBy('name').agg(*[F.last(c).alias(c) for c in df.columns if c != 'name'])` – mck Feb 04 '21 at 18:00

groupBy is a more idiomatic approach to this problem:

from pyspark.sql import functions as F

# keep the maximum age within each (name, height) group
df.groupBy(['name', 'height'])\
    .agg(F.max('age').alias('age'))\
    .show()
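
With the sample data from the question this keeps the row with the larger age:

+-----+------+---+
| name|height|age|
+-----+------+---+
|Alice|    80| 10|
+-----+------+---+

Note that this relies on age being orderable; for a purely positional notion of "last", see the discussion under the first answer.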
Alex Grounds