3

I have a PySpark dataframe that has `starttime` and `stoptime` columns, plus additional columns whose values get updated:

|startime  |stoptime  |hour  |minute  |sec  |sip          |dip            |sport|dport|proto|pkt |byt |
|1504766585|1504801216|16    |20      |16   |192.168.0.11 |23.204.108.58  |51249|80   |6    |0   |0   |
|1504766585|1504801216|16    |20      |16   |192.168.0.11 |23.204.108.58  |51249|80   |6    |0   |0   |
|1504781751|1504801216|16    |20      |16   |192.168.0.11 |23.72.38.96    |51252|80   |6    |0   |0   |
|1504781751|1504801216|16    |20      |16   |192.168.0.11 |23.72.38.96    |51252|80   |6    |0   |0   |
|1504766585|1504801336|16    |22      |16   |192.168.0.11 |23.204.108.58  |51249|80   |6    |0   |0   |
|1504766585|1504801336|16    |22      |16   |192.168.0.11 |23.204.108.58  |51249|80   |6    |0   |0   |
|1504781751|1504801336|16    |22      |16   |192.168.0.11 |23.72.38.96    |51252|80   |6    |0   |0   |
|1504781751|1504801336|16    |22      |16   |192.168.0.11 |23.72.38.96    |51252|80   |6    |0   |0   |

In this example I want to select all rows with the latest `stoptime`; all the other column values are duplicates.


2 Answers

4

I'm guessing you want to keep the latest record for every `sport`. You should use a window function to determine the latest record for each partition:

import pyspark.sql.functions as psf
from pyspark.sql import Window
w = Window.partitionBy("sport").orderBy(psf.desc("stoptime"))

df.withColumn("rn", psf.row_number().over(w)).filter("rn = 1").drop("rn").show()

    +----------+----------+----+---+---+------------+-------------+-----+-----+-----+---+---+
    |  startime|  stoptime|hour|min|sec|         sip|          dip|sport|dport|proto|pkt|byt|
    +----------+----------+----+---+---+------------+-------------+-----+-----+-----+---+---+
    |1504781751|1504801336|  16| 22| 16|192.168.0.11|  23.72.38.96|51252|   80|    6|  0|  0|
    |1504766585|1504801336|  16| 22| 16|192.168.0.11|23.204.108.58|51249|   80|    6|  0|  0|
    +----------+----------+----+---+---+------------+-------------+-----+-----+-----+---+---+

You'll end up with as many records as there are distinct values of `sport`.
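
As a quick sanity check (a sketch reusing `df` and `w` from above, nothing beyond what is already defined), the deduplicated row count should match the number of distinct `sport` values:

# one row is kept per distinct sport, so the two counts should agree
df.select("sport").distinct().count()
df.withColumn("rn", psf.row_number().over(w)).filter("rn = 1").count()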

If you want the latest `stoptime` for the whole table, without partitioning, you can remove the `partitionBy` and use `dense_rank` instead (rows with the same `stoptime` get the same rank):

w = Window.orderBy(psf.desc("stoptime"))

df.withColumn("rn", psf.dense_rank().over(w)).filter("rn = 1").drop("rn").show()

    +----------+----------+----+---+---+------------+-------------+-----+-----+-----+---+---+
    |  startime|  stoptime|hour|min|sec|         sip|          dip|sport|dport|proto|pkt|byt|
    +----------+----------+----+---+---+------------+-------------+-----+-----+-----+---+---+
    |1504766585|1504801336|  16| 22| 16|192.168.0.11|23.204.108.58|51249|   80|    6|  0|  0|
    |1504766585|1504801336|  16| 22| 16|192.168.0.11|23.204.108.58|51249|   80|    6|  0|  0|
    |1504781751|1504801336|  16| 22| 16|192.168.0.11|  23.72.38.96|51252|   80|    6|  0|  0|
    |1504781751|1504801336|  16| 22| 16|192.168.0.11|  23.72.38.96|51252|   80|    6|  0|  0|
    +----------+----------+----+---+---+------------+-------------+-----+-----+-----+---+---+
MaFF
  • I wanted rows with same "startime, hour, min, sec, sip, dip, sport, dport, proto" but with the latest 'stoptime'. That in turn will give me the max byt and pkt – user2825083 Sep 11 '17 at 00:13
  • If you need the maximum `stoptime` value for each distinct `startime, hour, min, sec, sip, dip, sport, dport, proto`, you just have to put all of these columns in the `partitionBy` – MaFF Sep 11 '17 at 05:53
  • Does this look right? `import pyspark.sql.functions as psf from pyspark.sql import Window w = Window.partitionBy("startime","hour","sip","dip","sport","dport","proto").orderBy(psf.desc("stoptime")) df = dataframe.withColumn("rn", psf.row_number().over(w)).filter("rn = 1").drop("rn") df.show()` – user2825083 Sep 11 '17 at 07:30
  • It does look right (a formatted sketch follows these comments); do check the output to see if it matches your expectations. – MaFF Sep 11 '17 at 18:23
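
Formatted for readability, the window from the comment above would look like this (a sketch using `dataframe` as named in the comment, not re-run here):

import pyspark.sql.functions as psf
from pyspark.sql import Window

# one partition per distinct (startime, hour, sip, dip, sport, dport, proto); keep the row with the latest stoptime
w = Window.partitionBy("startime", "hour", "sip", "dip", "sport", "dport", "proto").orderBy(psf.desc("stoptime"))
df = dataframe.withColumn("rn", psf.row_number().over(w)).filter("rn = 1").drop("rn")
df.show()
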
0

from pyspark.sql.functions import col

df = sc.parallelize([(1504766585,1504801216,16,20,16,'192.168.0.11','23.204.108.58',51249,80,6,0,0),
                     (1504766585,1504801216,16,20,16,'192.168.0.11','23.204.108.58',51249,80,6,0,0),
                     (1504781751,1504801216,16,20,16,'192.168.0.11','23.72.38.96',  51252,80,6,0,0),
                     (1504781751,1504801216,16,20,16,'192.168.0.11','23.72.38.96',  51252,80,6,0,0),
                     (1504766585,1504801336,16,22,16,'192.168.0.11','23.204.108.58',51249,80,6,0,0),
                     (1504766585,1504801336,16,22,16,'192.168.0.11','23.204.108.58',51249,80,6,0,0),
                     (1504781751,1504801336,16,22,16,'192.168.0.11','23.72.38.96',  51252,80,6,0,0),
                     (1504781751,1504801336,16,22,16,'192.168.0.11','23.72.38.96',  51252,80,6,0,0)]).\
    toDF(["startime","stoptime","hour","min","sec","sip","dip","sport","dport","proto","pkt","byt"])

# keep only the rows whose stoptime equals the overall maximum, then drop exact duplicates
df1 = df.where(col("stoptime") == df.select("stoptime").rdd.max()[0]).distinct()
df1.show()

Output is

+----------+----------+----+---+---+------------+-------------+-----+-----+-----+---+---+
|  startime|  stoptime|hour|min|sec|         sip|          dip|sport|dport|proto|pkt|byt|
+----------+----------+----+---+---+------------+-------------+-----+-----+-----+---+---+
|1504766585|1504801336|  16| 22| 16|192.168.0.11|23.204.108.58|51249|   80|    6|  0|  0|
|1504781751|1504801336|  16| 22| 16|192.168.0.11|  23.72.38.96|51252|   80|    6|  0|  0|
+----------+----------+----+---+---+------------+-------------+-----+-----+-----+---+---+
Prem
  • collecting pyspark values into python objects can have unexpected results for `float` since they are not approximated the same way. I would not recommend doing it. – MaFF Sep 09 '17 at 09:12
  • @Marie care to explain more if you don't mind? I couldn't find any non-spark objects used in here? – Prem Sep 09 '17 at 14:45
  • `df.select("stoptime").rdd.max()[0]` is a python `int` – MaFF Sep 09 '17 at 16:31
  • You're also converting to an RDD for the aggregation, which is less efficient than a DataFrame aggregation (a sketch follows at the end of this thread) – MaFF Sep 09 '17 at 16:33
  • @Marie I think we should always talk sensible and be ready to defend any theoretical assumption before posting something (e.g. first you talk about float then int then rdd vs df). Shouldn't we concentrate on helping community members rather than this sort of dialogue exchanges? (BTW - this [benchmark testing](https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html) might interest you). Pls don't intimidate OP with so many comments. Be a sport!!! – Prem Sep 10 '17 at 12:23
  • oh no no, I guess messages can be misleading... It wasn't my intention to come off as demeaning, I was also just trying to help. I spent time trying to find a reproducible example for `float`, it's a problem I had to solve for a colleague of mine and I couldn't reproduce it. I also ran into this thread https://stackoverflow.com/questions/46122846/pyspark-inconsistency-in-converting-timestamp-to-integer-in-dataframe a couple of days ago so I knew other people ran into this conversion problem. I was just advising not to go back and forth between python and spark if it can be avoided. Sorry again – MaFF Sep 10 '17 at 14:01
  • Thank you for the suggestions. Will this also work? `from pyspark.sql.functions import col, max as max_ dataframe.groupBy(dataframe.trhour, dataframe.trminute, dataframe.trsec, dataframe.starttime, dataframe.sip, dataframe.dip, dataframe.sport, dataframe.dport, dataframe.proto, dataframe.ipkt, dataframe.ibyt).agg(max_('stoptime'))` – user2825083 Sep 11 '17 at 00:05
  • @user2825083 This should work but wouldn't it give maximum value rows in 'each' group? – Prem Sep 11 '17 at 05:51
  • @Prem This solution will give only rows with max stoptime. – user2825083 Sep 11 '17 at 07:35
  • But in your original post it sounded different. It looks like a case of incomplete requirement gathering :) - I thought that you only wanted the rows having the latest 'stoptime' (irrespective of group!!!). I would suggest updating the question with the expected output so that other users don't get confused while looking for an answer to their problem. Thanks! – Prem Sep 11 '17 at 07:52
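
For reference, hedged sketches of the DataFrame-side approaches discussed in the comments above (they reuse `df` from this answer and its column names; not re-run here):

import pyspark.sql.functions as F

# whole-table maximum stoptime via a DataFrame aggregation instead of .rdd.max()
max_stop = df.agg(F.max("stoptime").alias("max_stop")).first()["max_stop"]
df.where(F.col("stoptime") == max_stop).distinct().show()

# per-group maximum stoptime, along the lines of the groupBy suggestion above
df.groupBy("startime", "hour", "min", "sec", "sip", "dip", "sport", "dport", "proto", "pkt", "byt") \
  .agg(F.max("stoptime").alias("stoptime")) \
  .show()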