
I have a PySpark dataframe mydf and am grouping by two columns (code and col1) to obtain, for each code, the row with the highest distinct count of a third column (newid).

Eg: mydf

code   col1   newid
100     MNO      1
100     MNO      2
230     LLL      3
245     TTE      4
230     LLL      5
230     LIO      6
100     FGH      7
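
For reference, a minimal sketch to reproduce mydf (assuming a local SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
mydf = spark.createDataFrame(
    [(100, "MNO", 1), (100, "MNO", 2), (230, "LLL", 3), (245, "TTE", 4),
     (230, "LLL", 5), (230, "LIO", 6), (100, "FGH", 7)],
    ["code", "col1", "newid"],
)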

Expected Result:

code   col1   count(DISTINCT newid)
100     MNO      2
230     LLL      2
245     TTE      1

Current results using the code below:

from pyspark.sql.functions import count, countDistinct, desc

mydf.groupBy("code", "col1").agg(count("newid"), countDistinct("newid")) \
    .orderBy(desc("count(DISTINCT newid)"))

code   col1   count(DISTINCT newid)
100     MNO      2
230     LLL      2
245     TTE      1
100     FGH      1
230     LIO      1
  • Try adding dropDuplicates to your result df – BENY Jun 04 '18 at 17:51
  • @Wen dropDuplicates doesn't take any arguments in PySpark? Eventually, after the orderBy, I want to keep only the first occurrence. – thecoder Jun 04 '18 at 18:09
  • Like mydf.groupBy("code","col1").agg(count("newid"), countDistinct("newid")).orderBy(desc("count(DISTINCT newid)")).dropDuplicates(["code"]) – BENY Jun 04 '18 at 18:16
  • I think you're asking the same thing as another [recent question](https://stackoverflow.com/questions/50685522/remove-duplicate-rows-regardless-of-new-information-pyspark). The solution is the same as what Wen suggested: `mydf.groupBy(...).agg(...)...dropDuplicates(subset=["code"])` – pault Jun 04 '18 at 18:28
  • thanks @Wen and @pault! – thecoder Jun 04 '18 at 20:19
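
Putting the comments together, a sketch of the suggested approach (the names result, count_newid, and distinct_newid are illustrative; aliasing the aggregates avoids relying on Spark's auto-generated column names):

from pyspark.sql.functions import count, countDistinct, desc

result = (
    mydf.groupBy("code", "col1")
        .agg(count("newid").alias("count_newid"),
             countDistinct("newid").alias("distinct_newid"))
        .orderBy(desc("distinct_newid"))      # highest distinct count first
        .dropDuplicates(subset=["code"])      # keep one row per code
)
result.show()

Note that dropDuplicates is not guaranteed to keep the first row of the preceding orderBy (Spark may pick an arbitrary row per key), so this reflects the commenters' suggestion rather than a fully deterministic solution.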
