
I have a PySpark dataframe mydf and am grouping by two columns (code and col1) to obtain, for each code, the row with the highest distinct count of a third column (newid).

Eg: mydf

code   col1   newid
100     MNO      1
100     MNO      2
230     LLL      3
245     TTE      4
230     LLL      5
230     LIO      6
100     FGH      7
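
For reference, a minimal sketch to reproduce mydf (assuming a local SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
mydf = spark.createDataFrame(
    [(100, "MNO", 1), (100, "MNO", 2), (230, "LLL", 3), (245, "TTE", 4),
     (230, "LLL", 5), (230, "LIO", 6), (100, "FGH", 7)],
    ["code", "col1", "newid"],
)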

Expected Result:

code   col1   count(DISTINCT newid)
100     MNO      2
230     LLL      2
245     TTE      1

Current results using the code below:

from pyspark.sql.functions import count, countDistinct, desc

mydf.groupBy("code", "col1").agg(count("newid"), countDistinct("newid")) \
    .orderBy(desc("count(DISTINCT newid)"))

code   col1   count(DISTINCT newid)
100     MNO      2
230     LLL      2
245     TTE      1
100     FGH      1
230     LIO      1
  • Try adding dropDuplicates to your result df – BENY Jun 04 '18 at 17:51
  • @Wen dropDuplicates doesn't take any arguments in PySpark? Eventually, after the orderBy, I want to keep only the first occurrence. – thecoder Jun 04 '18 at 18:09
  • Like mydf.groupBy("code","col1").agg(count("newid"), countDistinct("newid")).orderBy(desc("count(DISTINCT newid)")).dropDuplicates(["code"]) – BENY Jun 04 '18 at 18:16
  • I think you're asking the same thing as another [recent question](https://stackoverflow.com/questions/50685522/remove-duplicate-rows-regardless-of-new-information-pyspark). The solution is the same as what Wen suggested: `mydf.groupBy(...).agg(...)...dropDuplicates(subset=["code"])` – pault Jun 04 '18 at 18:28
  • thanks @Wen and @pault! – thecoder Jun 04 '18 at 20:19
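
Putting the comments together, a sketch of the suggested approach (the names result, count_newid, and distinct_newid are illustrative; aliasing the aggregates avoids relying on Spark's auto-generated column names):

from pyspark.sql.functions import count, countDistinct, desc

result = (
    mydf.groupBy("code", "col1")
        .agg(count("newid").alias("count_newid"),
             countDistinct("newid").alias("distinct_newid"))
        .orderBy(desc("distinct_newid"))      # highest distinct count first
        .dropDuplicates(subset=["code"])      # keep one row per code
)
result.show()

Note that dropDuplicates is not guaranteed to keep the first row of the preceding orderBy (Spark may pick an arbitrary row per key), so this reflects the commenters' suggestion rather than a fully deterministic solution.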
