
I am trying to count the number of distinct Date values for each unique ID in PySpark.

+-------------------+----------+
|               Date|        ID|
+-------------------+----------+
|2022-03-19 00:00:00|   Ax3838J|
|2022-03-11 00:00:00|   Ax3838J|
|2021-11-01 00:00:00|   Ax3838J|
|2021-10-27 00:00:00|   Ax3838J|
|2021-10-25 00:00:00|   Bz3838J|
|2021-10-22 00:00:00|   Bz3838J|
|2021-10-18 00:00:00|   Bz3838J|
|2021-10-15 00:00:00|   Rr742uL|
|2021-09-22 00:00:00|   Rr742uL|
+-------------------+----------+

When I tried

df.groupBy('ID').count('Date').show()

I got the error _api() takes 1 positional argument but 2 were given, which makes sense, but I am not sure what other techniques exist in PySpark for counting like this.
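
For reference, a minimal sketch of why this fails, assuming df is the DataFrame above: GroupedData.count() takes no column argument (it always returns the row count per group), while per-column counts go through agg with pyspark.sql.functions.count:

from pyspark.sql import functions as F

# count() on grouped data takes no arguments; it returns the number of rows per group
df.groupBy('ID').count().show()

# to count a specific column, use F.count inside agg();
# note this counts non-null values, not distinct ones
df.groupBy('ID').agg(F.count('Date').alias('Date_count')).show()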

How do I count unique Date values per ID, starting from something like this:

df.groupBy('ID').count().show()

Expected output:

+-----+-------+
|count|     ID|
+-----+-------+
|    4|Ax3838J|
|    3|Bz3838J|
|    2|Rr742uL|
+-----+-------+
sargupta

2 Answers

Please find a working version that produces the expected output. I am running this code on Spark 3.

from pyspark.sql.functions import countDistinct

# one [Date, ID] pair per row; both Rr* rows use the same ID,
# matching the df.show() output below
data = [
    ["2022-03-19 00:00:00", "Ax3838J"],
    ["2022-03-11 00:00:00", "Ax3838J"],
    ["2021-11-01 00:00:00", "Ax3838J"],
    ["2021-10-27 00:00:00", "Ax3838J"],
    ["2021-10-25 00:00:00", "Bz3838J"],
    ["2021-10-22 00:00:00", "Bz3838J"],
    ["2021-10-18 00:00:00", "Bz3838J"],
    ["2021-10-15 00:00:00", "Rr742uL"],
    ["2021-09-22 00:00:00", "Rr742uL"],
]
df = spark.createDataFrame(data, ['Date', 'ID'])
df.show()
+-------------------+-------+
|               Date|     ID|
+-------------------+-------+
|2022-03-19 00:00:00|Ax3838J|
|2022-03-11 00:00:00|Ax3838J|
|2021-11-01 00:00:00|Ax3838J|
|2021-10-27 00:00:00|Ax3838J|
|2021-10-25 00:00:00|Bz3838J|
|2021-10-22 00:00:00|Bz3838J|
|2021-10-18 00:00:00|Bz3838J|
|2021-10-15 00:00:00|Rr742uL|
|2021-09-22 00:00:00|Rr742uL|
+-------------------+-------+

df.groupBy("ID").agg(countDistinct("Date").alias("count")).show()
+-------+-----+
|     ID|count|
+-------+-----+
|Rr742uL|    2|
|Ax3838J|    4|
|Bz3838J|    3|
+-------+-----+
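
An equivalent approach, assuming the same df as above, is to drop duplicate (ID, Date) pairs first and then use the plain row count per group:

# deduplicate (ID, Date) pairs, then count rows per ID;
# gives the same result as countDistinct above
df.dropDuplicates(['ID', 'Date']).groupBy('ID').count().show()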

Please let me know if you need any more help, and if this solves your problem, please accept the answer.

Mahesh Gupta
  • Try it now, it should resolve the issue – Mahesh Gupta Mar 29 '22 at 12:32
  • Perfect, thanks. Btw, if you want more, here is an unresolved one: https://stackoverflow.com/questions/71658675/merge-two-dataframes-with-conditions-in-pyspark/71659117#71659117 – sargupta Mar 29 '22 at 12:34
  • I believe this link points back here. You can also upvote the answer, as you have already accepted it – Mahesh Gupta Mar 29 '22 at 12:36
  • Updated the link. For reference, this was the initial problem, which is a subproblem of the one linked above: https://stackoverflow.com/questions/71647549/resampling-datetime-by-date-in-pyspark/71649076?noredirect=1#comment126629011_71649076 – sargupta Mar 29 '22 at 12:37

Try this:

from pyspark.sql.functions import countDistinct

df.groupBy('ID').agg(countDistinct('Date').alias('count')).show()
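
If you are on Spark 3.2 or later, the snake_case name count_distinct should also be available (countDistinct is kept as an alias), so an equivalent spelling would be:

from pyspark.sql.functions import count_distinct

# same aggregation under the newer snake_case name
df.groupBy('ID').agg(count_distinct('Date').alias('count')).show()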