
I am trying to count the number of distinct Date values for each unique ID in PySpark.

+-------------------+----------+
|               Date|        ID|
+-------------------+----------+
|2022-03-19 00:00:00|   Ax3838J|
|2022-03-11 00:00:00|   Ax3838J|
|2021-11-01 00:00:00|   Ax3838J|
|2021-10-27 00:00:00|   Ax3838J|
|2021-10-25 00:00:00|   Bz3838J|
|2021-10-22 00:00:00|   Bz3838J|
|2021-10-18 00:00:00|   Bz3838J|
|2021-10-15 00:00:00|   Rr742uL|
|2021-09-22 00:00:00|   Rr742uL|
+-------------------+----------+

When I tried

df.groupBy('ID').count('Date').show()

I got the error _api() takes 1 positional argument but 2 were given, which makes sense, but I am not sure what other techniques exist in PySpark for counting like this.
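
For reference, a minimal sketch of why this fails, assuming df is the DataFrame above: GroupedData.count() takes no column argument (it always returns the row count per group), while per-column counts go through agg with pyspark.sql.functions.count:

from pyspark.sql import functions as F

# count() on grouped data takes no arguments; it returns the number of rows per group
df.groupBy('ID').count().show()

# to count a specific column, use F.count inside agg();
# note this counts non-null values, not distinct ones
df.groupBy('ID').agg(F.count('Date').alias('Date_count')).show()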

How do I count unique Date values per ID, starting from something like this:

df.groupBy('ID').count().show()

Expected output:

+-----+-------+
|count|     ID|
+-----+-------+
|    4|Ax3838J|
|    3|Bz3838J|
|    2|Rr742uL|
+-----+-------+
sargupta

2 Answers

Please find a working version that produces the expected output. I am running this code on Spark 3.

from pyspark.sql.functions import countDistinct

# one [Date, ID] pair per row; both Rr* rows use the same ID,
# matching the df.show() output below
data = [
    ["2022-03-19 00:00:00", "Ax3838J"],
    ["2022-03-11 00:00:00", "Ax3838J"],
    ["2021-11-01 00:00:00", "Ax3838J"],
    ["2021-10-27 00:00:00", "Ax3838J"],
    ["2021-10-25 00:00:00", "Bz3838J"],
    ["2021-10-22 00:00:00", "Bz3838J"],
    ["2021-10-18 00:00:00", "Bz3838J"],
    ["2021-10-15 00:00:00", "Rr742uL"],
    ["2021-09-22 00:00:00", "Rr742uL"],
]
df = spark.createDataFrame(data, ['Date', 'ID'])
df.show()
+-------------------+-------+
|               Date|     ID|
+-------------------+-------+
|2022-03-19 00:00:00|Ax3838J|
|2022-03-11 00:00:00|Ax3838J|
|2021-11-01 00:00:00|Ax3838J|
|2021-10-27 00:00:00|Ax3838J|
|2021-10-25 00:00:00|Bz3838J|
|2021-10-22 00:00:00|Bz3838J|
|2021-10-18 00:00:00|Bz3838J|
|2021-10-15 00:00:00|Rr742uL|
|2021-09-22 00:00:00|Rr742uL|
+-------------------+-------+

df.groupBy("ID").agg(countDistinct("Date").alias("count")).show()
+-------+-----+
|     ID|count|
+-------+-----+
|Rr742uL|    2|
|Ax3838J|    4|
|Bz3838J|    3|
+-------+-----+
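
An equivalent approach, assuming the same df as above, is to drop duplicate (ID, Date) pairs first and then use the plain row count per group:

# deduplicate (ID, Date) pairs, then count rows per ID;
# gives the same result as countDistinct above
df.dropDuplicates(['ID', 'Date']).groupBy('ID').count().show()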

Please let me know if you need any more help, and if this solves your problem, please accept the answer.

Mahesh Gupta
  • Try it now, it should resolve the issue – Mahesh Gupta Mar 29 '22 at 12:32
  • Perfect, thanks. Btw, if you want more, here is an unresolved one: https://stackoverflow.com/questions/71658675/merge-two-dataframes-with-conditions-in-pyspark/71659117#71659117 – sargupta Mar 29 '22 at 12:34
  • I believe this link points back here. You can also upvote the answer, as you have already accepted it – Mahesh Gupta Mar 29 '22 at 12:36
  • Updated the link. For reference, this was the initial problem, which is a subproblem of the one linked above: https://stackoverflow.com/questions/71647549/resampling-datetime-by-date-in-pyspark/71649076?noredirect=1#comment126629011_71649076 – sargupta Mar 29 '22 at 12:37

Try this:

from pyspark.sql.functions import countDistinct

df.groupBy('ID').agg(countDistinct('Date').alias('count')).show()
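
If you are on Spark 3.2 or later, the snake_case name count_distinct should also be available (countDistinct is kept as an alias), so an equivalent spelling would be:

from pyspark.sql.functions import count_distinct

# same aggregation under the newer snake_case name
df.groupBy('ID').agg(count_distinct('Date').alias('count')).show()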