
I have a dataframe that contains a time column in string format.

dataframe=

      time         value
      00:00:00      10
      00:23:00      5
      00:59:00      23 
      01:23:34      34
      01:56:00      34

Every time I try to group by hour on the time column, I get output like this:

 hour    count
  0       38
  1       68

But I want output like this:

 hour    count
  00       38
  01       68

For this I wrote the query below:

    dataframe.groupBy(hour('time')).agg({'value':'count'})
1 Answer


Quoting from "substring multiple characters from the last index of a pyspark string column using negative indexing":

Since your time column is a StringType, we can use substring to get the hour exactly as you want it, and group on it as a string:

    from pyspark.sql.functions import substring, col

    # Spark's substring is 1-based: take the first two characters of the string as the hour
    df = df.withColumn("hour", substring(col("time"), 1, 2))
    group_df = df.groupby("hour").sum("value")  # or whichever aggregation you want
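
As a quick check, here is a minimal self-contained sketch (assuming a local SparkSession and the sample rows from your question) that builds the hour column this way and then applies the same count aggregation you used; because the hour stays a string, the leading zeros are preserved:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import substring, col

    spark = SparkSession.builder.getOrCreate()

    # sample data from the question
    data = [("00:00:00", 10), ("00:23:00", 5), ("00:59:00", 23),
            ("01:23:34", 34), ("01:56:00", 34)]
    df = spark.createDataFrame(data, ["time", "value"])

    # first two characters of the string keep the leading zero ("00", "01", ...)
    df = df.withColumn("hour", substring(col("time"), 1, 2))

    # same aggregation as in the question: count of value per hour
    df.groupBy("hour").agg({"value": "count"}).orderBy("hour").show()
    # +----+------------+
    # |hour|count(value)|
    # +----+------------+
    # |  00|           3|
    # |  01|           2|
    # +----+------------+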