
I am new to PySpark.

I am trying to perform a groupBy operation to get an aggregated count, but I am not able to group by a time frequency. I need to perform a groupBy using the fields "CAPTUREDTIME, NODE, CHANNEL, LOCATION, TACK", and the grouping should be done hourly, daily, weekly, and monthly based on the "CAPTUREDTIME" field.

Please find the sample data below:

+-----------------+--------+-------+--------+--------+
|CAPTUREDTIME     |NODE    |CHANNEL|LOCATION|TACK    |
+-----------------+--------+-------+--------+--------+
|20-05-09 03:06:21|PUSC_RES|SIMPLEX|NORTH_AL|UE220034|
|20-05-09 04:33:04|PUSC_RES|SIMPLEX|SOUTH_AL|UE220034|
|20-05-09 12:04:52|TESC_RES|SIMPLEX|NORTH_AL|UE220057|
|20-05-10 04:24:09|TESC_RES|SIMPLEX|NORTH_AL|UE220057|
|20-05-10 04:33:04|PUSC_RES|SIMPLEX|SOUTH_AL|UE220034|
|20-04-09 10:57:48|TESC_RES|SIMPLEX|NORTH_AL|UE220057|
|20-04-09 12:12:26|TESC_RES|SIMPLEX|NORTH_AL|UE220057|
|20-04-09 03:26:33|PUSC_RES|SIMPLEX|NORTH_AL|UE220071|
+-----------------+--------+-------+--------+--------+

I have used the below PySpark code:

import pyspark.sql.functions as func

df = df.groupby("CAPTUREDTIME", "NODE", "CHANNEL", "LOCATION", "TACK").agg(
    func.count("TACK").alias("count")
)

How can I extend the above code to group hourly, daily, weekly, and monthly?

I require the output in the below format (sample output shared):

HOURLY:

+-----------------+--------+-------+--------+--------+-----+
|CAPTUREDTIME     |NODE    |CHANNEL|LOCATION|TACK    |COUNT|
+-----------------+--------+-------+--------+--------+-----+
|20-05-09 03:00:00|PUSC_RES|SIMPLEX|NORTH_AL|UE220034|2    |
|20-05-09 04:00:00|PUSC_RES|SIMPLEX|SOUTH_AL|UE220034|2    |
+-----------------+--------+-------+--------+--------+-----+

DAILY:

+-----------------+--------+-------+--------+--------+-----+
|CAPTUREDTIME     |NODE    |CHANNEL|LOCATION|TACK    |COUNT|
+-----------------+--------+-------+--------+--------+-----+
|20-05-09 00:00:00|PUSC_RES|SIMPLEX|NORTH_AL|UE220034|1    |
|20-05-09 00:00:00|PUSC_RES|SIMPLEX|SOUTH_AL|UE220034|2    |
|20-05-09 00:00:00|TESC_RES|SIMPLEX|NORTH_AL|UE220057|3    |
+-----------------+--------+-------+--------+--------+-----+

WEEKLY:

+-----------------+--------+-------+--------+--------+-----+
|CAPTUREDTIME     |NODE    |CHANNEL|LOCATION|TACK    |COUNT|
+-----------------+--------+-------+--------+--------+-----+
|20-05-09 00:00:00|PUSC_RES|SIMPLEX|NORTH_AL|UE220034|1    |
+-----------------+--------+-------+--------+--------+-----+

MONTHLY:

+-----------------+--------+-------+--------+--------+-----+
|CAPTUREDTIME     |NODE    |CHANNEL|LOCATION|TACK    |COUNT|
+-----------------+--------+-------+--------+--------+-----+
|20-05-09 00:00:00|PUSC_RES|SIMPLEX|NORTH_AL|UE220034|1    |
+-----------------+--------+-------+--------+--------+-----+

stacktesting
  • Take a look at https://stackoverflow.com/questions/30949202/spark-dataframe-timestamptype-how-to-get-year-month-day-values-from-field – Lars Skaug Jul 30 '20 at 11:46
  • you need to create the column you want ie "hourly","daily","weekly", "monthly" from your datetime column and then group by – Steven Jul 30 '20 at 12:14
  • @LarsSkaug: The link tells how to split the field. Here I need to perform groupby on the existing fields. It did not address this question – stacktesting Jul 30 '20 at 13:05
  • What is the expected answer? Can you please share in a table format for better understanding – dsk Jul 30 '20 at 17:44

2 Answers


Spark provides a relatively rich library for date manipulation. The answer to your question is a combination of extracting date parts and formatting dates for display.

I re-created your data as follows:

import spark.implicits._  // needed for .toDS when running outside spark-shell

val capturesRaw = spark.read
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .option("delimiter", "|")
  .option("header", "true")
  .csv(spark.sparkContext.parallelize("""
      CAPTUREDTIME|      NODE|       CHANNEL  |  LOCATION|    TACK
      20-05-09 03:06:21|   PUSC_RES|   SIMPLEX|  NORTH_AL|    UE220034
      20-05-09 04:33:04|   PUSC_RES|   SIMPLEX|  SOUTH_AL|    UE220034
      20-05-09 12:04:52|   TESC_RES|   SIMPLEX|  NORTH_AL|    UE220057
      20-05-10 04:24:09|   TESC_RES|   SIMPLEX|  NORTH_AL|    UE220057
      20-05-10 04:33:04|   PUSC_RES|   SIMPLEX|  SOUTH_AL|    UE220034
      20-04-09 10:57:48|   TESC_RES|   SIMPLEX|  NORTH_AL|    UE220057
      20-04-09 12:12:26|   TESC_RES|   SIMPLEX|  NORTH_AL|    UE220057
      20-04-09 03:26:33|   PUSC_RES|   SIMPLEX|  NORTH_AL|    UE220071"""
  .split("\n")).toDS)

Note: I use Scala, but the difference in the code is small enough that I hope you find it understandable. I believe the val at the beginning is in fact the only difference.

I assume the first two digits represent a two-digit year? To proceed, we need to make sure capturedtime is a timestamp. I prefer to use SQL to manipulate dataframes, as I find it more readable.

spark.sql("""select to_timestamp('20' || capturedtime) capturedtime, NODE, CHANNEL, 
             LOCATION, TACK from captures_raw""")
    .createOrReplaceTempView("captures_raw")

The same thing can be done on the dataframe directly, if you prefer:

import org.apache.spark.sql.functions.{col, to_timestamp}

capturesRaw.withColumn("capturedtimestamp",
        to_timestamp(col("capturedtime"), "yy-MM-dd HH:mm:ss"))

At this point, we can create the fields you requested:

spark.sql("""select capturedtime,
           month(capturedtime) cap_month, 
           weekofyear(capturedtime) cap_week,   
           day(capturedtime) cap_day, 
           hour(capturedtime) cap_hr, NODE, CHANNEL, LOCATION, TACK 
           from captures_raw""").createOrReplaceTempView("captures")    

With the fields created, we are ready to answer your question. To aggregate by month alone (without the rest of the timestamp), for instance, proceed as follows:

spark.sql("""select date_format(capturedtime, "yyyy-MM") year_month, cap_month,
             cap_week, cap_day, cap_hr, count(*) count
             from captures 
             group by 1,2,3,4,5""").show

Which returns

+----------+---------+--------+-------+------+-----+
|year_month|cap_month|cap_week|cap_day|cap_hr|count|
+----------+---------+--------+-------+------+-----+
|   2020-04|        4|      15|      9|     3|    1|
|   2020-04|        4|      15|      9|    10|    1|
|   2020-05|        5|      19|      9|     4|    1|
|   2020-05|        5|      19|      9|    12|    1|
|   2020-04|        4|      15|      9|    12|    1|
|   2020-05|        5|      19|      9|     3|    1|
|   2020-05|        5|      19|     10|     4|    2|
+----------+---------+--------+-------+------+-----+
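For completeness, a rough PySpark equivalent of the monthly query above (a sketch; it assumes pyspark.sql.functions is imported as func, as in the question, and that the captures temp view is read back via spark.table):

import pyspark.sql.functions as func

captures_df = spark.table("captures")  # the temp view registered above
monthly_summary = captures_df.groupBy(
    func.date_format("capturedtime", "yyyy-MM").alias("year_month"),
    "cap_month", "cap_week", "cap_day", "cap_hr"
).agg(func.count("*").alias("count"))
monthly_summary.show()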

A daily summary can be produced as follows:

spark.sql("""select date_format(capturedtime, "yyyy-MM-dd") captured_date,
             cap_day, cap_hr, count(*) count
             from captures 
             group by 1,2,3""").show

+-------------+-------+------+-----+
|captured_date|cap_day|cap_hr|count|
+-------------+-------+------+-----+
|   2020-05-10|     10|     4|    2|
|   2020-04-09|      9|    12|    1|
|   2020-05-09|      9|     4|    1|
|   2020-05-09|      9|    12|    1|
|   2020-04-09|      9|     3|    1|
|   2020-04-09|      9|    10|    1|
|   2020-05-09|      9|     3|    1|
+-------------+-------+------+-----+
Lars Skaug
  • In the above query, we perform a groupBy on the individual time frequencies, which is fine. But how do we get the original value of "CAPTUREDTIME" in the output for each row? – stacktesting Aug 03 '20 at 06:31
  • Meaning, in a daily aggregate the value of "CAPTUREDTIME" should be the date alone; for weekly, it should be that particular week. – stacktesting Aug 03 '20 at 06:48
  • I updated my answer with examples of a monthly and daily summary. – Lars Skaug Aug 03 '20 at 14:00
  • Can we not use the window function? What would be the issue with df = df.groupby(window('CAPTUREDTIME', '60 minutes'), "NODE", "CHANNEL", "LOCATION", "TACK").agg(func.count("TACK").alias("count")), and likewise with window intervals of '1 day', '1 week', and '1 month'? – stacktesting Aug 05 '20 at 15:06
  • No. I was checking the window options, but I do not know how the window function behaves in this case. – stacktesting Aug 06 '20 at 06:28
  • Please update your question with the result you expect to see. – Lars Skaug Aug 06 '20 at 14:26

You have two ways to address your issue: either you cast your timestamp to the date granularity you want to group by, or (as you said in the comments) you use the SQL window function to group by the interval you'd like.

Just know that monthly aggregations are not possible through the window SQL function in Spark (a quick sketch after the imports below illustrates this).

Here you can see the code: the first three examples use the window SQL function, and the last example casts the timestamp to a monthly granularity and then groups by every column.

df = spark.createDataFrame(
    [
        ("20-05-09 03:06:21", "PUSC_RES", "SIMPLEX", "NORTH_AL", "UE220034"),
        ("20-05-09 04:33:04", "PUSC_RES", "SIMPLEX", "SOUTH_AL", "UE220034"),
        ("20-05-09 12:04:52", "TESC_RES", "SIMPLEX", "NORTH_AL", "UE220057"),
        ("20-05-10 04:24:09", "TESC_RES", "SIMPLEX", "NORTH_AL", "UE220057"),
        ("20-05-10 04:33:04", "PUSC_RES", "SIMPLEX", "SOUTH_AL", "UE220034"),
        ("20-04-09 10:57:48", "TESC_RES", "SIMPLEX", "NORTH_AL", "UE220057"),
        ("20-04-09 12:12:26", "TESC_RES", "SIMPLEX", "NORTH_AL", "UE220057"),
        ("20-04-09 03:26:33", "PUSC_RES", "SIMPLEX", "NORTH_AL", "UE220071")
    ],
    ['CAPTUREDTIME', 'NODE', 'CHANNEL', 'LOCATION', 'TACK']
)

from pyspark.sql.functions import col, count, date_format, date_sub, date_trunc, month, next_day, to_timestamp, weekofyear, window, year
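As a quick aside, the monthly limitation mentioned earlier can be confirmed with this DataFrame; a minimal sketch, assuming the exact error message may vary by Spark version:

# window() does not accept calendar-based intervals such as months,
# which is why the monthly case below is handled by formatting instead
try:
    (df
     .withColumn("captured_time", to_timestamp(col("CAPTUREDTIME"), "yy-MM-dd HH:mm:ss"))
     .groupBy(window(col("captured_time"), "1 month"), "NODE", "CHANNEL", "LOCATION", "TACK")
     .agg(count("*"))
     .show())
except Exception as e:
    print(e)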

Hourly

I keep the window logic just for this one, so that every possibility in Spark is covered for future readers. I only select the start of the window at the end, before showing the dataframe.

hourly = (
    df
    .withColumn("captured_time", to_timestamp(col('CAPTUREDTIME'), 'yy-MM-dd HH:mm:ss'))
    .groupBy(window(col("captured_time"), "1 hour").alias("captured_time"), "NODE", "CHANNEL", "LOCATION", "TACK")
    .agg(count("*"))
    .withColumn("captured_time_hour", col("captured_time.start"))
    .drop("captured_time")
)
hourly.sort("captured_time_hour").show(100, False)

Daily

With the date_trunc function, I can truncate the timestamp to the day only:

daily = (
    df
    .withColumn("captured_time", to_timestamp(col('CAPTUREDTIME'), 'yy-MM-dd HH:mm:ss'))
    .withColumn("captured_time_day", date_trunc("day", col("captured_time")))
    .groupBy("captured_time_day", "NODE", "CHANNEL", "LOCATION", "TACK")
    .agg(count("*"))
)
daily.sort("captured_time_day").show(100, False)

Weekly

This one is a bit trickier. First, I use the next_day function with Monday. If you consider Sunday the start of the week, please update this code accordingly; I consider Monday the start of the week (it depends on SQL dialects and regions, I believe).

Then we can also add a weekofyear function to retrieve the week number, as you wanted:

weekly = (
    df
    .withColumn("captured_time", to_timestamp(col('CAPTUREDTIME'), 'yy-MM-dd HH:mm:ss'))
    .withColumn("start_day", date_sub(next_day(col("captured_time"), "monday"), 7))
    .groupBy("start_day", "NODE", "CHANNEL", "LOCATION", "TACK")
    .agg(count("*"))
    .withColumn("start_day", to_timestamp(col("start_day")))
    .withColumn("week_of_year", weekofyear(col("start_day")))
)
weekly.sort("start_day").show(100, False)

Monthly

We just format the timestamp as a date and then cast it back to a timestamp. This is done only to show another way of doing it; we could also just truncate the timestamp as in the daily use case (see the sketch after the code below). I also show two ways of extracting the month name and abbreviation. Just take care with your Spark version, as this was tested on Spark 3.0.0.

monthly = (
    df
    .withColumn("captured_time", to_timestamp(col('CAPTUREDTIME'), 'yy-MM-dd HH:mm:ss'))
    .withColumn("captured_time_month", date_format(col('captured_time'), '1/M/yyyy'))
    .groupBy(col("captured_time_month"), "NODE", "CHANNEL", "LOCATION", "TACK")
    .agg(count("*").alias("Count TACK"))
    .withColumn("captured_time_month", to_timestamp(col("captured_time_month"), '1/M/yyyy'))
    .withColumn("month", month(col("captured_time_month")))
    .withColumn("month_abbr", date_format(col("captured_time_month"),'MMM'))
    .withColumn("full_month_name", date_format(col("captured_time_month"),'MMMM'))
)
monthly.sort("captured_time_month").show(100, False)
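For reference, the truncation alternative mentioned above would look roughly like this (a sketch; date_trunc("month", ...) snaps each timestamp to the first day of its month):

monthly_alt = (
    df
    .withColumn("captured_time", to_timestamp(col('CAPTUREDTIME'), 'yy-MM-dd HH:mm:ss'))
    .withColumn("captured_time_month", date_trunc("month", col("captured_time")))
    .groupBy("captured_time_month", "NODE", "CHANNEL", "LOCATION", "TACK")
    .agg(count("*").alias("Count TACK"))
)
monthly_alt.sort("captured_time_month").show(100, False)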

Ciao!

tricky
  • Hello @tricky: How can I get the week number for the weekly groupBy? Also, the 'HH:mm:ss' part should be '00:00:00' for the daily, weekly, and monthly aggregations. – stacktesting Aug 13 '20 at 06:48
  • Hey @stacktesting I just edited the code and post according to what you wanted. Cheers :D – tricky Aug 13 '20 at 09:06
  • Thanks @tricky, it worked. But when I change the day in WEEKLY from "monday" to "sunday", it is not working. What am I missing here? – stacktesting Aug 16 '20 at 05:50