You can do this by truncating each timestamp to the hour before grouping. I opted to create a new column with the floored time and group by it:
Create demo df:
import pandas as pd
from io import StringIO
csv_string = StringIO("""LOB, timestamp, Transaction, Hits
PRO, 2020-09-03 17:51:16, LOGIN, 1
PRO, 2020-09-03 17:51:15, ELG, 1
PRO, 2020-09-03 17:51:12, LOGIN, 4
PRO, 2020-09-03 17:51:13, ELG, 11
PRO, 2020-09-03 17:51:14, LOGIN, 3
PRO, 2020-09-03 17:51:11, ELG, 2
PRO, 2020-09-03 18:51:12, LOGIN, 24
PRO, 2020-09-03 18:51:13, ELG, 21
PRO, 2020-09-03 18:51:14, LOGIN, 23
PRO, 2020-09-03 18:51:11, ELG, 22""" )
df = pd.read_csv(csv_string, sep=",", skipinitialspace=True)
and work with it:
# convert timestamp column to datetime
df["timestamp"] = pd.to_datetime(df["timestamp"])
# create a column with the timestamp truncated to the full hour
# kudos: https://stackoverflow.com/a/43400370/7505395
df["by_hour"] = pd.to_datetime(df["timestamp"].dt.date) + \
                pd.to_timedelta(df["timestamp"].dt.hour, unit="h")
print(df)
# group by, use as index
grouped = df.groupby(by=["by_hour", "Transaction"], as_index=True)
# sum the numeric Hits column and print (numeric_only skips LOB and timestamp,
# which newer pandas versions no longer drop silently)
print(grouped.sum(numeric_only=True))
Output:
   LOB           timestamp Transaction Hits             by_hour
0  PRO 2020-09-03 17:51:16       LOGIN    1 2020-09-03 17:00:00
1  PRO 2020-09-03 17:51:15         ELG    1 2020-09-03 17:00:00
2  PRO 2020-09-03 17:51:12       LOGIN    4 2020-09-03 17:00:00
3  PRO 2020-09-03 17:51:13         ELG   11 2020-09-03 17:00:00
4  PRO 2020-09-03 17:51:14       LOGIN    3 2020-09-03 17:00:00
5  PRO 2020-09-03 17:51:11         ELG    2 2020-09-03 17:00:00
6  PRO 2020-09-03 18:51:12       LOGIN   24 2020-09-03 18:00:00
7  PRO 2020-09-03 18:51:13         ELG   21 2020-09-03 18:00:00
8  PRO 2020-09-03 18:51:14       LOGIN   23 2020-09-03 18:00:00
9  PRO 2020-09-03 18:51:11         ELG   22 2020-09-03 18:00:00
                                Hits
by_hour             Transaction
2020-09-03 17:00:00 ELG           14
                    LOGIN          8
2020-09-03 18:00:00 ELG           43
                    LOGIN         47
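If you'd rather not add a helper column, you can floor the timestamps to the hour on the fly with Series.dt.floor; here is a minimal sketch of the equivalent grouping (assuming a recent pandas, where the hourly frequency alias is "h" rather than the older "H"):
# floor each timestamp to the full hour and group by the floored values
hourly = df.groupby([df["timestamp"].dt.floor("h"), "Transaction"])["Hits"].sum()
print(hourly)
This yields the same sums; the first index level is just named timestamp instead of by_hour. Using pd.Grouper(key="timestamp", freq="h") inside groupby is another option.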