I have a dataframe. Here is a very limited example.
userID  scoreMetric  timeStamp
123     5            2020-05-16 22:03:19
313     1            2020-05-16 22:03:20
123     4            2020-05-16 22:13:20
313     2            2020-05-16 22:13:20
123     2            2020-05-16 22:32:20
555     5            2020-05-16 22:32:20
123     2            2020-05-16 22:32:56
Here is code to set up that test dataframe:
import pandas as pd
data = {
    'userID': [123, 313, 123, 313, 123, 555, 123],
    'scoreMetric': [5, 1, 4, 2, 2, 5, 2],
    'timeStamp': ['2020-05-16 22:03:19', '2020-05-16 22:03:20', '2020-05-16 22:13:20',
                  '2020-05-16 22:13:20', '2020-05-16 22:32:20', '2020-05-16 22:32:20',
                  '2020-05-16 22:32:56'],
}
df = pd.DataFrame(data, columns=['userID', 'scoreMetric', 'timeStamp'])
# parse the timestamp strings into proper datetimes
df.timeStamp = pd.to_datetime(df.timeStamp)
What I want, for each time block, is the average of the per-user averages: take each user's mean score within the block, then divide the sum of those means by the number of users active in that block.
Let's take 22:00:01-22:30:00 as our time block.
We have 2 users here (123, 313), each with two time stamps. So I would want [ (5+4)/2 + (1+2)/2 ] / 2 = [ 4.5 + 1.5 ] / 2 = 3 as my returned value.
Now we take 22:30:01 to 23:00:00. We again have 2 users (123, 555), this time with differing numbers of time stamps. So I would want [ (2+2)/2 + (5)/1 ] / 2 = [ 2 + 5 ] / 2 = 3.5.
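To make those numbers concrete, here is a brute-force version of the calculation on the test data above, with the block boundaries hard-coded and a throwaway helper name, just for illustration:

def block_score(frame, start, end):
    # keep only the rows that fall inside the time block
    block = frame[(frame.timeStamp > pd.Timestamp(start)) & (frame.timeStamp <= pd.Timestamp(end))]
    # average per user first, then average those per-user averages
    return block.groupby('userID').scoreMetric.mean().mean()

print(block_score(df, '2020-05-16 22:00:00', '2020-05-16 22:30:00'))  # 3.0
print(block_score(df, '2020-05-16 22:30:00', '2020-05-16 23:00:00'))  # 3.5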
I think there should be an efficient way to do this via resampling.
For example, when I just calculate the averages I can do
df.set_index('timeStamp').scoreMetric.resample('30min').mean().fillna(0)
Which then gives me the mean over all score metrics, pooled across users, within each 30-minute window.
I've tried a few things already, but nothing seems to work out for doing the resampling while first doing the calculations on a per-user basis (the kind of thing I've been trying is sketched below).
Is there no convenient way to do this as a simple statement like that? Or do I need to manually chunk apart the time series and then do the averages etc. piecewise?
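For reference, the kind of thing I've been experimenting with looks roughly like this (a rough sketch: I'm not sure the grouping is right, and the bin edges fall on 22:00:00 / 22:30:00 rather than the 22:00:01 / 22:30:01 boundaries described above):

per_user = ( df.set_index('timeStamp')
               .groupby([pd.Grouper(freq='30min'), 'userID'])
               .scoreMetric.mean() )       # per-user mean within each 30-minute bucket
result = per_user.groupby(level=0).mean()  # mean of those per-user means per bucket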
Thanks!