
I have a dataframe. Here is a very limited example.

userID    scoreMetric    timeStamp    
123       5              2020-05-16 22:03:19
313       1              2020-05-16 22:03:20
123       4              2020-05-16 22:13:20
313       2              2020-05-16 22:13:20
123       2              2020-05-16 22:32:20
555       5              2020-05-16 22:32:20
123       2              2020-05-16 22:32:56

Here is code for testing:

import pandas as pd

data = {
    'userID': [123, 313, 123, 313, 123, 555, 123],
    'scoreMetric': [5, 1, 4, 2, 2, 5, 2],
    'timeStamp': ['2020-05-16 22:03:19', '2020-05-16 22:03:20', '2020-05-16 22:13:20',
                  '2020-05-16 22:13:20', '2020-05-16 22:32:20', '2020-05-16 22:32:20',
                  '2020-05-16 22:32:56'],
}

df = pd.DataFrame(data, columns=['userID', 'scoreMetric', 'timeStamp'])
df.timeStamp = pd.to_datetime(df.timeStamp)

What I want, for each time block, is the sum of the per-user averages divided by the number of users in that block, i.e. the mean of the per-user means.

Let's take 22:00:01-22:30:00 as our time-block.

We have 2 users (123, 313) here, each with two timestamps. So I would want [ (5+4)/2 + (1+2)/2 ] / 2 = [ 4.5 + 1.5 ] / 2 = 3 as my returned value.

Now we take 22:30:01 to 23:00:00. We have 2 users again (123, 555), this time with different numbers of timestamps. So I would want [ (2+2)/2 + (5)/1 ] / 2 = [ 2 + 5 ] / 2 = 3.5
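
For concreteness, those two numbers can be checked by hand on the test dataframe above (just a throwaway verification, not the solution I'm after):

# manual check of the two expected values, using the df built above
block1 = df[(df.timeStamp >= '2020-05-16 22:00:01') & (df.timeStamp <= '2020-05-16 22:30:00')]
block2 = df[(df.timeStamp >= '2020-05-16 22:30:01') & (df.timeStamp <= '2020-05-16 23:00:00')]

print(block1.groupby('userID')['scoreMetric'].mean().mean())  # (4.5 + 1.5) / 2 = 3.0
print(block2.groupby('userID')['scoreMetric'].mean().mean())  # (2.0 + 5.0) / 2 = 3.5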

I think there should be an efficient way to do this via resampling.

For example, when I just calculate the overall averages, I can do

df.set_index('timeStamp').scoreMetric.resample('30min').mean().fillna(0)

which gives me the mean of all score metrics within each 30-minute bin.
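
On the test data above that comes out to (if I'm reading the bins right):

timeStamp
2020-05-16 22:00:00    3.0
2020-05-16 22:30:00    3.0

That averages every row in the bin together, so the second bin is 3.0 rather than the 3.5 I want, because user 555's single score isn't weighted as one full user.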

I've tried a few things already, but nothing seems to work for doing the resampling while first doing the calculations on a per-user basis.

Is there a convenient way to do this in a single statement like that, or do I need to manually chunk apart the time series and then compute the averages piecewise?

Thanks!

Jibril
  • resample, group by time and user, take mean. Take the resulting df, group by time and take mean – Artyom Akselrod Sep 11 '20 at 12:38
  • I can't seem to get that to work. For example, df.set_index('timeStamp').scoreMetric.resample('30T').mean() works fine, but df.set_index('timeStamp').scoreMetric.resample('30T').groupby('userID') gives a TypeError. – Jibril Sep 11 '20 at 12:56
  • the answer is below, but you can try: df.set_index('timeStamp').scoreMetric.resample('30T').reset_index().groupby(['timestamp', 'userID']) – Artyom Akselrod Sep 11 '20 at 13:03

2 Answers


IIUC, use pd.Grouper to group by 30-minute blocks together with userID to obtain the mean value per ID, then another groupby on the time level only to get the mean value per time block:

print (df.groupby([pd.Grouper(key="timeStamp", freq="1800s"), "userID"])
         .mean()["scoreMetric"].groupby(level=0).mean())

timeStamp
2020-05-16 22:00:00    3.0
2020-05-16 22:30:00    3.5
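
For reference, the intermediate result of the first groupby (before the final .groupby(level=0).mean()) is one mean per user per 30-minute block, roughly:

print(df.groupby([pd.Grouper(key="timeStamp", freq="1800s"), "userID"])
        .mean()["scoreMetric"])

timeStamp            userID
2020-05-16 22:00:00  123       4.5
                     313       1.5
2020-05-16 22:30:00  123       2.0
                     555       5.0

The second groupby then averages those per-user means within each timeStamp, which is the "divide by the number of users in the block" step.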
Henry Yik
  • Interesting! This seems to be the result I expected. Do you mind if I explain, to see if I understand? So you use groupby to collectively group the timestamp and the userID. You use pd.Grouper to split the timestamps up first. Then you take the mean, which is applied to each scoreMetric. Then groupby(level=0).mean() means it divides by the number of unique entries? – Jibril Sep 11 '20 at 13:01
  • I added explanation on the above @Jibril. – Henry Yik Sep 11 '20 at 13:04

If I understand correctly, you can do the following:

df.set_index('timeStamp', inplace=True)

Here you can filter the dates:

data = df[(df.index >= '2020-05-16 22:00:01') & (df.index <= '2020-05-16 22:30:00')]

and then:

data.groupby('userID')['scoreMetric'].mean().mean()

output:

3.0
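
To cover every 30-minute window rather than one hand-picked slice, the same filter-then-mean-of-means idea can be looped over window edges, something like this (a rough sketch; the 22:00-23:00 edges are just assumed from the example data):

import pandas as pd

# assumed window edges covering the example data; df still has timeStamp as its index
edges = pd.date_range('2020-05-16 22:00:00', '2020-05-16 23:00:00', freq='30min')
for start, end in zip(edges[:-1], edges[1:]):
    block = df[(df.index > start) & (df.index <= end)]
    if not block.empty:
        # mean per user, then mean across users:
        # prints 3.0 for the 22:00 block and 3.5 for the 22:30 block
        print(start, block.groupby('userID')['scoreMetric'].mean().mean())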
Billy Bonaros