Say I have data in this format in a DataFrame (df):

id      sta                   end                   dur
40433   2020-01-08 05:06:01   2020-01-08 05:08:14   133
40433   2020-09-22 12:01:26   2020-09-22 12:31:34   1808
40433   2020-09-22 12:05:00   2020-09-22 13:05:00   3600

Either in the same df or in a new one, I want to produce records that look like this:

id      sta                  end                   h1  dur
40433   2020-01-08 05:06:01  2020-01-08 05:08:14   05  133
40433   2020-09-22 12:01:26  2020-09-22 12:31:34   12  1808
40433   2020-09-22 12:05:00  2020-09-22 13:05:00   12  3300
40433   2020-09-22 12:05:00  2020-09-22 13:05:00   13  300

dur is in seconds.

I want to group by id, then by day (extracted from sta), then by hour (h1, h2, etc.), summing the dur that falls within each hour for that id.

oompaloompa
  • How are you specifically trying to create your bins? What are the criteria for the value in h1 being 12 vs 13. It's not clear to me from your example. – Nick Becker Dec 15 '20 at 16:49
  • Ah ok.. for the last row, it's 12 & 13 because those are the hours in which there's activity. Specifically, 5 mins of activity in h12, and 5 mins of activity in h13. – oompaloompa Dec 15 '20 at 22:56

1 Answer

Revised answer based on your comments. For a quicker turnaround, I went with array math plus a few conversions after trying a few other methods. There may be a more efficient way, and I'm not sure how it performs at scale, but it works. One caveat: if a duration goes over 24 hours in total, every hour column would be the full 60 minutes, so I left that case alone for you to deal with as you see fit:

import cudf
import cupy as cp

#If your duration goes over 24 hours total, ALL hour column values will be all 60 minutes.

sta = ['2020-01-08 05:06:01', '2020-09-22 12:01:26', '2020-09-22 12:05:00', '2020-09-22 01:15:00', '2020-09-22 21:05:00']
end = ['2020-01-08 05:08:14', '2020-09-22 12:31:34', '2020-09-22 13:05:00', '2020-09-22 08:05:00', '2020-09-23 01:05:00']

#put it in a dataframe
df = cudf.DataFrame({'sta': sta, 'end':end})
print(df.head())

#the object is a string, so let's convert it to date time
df['sta']= df['sta'].astype('datetime64[s]')
df['end']=df['end'].astype('datetime64[s]')

df['dur']=(df['end']-df['sta']).astype('int64')

#create new df of same type to convert to cupy (to preserve datetime values)
df2=cudf.DataFrame() 
df2['dur']=(df['end']-df['sta']).astype('int64')
df2['min_sta'] =df['sta'].dt.minute.astype('int64')
df2['min_end']= df['end'].dt.minute.astype('int64')
df2['h_sta']= df['sta'].dt.hour.astype('int64')
df2['h_end']= df['end'].dt.hour.astype('int64')
df2['day']=df['sta'].dt.day.astype('int64')
print(df2)

#convert df2's values from df to cupy array (you can use numpy if on pandas)
a = cp.fromDlpack(df2.to_dlpack())
print(a)

#create new temp cupy array b to contain minute duration per hour.  This algo will work with numpy by using numpy instead of cupy
b = cp.zeros((len(a),24))
for j in range(0,len(a)):
    hours = int((a[j][0]/3600)+(a[j][1]/60))
    if(hours==0): # within same hour
        b[j][a[j][3]] = int(a[j][0]/60)
    elif(hours==1): #you could probably delete this condition.
        b[j][a[j][3]] = 60-a[j][1]
        b[j][a[j][4]] = a[j][2]
    else:
        b[j][a[j][3]] = 60-a[j][1]
        if(hours<24): #all array elements will be all 60 minutes if duration is over 24 hours
            if(a[j][3]+hours<24):
                b[j][a[j][3]+1:a[j][3]+hours]=60
                b[j][a[j][4]] = a[j][2]
            else:
                b[j][a[j][3]+1:24]=60
                b[j][0:(a[j][3]+1+hours)%24]=60
                b[j][a[j][4]] = a[j][2]
# bring cupy array b back to a df. 
reshaped_arr = cp.asfortranarray(b)
cpdf = cudf.from_dlpack(reshaped_arr.toDlpack())
print(cpdf.head())

#concat the original and cupy df
df = cudf.concat([df, cpdf], axis=1)
print(df.head())
#you can rename the columns with "h" as you wish
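
For comparison, here is a minimal CPU-only sketch in plain Python (standard library only, not cuDF) that walks the hour boundaries directly. It is an illustration under assumptions, not a drop-in replacement for the code above: the helper names `split_by_hour` and `seconds_per_id_day_hour` are hypothetical, timestamps are assumed to use the `YYYY-MM-DD HH:MM:SS` format from the question, results are in seconds rather than minutes, and each piece is attributed to the calendar date it actually falls on rather than to the date of sta. Because it steps from one hour boundary to the next, durations over 24 hours need no special case.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format, as in the question


def split_by_hour(sta, end):
    """Yield (date, hour, seconds) for each clock hour that [sta, end) touches."""
    cur = datetime.strptime(sta, FMT)
    stop = datetime.strptime(end, FMT)
    while cur < stop:
        # Start of the next clock hour after `cur`.
        next_hour = cur.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
        piece_end = min(next_hour, stop)
        yield cur.date(), cur.hour, int((piece_end - cur).total_seconds())
        cur = piece_end


def seconds_per_id_day_hour(rows):
    """Sum seconds per (id, date, hour) over an iterable of (id, sta, end) rows."""
    totals = defaultdict(int)
    for id_, sta, end in rows:
        for day, hour, secs in split_by_hour(sta, end):
            totals[(id_, day, hour)] += secs
    return dict(totals)


# The three sample rows from the question.
rows = [
    (40433, "2020-01-08 05:06:01", "2020-01-08 05:08:14"),
    (40433, "2020-09-22 12:01:26", "2020-09-22 12:31:34"),
    (40433, "2020-09-22 12:05:00", "2020-09-22 13:05:00"),
]
for key, secs in sorted(seconds_per_id_day_hour(rows).items()):
    print(key, secs)
```

With the question's sample rows, the third interval contributes 3300 seconds to hour 12 and 300 seconds to hour 13, matching the desired output in the question. A pure-Python loop like this will be far slower than array code on large inputs, so treat it as a reference for checking results rather than as a performance suggestion.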
TaureanDyerNV
  • Hi, I'm not sure it's this simple. I've done this, but my intent is to capture the minutes in each hour for which the 'sta' to 'end' spans. Example: 4 2020-09-22 12:05:00 2020-09-23 14:05:00 3600 12 13 14 15 h12 has 5 (mins), h13 has 60, h14 has 5 mins. I want to cut the duration (12:05 to 14:05) into hour bins. Hope that makes sense... – oompaloompa Dec 15 '20 at 22:59
  • Ah, much clearer. will revise my answer. – TaureanDyerNV Dec 16 '20 at 23:38
  • FYI, your proposed hourly bin scheme will break on durations> 23 hours. Is that okay? – TaureanDyerNV Dec 17 '20 at 10:45
  • Thank you, you're an absolute wizard. If only there was a way to speed up groupby and wide_to_long ... – oompaloompa Jan 01 '21 at 03:29
  • Hey @TaureanDyerNV, would it be possible to do this with a sparse matrix 'b'? – oompaloompa Jan 11 '21 at 03:13
  • probably. I was just trying to help you get started :). Cupy supports sparse matrix operations using cuSPARSE: https://docs.cupy.dev/en/stable/reference/sparse.html. – TaureanDyerNV Jan 20 '21 at 20:41
  • Of course, just meant to ask if this was possible w/ both numpy & cupy. At the moment, I'm looking at numba instead of cupy. Cheers! – oompaloompa Jan 22 '21 at 06:28
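
On the sparse-matrix question in the comments: most of the 24 hour slots in each row of `b` are zero, so one way to picture a sparse version is as coordinate (COO) triplets of row, hour, and minutes. The sketch below is plain Python with a hypothetical helper, `dense_to_coo`, and it only builds the triplets. As far as I know, both `scipy.sparse.coo_matrix` and `cupyx.scipy.sparse.coo_matrix` accept a `(data, (row, col))` tuple plus a `shape`, but that step is an assumption and is not run here.

```python
def dense_to_coo(b):
    """Return (data, rows, cols) for the non-zero entries of a 2-D list of numbers."""
    data, rows, cols = [], [], []
    for i, row in enumerate(b):
        for hour, minutes in enumerate(row):
            if minutes:  # keep only hours that have activity
                data.append(minutes)
                rows.append(i)
                cols.append(hour)
    return data, rows, cols


# Two example rows of per-hour minutes, laid out like the dense array b.
b = [[0] * 24 for _ in range(2)]
b[0][12], b[0][13] = 55, 5  # 12:05 -> 13:05
b[1][21], b[1][22], b[1][23], b[1][0], b[1][1] = 55, 60, 60, 60, 5  # 21:05 -> 01:05 next day

data, rows, cols = dense_to_coo(b)
print(data, rows, cols)
# Assumed follow-up, not executed here:
#   scipy.sparse.coo_matrix((data, (rows, cols)), shape=(len(b), 24))
```

Only 7 of the 48 cells are stored in this example. The saving grows with the number of rows as long as most intervals span only a few hours.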