I have a pandas DataFrame "df" with two columns, ["patientVisit_id", "demandTime"], where patientVisit_id is an int64 column and demandTime is a datetime column.

I want to write a Python function that counts the patientVisit_id values for each day, from hour X for the next Y hours. The output should have two columns: ds, the date (from the minimum date in the data to the maximum date), and total_patient, the number of patients in the window that starts at hour X on ds and lasts Y hours.

If Y is greater than 24, the windows of consecutive days overlap, which is fine.

import pandas as pd

df = pd.DataFrame({
    "patientVisit_id": [1, 2, 3, 4, 5, 6, 7, 8, 9],  # one id per timestamp below
    "demandTime": pd.to_datetime([
        "2023-06-06 06:00:00", "2023-06-06 07:00:00", "2023-06-06 08:00:00",
        "2023-06-06 09:00:00", "2023-06-06 10:00:00", "2023-06-07 02:00:00",
        "2023-06-07 12:00:00", "2023-06-07 13:00:00", "2023-06-07 14:00:00"
    ])
})

def group_by_duration(df, x, y):
   return output_df

Given the example dataframe above, if x = 6 and y = 22, the output would be

import datetime as dt
df = pd.DataFrame({
    "ds": [dt.date(2023,6,6), dt.date(2023,6,7)],
    "total_patient": [6, 3]
})

Since from 2023/6/6 6AM to 2023/6/7 4AM (next 22 hours) there are 6 patients and from 2023/6/7 6AM to 2023/6/8 4AM there are 3 patients.

I tried resample and pd.crosstab, but neither seemed to work. I suspect it still needs some kind of groupby, but I could not figure it out. Many thanks in advance.

MIMIGA

1 Answer

Because a window can extend into the next day, process each day separately and count the patients by summing the True values in a boolean mask:

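def group_by_duration(df, x, y):
    out = []
    # One window per calendar day that appears in the data
    for d in df['demandTime'].dt.normalize().drop_duplicates():
        start = d + pd.Timedelta(x, 'hour')
        end = start + pd.Timedelta(y, 'hour')
        # between() includes both endpoints by default
        m = df['demandTime'].between(start, end)
        # True counts as 1, so the sum is the number of visits in the window
        out.append((d.date(), m.sum()))

    return pd.DataFrame(out, columns=["ds", "total_patient"])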

print (group_by_duration(df, 6, 22))
           ds  total_patient
0  2023-06-06              6
1  2023-06-07              3
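
If the output should also list days without any visits (every date from the minimum to the maximum, as the question describes) and the Python loop is a concern, one possible alternative is to sort the timestamps once and locate each window with `np.searchsorted`. This is only a sketch: the name `group_by_duration_vectorized` is made up for illustration, it assumes demandTime contains no missing values, and it keeps the same inclusive endpoints as `between`.

```python
import numpy as np
import pandas as pd


def group_by_duration_vectorized(df, x, y):
    # Sorted timestamps let us count rows in a window with two binary searches.
    times = np.sort(df["demandTime"].to_numpy())
    # Every calendar day between the first and last visit, even days with no rows.
    days = pd.date_range(df["demandTime"].min().normalize(),
                         df["demandTime"].max().normalize(), freq="D")
    start = days + pd.Timedelta(hours=x)
    end = start + pd.Timedelta(hours=y)
    # side="left" / side="right" keep both endpoints inclusive, like Series.between.
    lo = np.searchsorted(times, start.to_numpy(), side="left")
    hi = np.searchsorted(times, end.to_numpy(), side="right")
    return pd.DataFrame({"ds": days.date, "total_patient": hi - lo})


# Sample data copied from the question (nine visits).
sample = pd.DataFrame({
    "patientVisit_id": range(1, 10),
    "demandTime": pd.to_datetime([
        "2023-06-06 06:00:00", "2023-06-06 07:00:00", "2023-06-06 08:00:00",
        "2023-06-06 09:00:00", "2023-06-06 10:00:00", "2023-06-07 02:00:00",
        "2023-06-07 12:00:00", "2023-06-07 13:00:00", "2023-06-07 14:00:00",
    ]),
})
result = group_by_duration_vectorized(sample, 6, 22)
print(result)
```

Because each window is resolved independently, overlapping windows (large y) are handled the same way as short ones.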

If the window never extends past the end of its own day (no overlap), the solution can be changed to:

def group_by_duration(df, x, y):
    # Window bounds for each row's own calendar day
    start = df['demandTime'].dt.normalize() + pd.Timedelta(x, 'hour')
    end = start + pd.Timedelta(y, 'hour')
    # assign() returns a copy, so the caller's df is not modified
    out = df.assign(ds=df['demandTime'].dt.date,
                    total_patient=df['demandTime'].between(start, end))

    return out.groupby('ds', as_index=False)["total_patient"].sum()

print (group_by_duration(df, 6, 5))

           ds  total_patient
0  2023-06-06              5
1  2023-06-07              0
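
One detail worth checking in both versions: `Series.between` includes both endpoints by default, so a visit exactly at the end of the window is counted. If "the next Y hours" is meant as a half-open interval, passing `inclusive="left"` (available in pandas 1.3 and later) excludes the end. The timestamps below are made up purely to show the difference.

```python
import pandas as pd

# Hypothetical timestamps: one at the window start, one exactly at the end, one after it.
times = pd.Series(pd.to_datetime([
    "2023-06-06 06:00", "2023-06-06 11:00", "2023-06-06 11:30",
]))
start = pd.Timestamp("2023-06-06 06:00")
end = start + pd.Timedelta(hours=5)  # 11:00

# Default: both endpoints included, so 06:00 and 11:00 are counted.
closed = int(times.between(start, end).sum())
# Half-open [start, end): the 11:00 visit is excluded.
half_open = int(times.between(start, end, inclusive="left").sum())

print(closed, half_open)
```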
jezrael
  • If there was no overlapping (y<=24), would it be possible not to use the loop? – MIMIGA Jun 07 '23 at 06:44
  • @MIMIGA - Yes, then a similar solution is possible; the answer was edited. – jezrael Jun 07 '23 at 06:49
  • @MIMIGA - hmmm, overlapping means the count needs rows from the next day, e.g. the count for `2023-06-06` includes some rows from `2023-06-07`. If the end datetime always falls on the same date, there is no overlap. So here, with a start of `6:00`, y should be at most 17, which ends at `23:00`. – jezrael Jun 07 '23 at 06:53
  • The key to this question is that the next day is included. Grouping by the current day only would make it a lot easier. – MIMIGA Jun 07 '23 at 16:49
  • @MIMIGA - Then it is necessary to use the first solution. – jezrael Jun 08 '23 at 07:55