I am fairly new to Python. I have a table of orders, each row carrying a timestamp. However, what is logically "one" order is split across many rows, because its parts may be processed at different times and therefore receive different order ids.
The goal is to aggregate the data back to the order level using a time window: as long as subsequent order_ids from one user fall within one hour of that user's first order in the group, they should be counted as truly "one order".
So far, I tried to aggregate on an hourly basis by building a key from user_id and incoming_date_time_s, but this is not as precise as desired, since it groups by calendar hour rather than by a window starting at the first order:
df_1["incoming_date_time_s"] = df_1["incoming_date_time"].dt.floor("H")  # truncate to the calendar hour
Here is the DataFrame of interest:
import pandas as pd
import datetime
df_1 = pd.DataFrame({
    'user_id': ['01', '02', '03', '03', '01', '01', '03', '01', '03', '01'],
    'incoming_date_time': ['2018-10-31 00:01:56', '2018-10-31 00:03:04',
                           '2018-10-31 00:03:32', '2018-10-31 00:03:38',
                           '2018-10-31 00:03:40', '2018-10-31 00:03:53',
                           '2018-10-31 00:03:48', '2018-10-31 00:04:04',
                           '2018-10-31 00:05:20', '2018-10-31 00:05:31'],
    'order_id': [123, 124, 125, 126, 127, 128, 129, 131, 132, 132],
    'Supposed_counting_result': [1, 1, 1, 1, 2, 2, 1, 2, 2, 3]})
df_1['incoming_date_time'] = pd.to_datetime(df_1['incoming_date_time'])
The expected aggregation result is shown in the "Supposed_counting_result" column. Is there a reasonable way to perform this grouping? Thanks for any ideas.
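One direction I have been experimenting with is a per-user loop that opens a new group whenever an order falls outside the window measured from the group's first order. This is only a sketch: the names `WINDOW`, `assign_groups`, and `order_group` are my own, and note that the toy data above spans only minutes, so a one-minute window is used here to reproduce "Supposed_counting_result" (on the real data this would be `pd.Timedelta(hours=1)`):

```python
import pandas as pd

# Rebuild the example data from the question.
df_1 = pd.DataFrame({
    'user_id': ['01', '02', '03', '03', '01', '01', '03', '01', '03', '01'],
    'incoming_date_time': ['2018-10-31 00:01:56', '2018-10-31 00:03:04',
                           '2018-10-31 00:03:32', '2018-10-31 00:03:38',
                           '2018-10-31 00:03:40', '2018-10-31 00:03:53',
                           '2018-10-31 00:03:48', '2018-10-31 00:04:04',
                           '2018-10-31 00:05:20', '2018-10-31 00:05:31'],
    'order_id': [123, 124, 125, 126, 127, 128, 129, 131, 132, 132],
    'Supposed_counting_result': [1, 1, 1, 1, 2, 2, 1, 2, 2, 3]})
df_1['incoming_date_time'] = pd.to_datetime(df_1['incoming_date_time'])

# One minute for the toy data; pd.Timedelta(hours=1) for the real data.
WINDOW = pd.Timedelta(minutes=1)

def assign_groups(times):
    """Number one user's orders: open a new group whenever an order
    falls outside WINDOW of the current group's first order."""
    groups, group_no, anchor = [], 0, None
    for t in times:                      # times is sorted per user
        if anchor is None or t - anchor > WINDOW:
            group_no += 1
            anchor = t                   # new group, anchored at its first order
        groups.append(group_no)
    return groups                        # same length as input -> fine for transform

df_1 = df_1.sort_values(['user_id', 'incoming_date_time'])
df_1['order_group'] = (df_1.groupby('user_id')['incoming_date_time']
                           .transform(assign_groups))
df_1 = df_1.sort_index()                 # restore the original row order
print(df_1[['user_id', 'incoming_date_time', 'order_group',
            'Supposed_counting_result']])
```

On this sample, `order_group` matches "Supposed_counting_result" row for row; `order_group` can then serve as the aggregation key, e.g. `df_1.groupby(['user_id', 'order_group'])`.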