2

I am a fairly new in operating with python. There is a table of orders with specific time flag. However, the "correct" order was split to many rows since orders may be processed at different time and thus has different order ids.

The final goal is to identify or aggregate data on order level according to a specific timespan, that is: as long as the following order_ids from one user are within one hour from the first order of this user, data is supposed to be identified as truely "one order".

So far, I tried to aggregate on a hourly basis and built a key with user_id and incoming_date_time_s, nevertheless it is not as precise as desired

df_1["incoming_date_time_s"] = df_1["incoming_date_time"].apply(lambda x: datetime.datetime(x.year, x.month, x.day, x.hour))

Here is the dataframe of interest

import pandas as pd
import datetime

df_1 = pd.DataFrame({'user_id':['01', '02', '03', '03', '01', '01',  '03', '01', '03', '01'],
'incoming_date_time':['2018-10-31 00:01:56', '2018-10-31 00:03:04', '2018-10-31 00:03:32', '2018-10-31 00:03:38','2018-10-31 00:03:40', '2018-10-31 00:03:53', '2018-10-31 00:03:48', '2018-10-31 00:04:04', '2018-10-31 00:05:20', '2018-10-31 00:05:31'], 
'order_id':[123, 124, 125, 126, 127, 128, 129, 131, 132, 132], 
'Supposed_counting_result':[1, 1, 1, 1, 2, 2, 1, 2, 2, 3]})

df_1['incoming_date_time'] = pd.to_datetime(df_1['incoming_date_time'])

You may see the expected aggregating result should be like in "Supposed_counting_result". Is there a reasonable way to perform this aggregation? Thanks for any idea.

Chris
  • 29,127
  • 3
  • 28
  • 51
Stack_Javi
  • 23
  • 4
  • Seems like `Supposed_counting_result` is based on different **minutes** as opposed to the OP: _within one hour__ – Chris Apr 08 '19 at 12:26
  • Just asking because could help others: the data is supposed to be: order_id[x] has arrived in incoming_date_time[x] with/to user_id[x]. While the supposed_counting_result[x] is how many orders user_id[x] at that time has processed. This field count the processed orders and add 1 order to user_id only if last order is older than 1h. Is this correct? – Raikoug Apr 08 '19 at 12:28
  • @Raikoug: yes, you're completly right with your assumption. – Stack_Javi Apr 08 '19 at 12:44
  • @Chris: it is based on the first order_item of each order, and all order_items within one hour from the first order_item. For instance, the 2nd order of user 01 "starts" at 3:40, however the order_items from 3:53 and 4:04 are within one hour starting from 3:40 and therefore considered as "one" order – Stack_Javi Apr 08 '19 at 12:46
  • @Stack_Javi In the dataset you provided, there's no difference in _hours_, all records are in 00:MM:SS and yet `Supposed_counting_result` is increasing. Can you explain your logic more precisely? – Chris Apr 08 '19 at 12:48
  • I've just another question before giving some code examples: how does you get data, or how is it provided? Strings? Json? XML? How is that formatted? – Raikoug Apr 08 '19 at 12:59
  • @Raikoug: the original data comes from sql queries, but this is replicated data – Stack_Javi Apr 08 '19 at 13:06
  • @Stack_Javi would be correct to assume your db table is an orders table and each row has "user_id", "incoming_date_time" and "order_id" ? – Raikoug Apr 08 '19 at 13:25
  • @Raikoug: yes, correct – Stack_Javi Apr 08 '19 at 14:09

0 Answers0