I would like to calculate the distribution of a an item based on how long does it take between the first and the last order of that item. To reach that goal though, first I have to get that time delta for each item.
My initial dataframe has three columns: "Order_ID","Order_DATE","Medium_ID", as the following example:
df = pd.DataFrame({'Medium_ID': {0: '1359',
1: '1360',
2: '1359',
3: '1360',
4: '1360',
5: '1404',
6: '1381',
7: '1359',
8: '1419',
9: '1360'},
'Order_ID': {0: '1',
1: '2',
2: '3',
3: '4',
4: '5',
5: '6',
6: '7',
7: '8',
8: '9',
9: '10'},
'Order_DATE': {0: Timestamp('2008-04-21 00:00:00'),
1: Timestamp('2008-04-21 00:00:00'),
2: Timestamp('2008-04-21 00:00:00'),
3: Timestamp('2008-04-21 00:00:00'),
4: Timestamp('2008-04-22 00:00:00'),
5: Timestamp('2008-04-22 00:00:00'),
6: Timestamp('2008-04-23 00:00:00'),
7: Timestamp('2008-04-23 00:00:00'),
8: Timestamp('2008-04-23 00:00:00'),
9: Timestamp('2008-04-28 00:00:00')}}))
Because we can have multiple order_IDs for the same medium_ID, first thing I have tried to group by the column "Medium_ID", but then I don't know how to evolve.
I would like to have a new dataframe with two columns: "Medium_ID" and "Days_between_the_last_and_the_first-order" and, in the end, to show the distribution for the series "Days_between_the_last_and_the_first-order".