I would like to calculate the distribution of items based on how long it takes between the first and the last order of each item. To reach that goal, though, I first have to get that time delta for each item.

My initial dataframe has three columns, "Order_ID", "Order_DATE" and "Medium_ID", as in the following example:

import pandas as pd
from pandas import Timestamp

df = pd.DataFrame({'Medium_ID': {0: '1359',
 1: '1360',
 2: '1359',
 3: '1360',
 4: '1360',
 5: '1404',
 6: '1381',
 7: '1359',
 8: '1419',
 9: '1360'},

 'Order_ID': {0: '1',
 1: '2',
 2: '3',
 3: '4',
 4: '5',
 5: '6',
 6: '7',
 7: '8',
 8: '9',
 9: '10'},

'Order_DATE': {0: Timestamp('2008-04-21 00:00:00'),
 1: Timestamp('2008-04-21 00:00:00'),
 2: Timestamp('2008-04-21 00:00:00'),
 3: Timestamp('2008-04-21 00:00:00'),
 4: Timestamp('2008-04-22 00:00:00'),
 5: Timestamp('2008-04-22 00:00:00'),
 6: Timestamp('2008-04-23 00:00:00'),
 7: Timestamp('2008-04-23 00:00:00'),
 8: Timestamp('2008-04-23 00:00:00'),
 9: Timestamp('2008-04-28 00:00:00')}})

Because we can have multiple Order_IDs for the same Medium_ID, the first thing I tried was to group by the column "Medium_ID", but I don't know how to proceed from there.

I would like to have a new dataframe with two columns, "Medium_ID" and "Days_between_the_last_and_the_first-order", and, in the end, to show the distribution of the series "Days_between_the_last_and_the_first-order".

rafspo

2 Answers

You can calculate the days between first and last order for each item like:

df.groupby('Medium_ID').Order_DATE.apply(lambda x: x.max() - x.min())

Which results in:

Medium_ID
1359   2 days
1360   7 days
1381   0 days
1404   0 days
1419   0 days
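To get from that timedelta result to the histogram the question asks about, the gaps can be converted to whole days with `.dt.days`. A minimal, self-contained sketch (the sample data is rebuilt inline, and the `days` name is mine):

```python
import pandas as pd

# Rebuild the question's sample data
df = pd.DataFrame({
    "Medium_ID": ["1359", "1360", "1359", "1360", "1360",
                  "1404", "1381", "1359", "1419", "1360"],
    "Order_ID": [str(i) for i in range(1, 11)],
    "Order_DATE": pd.to_datetime([
        "2008-04-21", "2008-04-21", "2008-04-21", "2008-04-21",
        "2008-04-22", "2008-04-22", "2008-04-23", "2008-04-23",
        "2008-04-23", "2008-04-28",
    ]),
})

# Max minus min per group gives a Timedelta; .dt.days turns it into an int
days = (
    df.groupby("Medium_ID")["Order_DATE"]
      .apply(lambda x: x.max() - x.min())
      .dt.days
)
print(days)
# days.hist() would then plot the distribution (requires matplotlib)
```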
Franco Piccolo

For the days between the last and the first order date, you can try this:

grouped = (
    df.drop("Order_ID", axis=1)
    .sort_values(["Medium_ID", "Order_DATE"])
    .groupby("Medium_ID")
    .agg(["first", "last"])
)
grouped.columns = ["first_order_date", "last_order_date"]
grouped.reset_index(inplace=True)
grouped["days_between_last_and_first_order"] = (
    grouped["last_order_date"] - grouped["first_order_date"]
).dt.days
grouped = grouped[["Medium_ID", "days_between_last_and_first_order"]]

Or, building on @Franco's solution:

grouped = df.groupby("Medium_ID")["Order_DATE"].apply(
    lambda x: x.max() - x.min()
).to_frame().reset_index().rename(
    {"Order_DATE": "days_between_last_and_first_order"}, axis=1
)
grouped["days_between_last_and_first_order"] = grouped["days_between_last_and_first_order"].dt.days

To visualize the distribution,

grouped.hist(column="days_between_last_and_first_order")
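If matplotlib is not available, a plain frequency table gives the same distribution in text form. A sketch using `value_counts` (the sample data is rebuilt inline, and the `days` and `counts` names are mine):

```python
import pandas as pd

# Rebuild the question's sample data
df = pd.DataFrame({
    "Medium_ID": ["1359", "1360", "1359", "1360", "1360",
                  "1404", "1381", "1359", "1419", "1360"],
    "Order_ID": [str(i) for i in range(1, 11)],
    "Order_DATE": pd.to_datetime([
        "2008-04-21", "2008-04-21", "2008-04-21", "2008-04-21",
        "2008-04-22", "2008-04-22", "2008-04-23", "2008-04-23",
        "2008-04-23", "2008-04-28",
    ]),
})

days = (
    df.groupby("Medium_ID")["Order_DATE"]
      .apply(lambda x: x.max() - x.min())
      .dt.days
)

# How many items share each first-to-last gap, in ascending gap order
counts = days.value_counts().sort_index()
print(counts)
```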
pythonjokeun
  • Thanks! By the part "show the distribution..." I mean showing the histogram of the occurrences for each timedelta. Let me explain: if the dtype of the column _grouped.days_between_last_and_first_order_ were an INT, I could have something like **grouped.hist(column="days_between_last_and_first_order")**, which shows the "distribution" of occurrences. – rafspo Apr 19 '19 at 18:36
  • 1
    @rafspo I've updated my answer to suits your need. So the idea is, extract the number from `days_between_last_and_first_order` column using `.dt.days` – pythonjokeun Apr 19 '19 at 19:31