1

Suppose I have a first df like this:

df1:

 item     date1         date2
   1    2020-06-21    2020-06-28
   2    2020-05-13    2020-05-24
   3    2020-06-20    2020-06-28

I also have a second df (df2) with the items, a date and a quantity

df2:

 item     quantity       date
   1         5        2020-06-24
   1         8        2020-06-20
   1         12       2020-06-27
   1         9        2020-06-29
   2         10       2020-05-24
   2         11       2020-05-15
   2         18       2020-05-18
   2         9        2020-05-14
   3         7        2020-06-18
   3         12       2020-06-21
   3         13       2020-06-24
   3         8        2020-06-28

Now I want to sum the quantities from df2 where the date is between the columns date1 and date2. So my result would look like:

df3:

 item     date1         date2        sum
   1    2020-06-21    2020-06-28     17
   2    2020-05-13    2020-05-24     48
   3    2020-06-20    2020-06-28     33

I've been starring at it for a while now and I really want to avoid a loop.

Is there an efficient way of obtaining the desired result??

  • I don't have time to type up a solid answer, but if you merge `tmp = pd.merge(df2, df1, on='item', how='left')` you'l get a single dataframe that has all three date columns on each row. You can handle selecting all the rows that have a `date` between the `date1` and `date2` a number of ways, but something like `tmp = tmp.query("(date1 < date) AND (date < date2)")` should work. Then do `tmp.groupby('item')['quantity'].sum()` to get the sums. Finally, merge that back against the original `df1` on `item` to get your answer. – Rick M Jan 28 '21 at 14:33

1 Answers1

2
df = df2.merge(df1, on = 'item', how = 'left')
df[['date', 'date1', 'date2']] = df[['date', 'date1', 'date2']].apply(pd.to_datetime)
df = df[ (df['date'] >=df['date1']) & (df['date'] <=df['date2'])]
df = df.groupby(['item','date1','date2']).agg({'quantity':'sum'}).reset_index()
  

output:

   item  date1        date2    quantity
0   1   2020-06-21  2020-06-28  17
1   2   2020-05-13  2020-05-24  48
2   3   2020-06-20  2020-06-28  33
Ajay Rawat
  • 258
  • 2
  • 6
  • thanks. Accepting the anser because it is what I asked for. The only problem I have is that df2 contains almost 100M records, so it is a bit slow. But I don't think I can avoid it. – Steven01123581321 Jan 28 '21 at 15:19
  • If your date columns are already in datetime format you can skip line 2, that may save some time. May be this can optimized further, feel free to ask new question for performance optimization. – Ajay Rawat Jan 28 '21 at 15:24
  • yes, I think I will do the first line in sql. The rest of the lines doesn't take much time. Thanks! – Steven01123581321 Jan 28 '21 at 15:42