2

I've 2 DataFrames. One for Invoice Data and the other for Promotion Dates.

Invoice Data DataFrame

+------------+-------------+---------------+
| LocationNo | InvoiceDate | InvoiceAmount |
+------------+-------------+---------------+
|      A     |  01-Jul-20  |       79      |
+------------+-------------+---------------+
|      B     |  01-Jul-20  |       72      |
+------------+-------------+---------------+
|      C     |  01-Jul-20  |       24      |
+------------+-------------+---------------+
|      A     |  02-Jul-20  |       68      |
+------------+-------------+---------------+
|      B     |  02-Jul-20  |       6       |
+------------+-------------+---------------+
|      C     |  02-Jul-20  |       27      |
+------------+-------------+---------------+
|      A     |  03-Jul-20  |       25      |
+------------+-------------+---------------+
|      B     |  03-Jul-20  |       62      |
+------------+-------------+---------------+
|      C     |  03-Jul-20  |       58      |
+------------+-------------+---------------+
|      D     |  03-Jul-20  |       36      |
+------------+-------------+---------------+
|      E     |  03-Jul-20  |       65      |
+------------+-------------+---------------+
|      F     |  03-Jul-20  |       81      |
+------------+-------------+---------------+

df_1 = pd.DataFrame({
    'LocationNo':['A','B','C','A','B','C','A','B','C','D','E','F'],
    'InvoiceDate':['01-Jul-20','01-Jul-20','01-Jul-20','02-Jul-20','02-Jul-20','02-Jul-20',
                  '03-Jul-20','03-Jul-20','03-Jul-20','03-Jul-20','03-Jul-20','03-Jul-20'],
    'InvoiceAmount':[79,72,24,68,6,27,25,62,58,36,65,81]
})

Promotion Dates DataFrame

+------------+----------------+--------------+
| LocationNo | PromotionStart | PromotionEnd |
+------------+----------------+--------------+
|      A     |    01-Jul-20   |   02-Jul-20  |
+------------+----------------+--------------+
|      B     |    02-Jul-20   |   03-Jul-20  |
+------------+----------------+--------------+
|      C     |    03-Jul-20   |   05-Jul-20  |
+------------+----------------+--------------+
|      D     |    01-Jul-20   |   05-Jul-20  |
+------------+----------------+--------------+
|      E     |    02-Jul-20   |   02-Jul-20  |
+------------+----------------+--------------+
|      F     |    05-Jul-20   |   06-Jul-20  |
+------------+----------------+--------------+

df_2 = pd.DataFrame({
    'LocationNo' : ['A','B','C','D','E','F'],
    'PromotionStart':['01-Jul-20','02-Jul-20','03-Jul-20','01-Jul-20','02-Jul-20','05-Jul-20'],
    'PromotionEnd':['02-Jul-20','03-Jul-20','05-Jul-20','05-Jul-20','02-Jul-20','06-Jul-20',]
})

My Task is to Merge both the DataFrames based on 2 conditions & add value Yes to a new column Promotion if these conditions are met.

conditions are as follows

  1. LocationNo should match
  2. The InvoiceDate should be between the respective LocationNo's PromotionStart and PromotionEnd.

Basically, to merge only If the InvoiceDate falls between PromotionStart and PromotionEnd dates.

My desired output is as below

+------------+-------------+---------------+-----------+
| LocationNo | InvoiceDate | InvoiceAmount | Promotion |
+------------+-------------+---------------+-----------+
|      A     |  01-Jul-20  |       79      |    Yes    |
+------------+-------------+---------------+-----------+
|      B     |  01-Jul-20  |       72      |           |
+------------+-------------+---------------+-----------+
|      C     |  01-Jul-20  |       24      |           |
+------------+-------------+---------------+-----------+
|      A     |  02-Jul-20  |       68      |    Yes    |
+------------+-------------+---------------+-----------+
|      B     |  02-Jul-20  |       6       |    Yes    |
+------------+-------------+---------------+-----------+
|      C     |  02-Jul-20  |       27      |           |
+------------+-------------+---------------+-----------+
|      A     |  03-Jul-20  |       25      |           |
+------------+-------------+---------------+-----------+
|      B     |  03-Jul-20  |       62      |    Yes    |
+------------+-------------+---------------+-----------+
|      C     |  03-Jul-20  |       58      |    Yes    |
+------------+-------------+---------------+-----------+
|      D     |  03-Jul-20  |       36      |    Yes    |
+------------+-------------+---------------+-----------+
|      E     |  03-Jul-20  |       65      |           |
+------------+-------------+---------------+-----------+
|      F     |  03-Jul-20  |       81      |           |
+------------+-------------+---------------+-----------+

I know how to merge columns based on multiple conditions. But, in this case, I need to merge within Date Range.

kindly, let me know how should i approach to solve this.

Tommy
  • 515
  • 4
  • 13

1 Answers1

3

Let's try merge and filter:

out = df_1.merge(df_2, on='LocationNo', how='left')

df_1['Promotion'] = np.where(out['InvoiceDate'].between(out['PromotionStart'], out['PromotionEnd']),
                             'Yes', '')

Output:

   LocationNo InvoiceDate  InvoiceAmount Promotion
0           A   01-Jul-20             79       Yes
1           B   01-Jul-20             72          
2           C   01-Jul-20             24          
3           A   02-Jul-20             68       Yes
4           B   02-Jul-20              6       Yes
5           C   02-Jul-20             27          
6           A   03-Jul-20             25          
7           B   03-Jul-20             62       Yes
8           C   03-Jul-20             58       Yes
9           D   03-Jul-20             36       Yes
10          E   03-Jul-20             65          
11          F   03-Jul-20             81          
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • 1
    The problem with this approach is that it won't be practical for large datasets since the merge might create thousands of df_1 duplications – Niv Cohen Aug 12 '21 at 12:44