Merging Pandas DataFrames within a specific Date Range

I want to use pyjanitor, following this post, but it doesn't show how to merge on another column at the same time.

I tried:

df_1.conditional_join(
    df_2,
    ("InvoiceDate", "PromotionStart", ">="),
    ("InvoiceDate", "PromotionEnd", "<=")
)

Desired output:

+------------+-------------+---------------+-----------+
| LocationNo | InvoiceDate | InvoiceAmount | Promotion |
+------------+-------------+---------------+-----------+
|      A     |  01-Jul-20  |       79      |    Yes    |
+------------+-------------+---------------+-----------+
|      B     |  01-Jul-20  |       72      |           |
+------------+-------------+---------------+-----------+
|      C     |  01-Jul-20  |       24      |           |
+------------+-------------+---------------+-----------+
|      A     |  02-Jul-20  |       68      |    Yes    |
+------------+-------------+---------------+-----------+
|      B     |  02-Jul-20  |       6       |    Yes    |
+------------+-------------+---------------+-----------+
|      C     |  02-Jul-20  |       27      |           |
+------------+-------------+---------------+-----------+
|      A     |  03-Jul-20  |       25      |           |
+------------+-------------+---------------+-----------+
|      B     |  03-Jul-20  |       62      |    Yes    |
+------------+-------------+---------------+-----------+
|      C     |  03-Jul-20  |       58      |    Yes    |
+------------+-------------+---------------+-----------+
|      D     |  03-Jul-20  |       36      |    Yes    |
+------------+-------------+---------------+-----------+
|      E     |  03-Jul-20  |       65      |           |
+------------+-------------+---------------+-----------+
|      F     |  03-Jul-20  |       81      |           |
+------------+-------------+---------------+-----------+
Jack
  • What output are you trying to get? – Henry Ecker Sep 20 '22 at 03:14
  • @HenryEcker I update. – Jack Sep 20 '22 at 03:22
  • 2
    I _think_ you're looking for something like `df_1.conditional_join(df_2, ('LocationNo', 'LocationNo', '=='), ('InvoiceDate', 'PromotionStart', '>='), ('InvoiceDate', 'PromotionEnd', '<='), how='left')` but I'm not sure why you'd use a `conditional_join` here instead of a standard merge followed by where (as in [the answer](/a/63890093/15497888) on the linked post). – Henry Ecker Sep 20 '22 at 03:26

1 Answer


With conditional_join you need to include another tuple for the equi join:

import pandas as pd
import janitor

df_1['InvoiceDate'] = pd.to_datetime(df_1['InvoiceDate'])
df_2['PromotionStart'] = pd.to_datetime(df_2['PromotionStart'])
df_2['PromotionEnd'] = pd.to_datetime(df_2['PromotionEnd'])

(df_1
.conditional_join(
    df_2, 
    ('LocationNo', 'LocationNo', '=='), 
    ('InvoiceDate', 'PromotionStart', '>='), 
    ('InvoiceDate', 'PromotionEnd', '<='))
)
        left                                right
  LocationNo InvoiceDate InvoiceAmount LocationNo PromotionStart PromotionEnd
0          A  2020-07-01            79          A     2020-07-01   2020-07-02
1          A  2020-07-02            68          A     2020-07-01   2020-07-02
2          B  2020-07-02             6          B     2020-07-02   2020-07-03
3          B  2020-07-03            62          B     2020-07-02   2020-07-03
4          C  2020-07-03            58          C     2020-07-03   2020-07-05
5          D  2020-07-03            36          D     2020-07-01   2020-07-05

As @HenryEcker rightly pointed out, your question can be solved by merging, then filtering; for your use case, conditional_join might be overkill. For equi joins, what conditional_join does is intercept the indices generated by pandas' internal merge function (a hash-based implementation) and then apply the non-equi conditions before building the final dataframe. Unlike strictly non-equi joins, no sorting is done, since hash merges are usually faster for equi joins (caveat: R's data.table uses a form of binary search for its joins and is quite fast, often even faster than pandas). For large dataframes it might come in handy with some performance gains; your mileage may vary.
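For completeness, here is a sketch of that merge-then-filter approach in plain pandas. The df_1/df_2 values are my reconstruction from the tables above (treat them as an assumption); how='left' keeps invoices with no matching promotion, so the blank Promotion rows from the desired output survive:

```python
import numpy as np
import pandas as pd

# data reconstructed from the tables in the question/answer (an assumption)
df_1 = pd.DataFrame({
    'LocationNo': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'D', 'E', 'F'],
    'InvoiceDate': ['01-Jul-20'] * 3 + ['02-Jul-20'] * 3 + ['03-Jul-20'] * 6,
    'InvoiceAmount': [79, 72, 24, 68, 6, 27, 25, 62, 58, 36, 65, 81],
})
df_2 = pd.DataFrame({
    'LocationNo': ['A', 'B', 'C', 'D'],
    'PromotionStart': ['01-Jul-20', '02-Jul-20', '03-Jul-20', '01-Jul-20'],
    'PromotionEnd': ['02-Jul-20', '03-Jul-20', '05-Jul-20', '05-Jul-20'],
})
df_1['InvoiceDate'] = pd.to_datetime(df_1['InvoiceDate'], format='%d-%b-%y')
for col in ('PromotionStart', 'PromotionEnd'):
    df_2[col] = pd.to_datetime(df_2[col], format='%d-%b-%y')

# equi-merge on LocationNo, then flag rows inside the promotion window;
# comparisons against NaT (unmatched locations) evaluate to False
merged = df_1.merge(df_2, on='LocationNo', how='left')
in_promo = merged['InvoiceDate'].between(merged['PromotionStart'],
                                         merged['PromotionEnd'])
out = (merged
       .assign(Promotion=np.where(in_promo, 'Yes', ''))
       .drop(columns=['PromotionStart', 'PromotionEnd']))
```

This reproduces the Yes/blank Promotion column from the question, assuming at most one promotion window per location (otherwise the left merge would duplicate invoice rows and you would need to deduplicate).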

Let's look at a silly example with a million rows; the code below is based on the dev version:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

import numpy as np

np.random.seed(3)
df = pd.DataFrame({'start':np.random.randint(100_000, size=1_000_000),
                   'end':np.random.randint(100_000, size=1_000_000)})
dd = pd.DataFrame({'ID':np.random.randint(100_000, size=1_500_000)})

df.head()
   start    end
0  71530  85703
1  67224  37802
2  77049    652
3  59011  99059
4  48056  26108

dd.head()
      ID
0  25816
1  92958
2  62607
3  89684
4  13434


%timeit df.merge(dd, left_on='start', right_on = 'ID').loc[lambda df: df.end >= df.ID]
1.27 s ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.conditional_join(dd, ('start', 'ID', '=='), ('end', 'ID', '>='))
597 ms ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# verify both dataframes are the same
cond_join = df.conditional_join(dd, ('start', 'ID', '=='), ('end', 'ID', '>='))

pd_merge = df.merge(dd, left_on='start', right_on = 'ID').loc[lambda df: df.end >= df.ID]

pd_merge.reset_index(drop=True).equals(cond_join)
True


# memory consumption
In [167]: %load_ext memory_profiler

In [168]: %memit df.conditional_join(dd, ('start', 'ID', '=='), ('end', 'ID', '>='))
peak memory: 1599.06 MiB, increment: 173.16 MiB

In [169]: %memit df.merge(dd, left_on='start', right_on = 'ID').loc[lambda df: df.end >= df.ID]
peak memory: 2207.29 MiB, increment: 624.08 MiB
sammywemmy