7

Now i need to merge two dataframe with the condition greater than(>=). But merge only support equal. Is there any way to deal with it? Thanks!

J.Bao
  • 71
  • 1
  • 2
  • 3
    I don't think that's true, you can index and slice dataframes using any condition you want. Can you explain your question? How does your frames look, how do you want them to look etc.. – umutto Mar 28 '17 at 02:02
  • Thanks for your reply. For example, I have dataframe A and B, I need to inner merge them with the condition that A.Column1 >= B.Column1 and A.Column1 <= B.Column2. In SQL this is very easy: select * from A inner join B on A.Column1 >= B.Column1 and A.Column1 <= B.Column2, but in pandas I can't find any way to deal with it:( – J.Bao Mar 28 '17 at 08:27
  • In newer versions of pandas an interval index could be helpful – sammywemmy Oct 10 '21 at 12:19

2 Answers2

5

I don't know how to achieve the following with similar merge and join syntax in pandas,

SELECT * 
FROM a 
INNER JOIN b 
ON a.column1 >= b.column1 AND a.column1 <= b.column2 

But the query above can also be written implicitly as;

SELECT * 
FROM a, b 
WHERE a.column1 >= b.column1 AND a.column1 <= b.column2 

Which is basically the old syntax and should do exactly same (performance wise). It takes the cartesian product of 2 tables (or cross join) and then select from that using the WHERE condition, which could be easily implemented in pandas. This could be a little heavy on memory, but should be fast.

First the FROM a, b clause (we temporarily assign a column with same values in all rows, so we can cross join over it);

df = pd.merge(a.assign(key=0), b.assign(key=0), on='key').drop('key', axis=1)

and then use boolean indexing (our WHERE clause) to slice the frame;

df[(df["column1_x"] >= df["column1_y"]) & (df["column1_x"] <= df["column2_y"])]

If you don't want the cartesian product and only want to compare the rows on same index of both tables, you can merge on index like this;

df = a.merge(b, left_index = True, right_index = True)

or concat on axis 1 if they are same length;

df = pd.concat([a, b], axis=1)

And use boolean indexing again to eliminate results;

df[(df["column1_x"] >= df["column1_y"]) & (df["column1_x"] <= df["column2_y"])]
umutto
  • 7,460
  • 4
  • 43
  • 53
  • simple, ellegant and well-explained solution, though it can be slightly memory-overheading. – enoted Sep 14 '20 at 16:38
1

pd.merge_asof() does exactly what you ask for, right?

events is a dataframe that contains certain events - and ranges is another dataframe containing timeframes where you want to put your stuff

import pandas as pd

events = pd.DataFrame([[1], [5], [8], [10]], 
                      columns=['event_time'])
ranges = pd.DataFrame([[0, 5], [5, 7], [7, 10]], 
                      columns=['start_time', 'end_time'])

df = pd.merge_asof(events,
                   ranges,
                   left_on='event_time',
                   right_on='start_time',
                   allow_exact_matches=True)

Here how it looks

print(events)
   event_time
0           1
1           5
2           8
3          10


print(ranges)
   start_time  end_time
0           0         5
1           5         7
2           7        10


print(df)
   event_time  start_time  end_time
0           1           0         5
1           5           5         7
2           8           7        10
3          10           7        10
MattiH
  • 554
  • 5
  • 9