
I'm joining two tables with data from two systems. A simple pandas merge between the two DataFrames won't honor more complex matching rules (unless I'm using it wrong or misunderstand what merge implements, which is very possible).

I've cobbled together a toy solution that unpacks the two DataFrames with itertuples, validates matches based on values, and then repacks the results into one DataFrame:

df1:            df2:
   A   X           B   Y
0  1  10        0  2  10
1  5  15        1  4  15
                2  6  15


import pandas as pd

data1 = [[1, 10], [5, 15]]
data2 = [[2, 10], [4, 15], [6, 15]]

df1 = pd.DataFrame(data1, columns=['A', 'X'])
df2 = pd.DataFrame(data2, columns=['B', 'Y'])
df3 = pd.DataFrame(index=['A', 'X', 'B', 'Y'])
i = -1

for rowA in df1.itertuples(index=False):
    i += 1
    A, X = rowA
    for rowB in df2.itertuples(index=False):
        B, Y = rowB
        if (B > A) & (X == Y):
            df3[i] = list(rowA + rowB)

print(df3.transpose())


   A   X  B   Y
0  1  10  2  10
1  5  15  6  15

My naive approach is inefficient.

The nested for loop is inefficient because I iterate over all of df2 for every row of df1. Once a row of df2 has been matched, it should be removed from consideration.
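One way to implement the "remove the row once matched" idea without restructuring the DataFrames is to track the indices of df2 rows already claimed and break out of the inner loop on the first match. This is a sketch against the toy data above, not a vectorized solution:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 5], 'X': [10, 15]})
df2 = pd.DataFrame({'B': [2, 4, 6], 'Y': [10, 15, 15]})

rows = []
used = set()  # positions of df2 rows already claimed by a match
for A, X in df1.itertuples(index=False):
    for j, (B, Y) in enumerate(df2.itertuples(index=False)):
        if j not in used and B > A and X == Y:
            rows.append((A, X, B, Y))
            used.add(j)
            break  # stop scanning df2 once this df1 row is matched

df3 = pd.DataFrame(rows, columns=['A', 'X', 'B', 'Y'])
print(df3)
#    A   X  B   Y
# 0  1  10  2  10
# 1  5  15  6  15
```

This is still O(n*m) in the worst case, but each matched df2 row is skipped on later passes, and the early `break` avoids overwriting a good match with a later one.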

//UPDATE (show the origin of my question)

The type of data I'm working with comes from two independent systems that share no keys or other serialized IDs. Since I can't make an exact match, I must rely on logical/arithmetic comparisons and a process of elimination.

In the following example, a simple pandas.merge fails on line 3 because Time1 < Time2 there (the candidate Time2 comes after Time1, violating the rule):

   Time1,               Total1 ... Time2,               Total2, error
1, 2017-02-19 08:03:00, 15.00  ... 2017-02-19 08:02:00,  15.00, 0
2, 2017-02-19 08:28:00, 33.00  ... 2017-02-19 08:27:00,  33.00, 0
3, 2017-02-19 08:40:00, 20.00  ... 2017-02-19 10:06:00,  20.00, 1
4, 2017-02-19 10:08:00, 20.00  ... 2017-02-19 10:16:00,  20.00, 1
[...]

What should happen is something like this:

   Time1,               Total1 ... Time2,               Total2, error
1, 2017-02-19 08:03:00, 15.00  ... 2017-02-19 08:02:00,  15.00, 0
2, 2017-02-19 08:28:00, 33.00  ... 2017-02-19 08:27:00,  33.00, 0
3, 2017-02-19 08:40:00, 20.00  ... NaN,                  NaN,   NaN
4, 2017-02-19 10:08:00, 20.00  ... 2017-02-19 10:06:00,  20.00, 0
[...]

// UPDATE2: I've worked through several permutations of the merge_asof() and join() approaches recommended in the answers. Each method was sorted as directed by the docs. Assuming I've implemented each correctly, the following are the percentages of True matches of the rule ((time1 >= time2) & (Total1 == Total2)) out of 53 records in my test set, for each of three methods:

| type                     | 'date' | 'total' | both  |
|--------------------------|--------|---------|-------|
| merge_asof sort (time)   | .7924  | .9245   | .7169 |
| merge_asof (time,total)  | .7735  | .6981   | .6226 |
| itertuples (time,total)  | .8301  | .8301   | .8301 |
| join ind (time)          | na     | na      | na    |

The join requires a shared key, right? The `on` clause in the documentation states, "Column(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple columns given, the passed DataFrame must have a MultiIndex."

I tried join with a multi-index of (time, total) and with just (time). The problem is that join clobbers whatever you join on: those index columns are merged into one, so there's nothing left to perform the error analysis on.
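One workaround for the clobbering is to build the MultiIndex from copies of the key columns, so the originals survive the index-on-index join and remain available for error analysis. A minimal sketch with exact-match keys (this does not implement the asof rule, only preserves the columns; the `time`/`total` index names are made up here):

```python
import pandas as pd

df1 = pd.DataFrame({'Time1': [1, 2], 'Total1': [15.0, 33.0]})
df2 = pd.DataFrame({'Time2': [1, 2], 'Total2': [15.0, 33.0]})

# Copy the key columns into the index; the originals stay as columns.
left = df1.assign(time=df1['Time1'], total=df1['Total1']).set_index(['time', 'total'])
right = df2.assign(time=df2['Time2'], total=df2['Total2']).set_index(['time', 'total'])

joined = left.join(right).reset_index(drop=True)
print(joined)
#    Time1  Total1  Time2  Total2
# 0      1    15.0      1    15.0
# 1      2    33.0      2    33.0
```

After `reset_index(drop=True)`, all four original columns are still present, so Time1 vs Time2 can be compared row by row.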

My naive itertuples solution (above) produced only perfect matches, but it still needs a collector for missed matches.

xtian
  • I've read this question multiple times and I'm still trying to understand where things like "total" come from. Given your original (pre-update) question, I would just do something like `pd.merge_asof(df1, df2, left_on='A', right_on='B', left_by='X', right_by='Y', allow_exact_matches=False, direction='forward')`. Note that the `direction` parameter won't come out until pandas 0.20.0, so you'll have to build pandas from source or sit tight for now. – chrisaycock Mar 21 '17 at 19:32
  • @chrisaycock, you have to use your imagination a bit to understand the relation between my naive algorithm and the original data I paraphrase briefly above. Total was just that: the total of a single transaction propagating between unintegrated transaction systems. To make my code easier to follow I chose A/B as the ordered pair and X/Y as the match. `merge_asof()` was not coming up in my searches before. The `direction` parameter is very important, and I just noticed the `tolerance` parameter, making `merge_asof` a great candidate for a solution. More tests. – xtian Mar 22 '17 at 02:09

2 Answers


`df3 = df1.join(df2)` does not do what you want?

dartdog
  • Nope. In the above example, a simple join would pair (5, 15) with (4, 15), but that fails the rule B > A, since 4 > 5 is false. – xtian Mar 15 '17 at 10:17

If I'm understanding your logic correctly, this should do it:

import pandas as pd

time1 = pd.to_datetime(['2/19/17 8:03:00', '2/19/17 8:28:00', '2/19/17 8:40:00', '2/19/17 10:08:00'])
time2 = pd.to_datetime(['2/19/17 8:02:00', '2/19/17 8:27:00', '2/19/17 10:06:00', '2/19/17 10:16:00'])

df1 = pd.DataFrame({'Time1': time1, 'Total1': [15.00, 33.00, 20.00, 20.00]})
df2 = pd.DataFrame({'Time2': time2, 'Total2': [15.00, 33.00, 20.00, 20.00], 'error': [0, 0, 1, 1]})

# Match each Time1 to the nearest earlier-or-equal Time2 (default backward direction).
df3 = pd.merge_asof(df1, df2, left_on='Time1', right_on='Time2')
# A right row can be claimed twice; blank out the repeat so each Time2 is used once.
df3.loc[df3['Time2'].duplicated(), ['Time2', 'Total2', 'error']] = None

Output:

                Time1  Total1               Time2  Total2  error
0 2017-02-19 08:03:00    15.0 2017-02-19 08:02:00    15.0    0.0
1 2017-02-19 08:28:00    33.0 2017-02-19 08:27:00    33.0    0.0
2 2017-02-19 08:40:00    20.0                 NaT     NaN    NaN
3 2017-02-19 10:08:00    20.0 2017-02-19 10:06:00    20.0    1.0
elPastor
  • TLDR: the jury is still deliberating; I've added some error-checking results above. It's rather odd how `merge_asof()` made many good total matches, but the errors in `time` matching brought the overall good matches down. Today I did the heavy work of getting each merge type to run; I have to return to the project later to review my methodology and make sure there are no mistakes. – xtian Mar 20 '17 at 21:24