-1

I work in python and pandas.

Let's suppose that I have a dataframe like that (INPUT):

    A   B   C
0   2   8   6
1   5   2   5
2   3   4   9
3   5   1   1

I want to process it to finally get a new dataframe which looks like that (EXPECTED OUTPUT):

    A   B   C
0   2   7   NaN
1   5   1   1
2   3   3   NaN
3   5   0   NaN

To manage this I do the following:

columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)

df_2 = df_1
df_2['B'] -= 1
df_2['C'] = np.nan

df_2 looks like that for now:

    A   B   C
0   2   7   NaN
1   5   1   NaN
2   3   3   NaN
3   5   0   NaN

Now I want to do a matching/merging between df_1 and df_2 with using as keys the columns A and B.

I tried with isin() to do this:

df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]
df_2.iloc[df_temp.index] = df_temp

but it gives me back the same df_2 as before without matching the common row 5 1 1 for A, B, C respectively:

    A   B   C
0   2   7   NaN
1   5   1   NaN
2   3   3   NaN
3   5   0   NaN

How can I do this properly?

By the way, just to be clear, the matching should not be done like

  • 1st row of df1 - 1st row of df1
  • 2nd row of df1 - 2nd row of df2
  • 3rd row of df1 - 3rd row of df2
  • ...

But it has to be done as:

  • any row of df1 - any row of df2

based on the specified columns as keys.

I think that this is why isin() above at my code does not work since it does the filtering/matching in the former way.

On the other hand, .merge() can do the matching in the latter way but it does not preserve the order of the rows in the way I want and it is pretty tricky or inefficient to fix that.

Finally, keep in mind that with my actual dataframes way more than only 2 columns (e.g. 15) will be used as keys for the matching so it is better that you come up with something concise even for bigger dataframes.

P.S.

See my answer below.

Outcast
  • 4,967
  • 5
  • 44
  • 99
  • @G.Anderson, yes but the matching should not be done like 1st row of df1 - 1st row of df1, 2nd row of df1 - 2nd row of df2, 3rd row of df1 - 3rd row of df2, ... But it has to be done as any row of df1 - any row of df2 based on the specified columns. Does this help? It is more like a merging but with the order of the rows preserved in the way I show above. – Outcast Jul 17 '19 at 15:48
  • What's your expected output? – Quang Hoang Jul 17 '19 at 15:55

4 Answers4

0

Here's my suggestion using a lambda function in apply. Should be easily scalable to more columns to compare (just adjust cols_to_compare accordingly). By the way, when generating df_2, be sure to copy df_1, otherwise changes in df_2 will carry over to df_1 as well. So generating the data first:

columns = ['A', 'B', 'C']
data_1 = [[2, 5, 3, 5], [8, 2, 4, 1], [6, 5, 9, 1]]
data_1 = np.array(data_1).T
df_1 = pd.DataFrame(data=data_1, columns=columns)

df_2 = df_1.copy()  # Be sure to create a copy here
df_2['B'] -= 1
df_2['C'] = np.nan

an now we 'scan' df_1 for the rows of interest:

cols_to_compare = ['A', 'B']
df_2['C'] = df_2.apply(lambda x: 1 if any((df_1.loc[:, cols_to_compare].values[:]==x[cols_to_compare].values).all(1)) else np.nan, axis=1)

What is does is check whether the values in the current row are also like this in any row in the concerning columns of df_1. The output is:

   A  B    C
0  2  7  NaN
1  5  1  1.0
2  3  3  NaN
3  5  0  NaN
bexi
  • 1,186
  • 5
  • 9
0

Someone (I do not remember his username) suggested the following (which I think works) and then he deleted his post for some reason (??!):

df_2=df_2.set_index(['A','B'])

temp = df_1.set_index(['A','B'])

df_2.update(temp)

df_2.reset_index(inplace=True)
Outcast
  • 4,967
  • 5
  • 44
  • 99
-1

You can accomplish this using two for loops:

for row in df_2.iterrows():
    for row2 in df_1.iterrows():
        if [row[1]['A'],row[1]['B']] == [row2[1]['A'],row2[1]['B']]:
            df_2['C'].iloc[row[0]] = row2[1]['C']
Akaisteph7
  • 5,034
  • 2
  • 20
  • 43
-1

Just modify your below line:

df_temp = df_1[df_1[['A', 'B']].isin(df_2[['A', 'B']])]

with:

df_1[df_1['A'].isin(df_2['A']) & df_1['B'].isin(df_2['B'])]

It works fine!!

hacker315
  • 1,996
  • 2
  • 13
  • 23