0

I am trying to do a simple intersection of dataframes, but no matter the method (pd.merge or pd.concat) I always endup with duplicated columns.

Say the two dataframes are:

df1 = pd.DataFrame(
    data = np.stack([
        np.arange(4),
        np.random.rand(4),
        np.random.rand(4),
    ],axis=-1),
    columns = ["A","B","C"]
)
df2 = pd.DataFrame(
    data = np.stack([
        np.arange(4)+2,
        np.random.rand(4),
        np.random.rand(4),
    ],axis=-1),
    columns = ["A","B","C"]
)

df1,df2
(     A         B         C
 0  0.0  0.165849  0.516438
 1  1.0  0.116661  0.395433
 2  2.0  0.758229  0.386642
 3  3.0  0.057719  0.802317,
      A         B         C
 0  2.0  0.899924  0.545800
 1  3.0  0.792242  0.187410
 2  4.0  0.332168  0.778160
 3  5.0  0.164429  0.509884)

The result I want to achieve is this:

    Source     A         B         C 
 2  df1        2.0  0.758229  0.386642
 3  df1        3.0  0.057719  0.802317
 0  df2        2.0  0.899924  0.545800
 1  df2        3.0  0.792242  0.187410

I tried with pd.concat, but I have duplicated columns:

# Try with pd.concat inner
for df in [df1,df2]:
    df.set_index("A",inplace=True)

df = pd.concat([df1,df2],join="inner",axis=1).reset_index()
df
    A   B   C   B   C
0   2.0 0.758229    0.386642    0.899924    0.54580
1   3.0 0.057719    0.802317    0.792242    0.18741

And pd.merges also gives duplicated columns:

df = pd.merge(
    left = df1,
    right = df2,
    how="inner",
    on = "A"
)
df
    A   B_x C_x B_y C_y
0   2.0 0.758229    0.386642    0.899924    0.54580
1   3.0 0.057719    0.802317    0.792242    0.18741

Anyone has a clean way to do this in pandas ?

Thanks

carlitador
  • 47
  • 9

1 Answers1

0

You can use isin and then concat:

pd.concat([df1[df1.A.isin(df2.A)].assign(source='df1'),
           df2[df2.A.isin(df1.A)].assign(source='df2')])

Example:

import pandas as pd

df1 = pd.DataFrame({'A': [0,1,2,3], 'B': [1,2,3,4], 'C': [5,6,7,8]})
df2 = pd.DataFrame({'A': [2,3,4,5], 'B': [11,12,13,14], 'C': [15,16,17,18]})
pd.concat([df1[df1.A.isin(df2.A)].assign(source='df1'), df2[df2.A.isin(df1.A)].assign(source='df2')])

Result:

   A   B   C source
2  2   3   7    df1
3  3   4   8    df1
0  2  11  15    df2
1  3  12  16    df2
Stef
  • 28,728
  • 2
  • 24
  • 52
  • Thank you, it works. However I would need something general so I can use it for n >> 1 dataframes. – carlitador Jul 12 '22 at 07:15
  • for n > 2 dataframes you can first get the intersecting values of columns `A` of all dataframes in a loop using python `set` operations or numpy `intersect1d` and then concat the dataframes sliced with this intersection. – Stef Jul 12 '22 at 10:19