
I have two pandas DataFrames of unequal sizes. For example:

Df1
id     value
a      2
b      3
c      22
d      5 

Df2 
id     value
c      22
a      2

Now I want to extract from DF1 those rows which have the same id as in DF2. My first approach is to run two for loops, something like:

x = []
for i in range(len(DF2)):
    for j in range(len(DF1)):
        if DF2['id'][i] == DF1['id'][j]:
            x.append(DF1.iloc[j])

This works, but with 400,000 lines in one file and 5,000 in the other it is far too slow. I need an efficient, Pythonic pandas way.

Ayan Mitra
  • Use `join` to join the datasets on the `id` column. After the join you will get some rows with NaN values — drop those with the `dropna` method. Then filter the joined data, probably with `pd.apply`. – 404pio Feb 11 '20 at 10:16
  • Hi @404pio, this suggestion worked wonderfully for me. If I could, I would have made this the answer. – Ayan Mitra Feb 20 '20 at 10:57
  • You can formulate answer to your question: https://stackoverflow.com/help/self-answer - it is kind of "knowledge base". – 404pio Feb 20 '20 at 11:31
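The join-based suggestion in the comment can be sketched as an inner merge, which keeps only the ids present in both frames (a sketch using the question's sample data; `merge` is used here rather than `DataFrame.join`, which would require `id` to be set as the index):

```python
import pandas as pd

# Sample data mirroring the question's Df1/Df2
df1 = pd.DataFrame({'id': ['a', 'b', 'c', 'd'], 'value': [2, 3, 22, 5]})
df2 = pd.DataFrame({'id': ['c', 'a'], 'value': [22, 2]})

# Inner merge keeps only rows whose id appears in both frames.
# Selecting just the 'id' column from df2 avoids duplicated
# 'value_x'/'value_y' columns in the result.
matched = df1.merge(df2[['id']], on='id')
```

With an inner merge there are no NaN rows to drop, so the `dropna` step from the comment is not needed.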

3 Answers

import pandas as pd

data1={'id':['a','b','c','d'],
       'value':[2,3,22,5]}

data2={'id':['c','a'],
       'value':[22,2]}

df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
finaldf=pd.concat([df1,df2],ignore_index=True)

Output after concat

   id   value
0   a   2
1   b   3
2   c   22
3   d   5
4   c   22
5   a   2

Final output

finaldf.drop_duplicates()

    id  value
0   a   2
1   b   3
2   c   22
3   d   5
The Guy

You can concat the dataframes, mark the rows whose `id` is duplicated across both frames, then drop_duplicates to keep just the first occurrence:

m = pd.concat((df1,df2))
m[m.duplicated('id',keep=False)].drop_duplicates()

  id  value
0  a      2
2  c     22
anky

You can try this:

df = df1[df1.set_index(['id']).index.isin(df2.set_index(['id']).index)]
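An equivalent and slightly simpler form of the same membership test skips `set_index` and applies `isin` to the column directly (a sketch using the question's sample data):

```python
import pandas as pd

# Sample data matching the question
df1 = pd.DataFrame({'id': ['a', 'b', 'c', 'd'], 'value': [2, 3, 22, 5]})
df2 = pd.DataFrame({'id': ['c', 'a'], 'value': [22, 2]})

# Boolean mask: True where df1's id appears anywhere in df2's id column
df = df1[df1['id'].isin(df2['id'])]
```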
python_user