
I need to compare two datasets:

DF1

       Subj             1           2           3
0   Biotech   Cell culture     Bioinfo  Immunology
1   Zoology   Cell culture  Immunology         NaN
2      Math   Trigonometry     Algebra         NaN
3  Microbio        Biotech         NaN         NaN
4   Physics         Optics         NaN         NaN

DF2

       Subj             1           2           
0   Biotech       Bioinfo  Immunology         
1   Zoology    Immunology      Botany                  
2  Microbio         NaN           NaN         
3   Physics        Optics  Quantumphy
4      Math  Trigonometry         NaN         

What I want my result dataframe to look like:

       Subj             1           2          
0   Biotech       Bioinfo  Immunology         
1   Zoology    Immunology         NaN         
2      Math  Trigonometry         NaN         
3   Physics        Optics         NaN         

I can't check row by row because the datasets are huge. The number of columns varies between the two datasets, but they have the same number of rows. Since the order of the elements within a row also varies, I can't simply use merge(). I tried the compare function, but it either removes all of the common elements or builds a dataframe containing both; I can't seem to pick out just the common elements.
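
For reference, here is a minimal sketch that rebuilds the sample frames above (assuming the numbered column labels are the strings '1', '2', '3' and the blanks are NaN):

import pandas as pd
import numpy as np

# DF1 from the question
df1 = pd.DataFrame({
    'Subj': ['Biotech', 'Zoology', 'Math', 'Microbio', 'Physics'],
    '1': ['Cell culture', 'Cell culture', 'Trigonometry', 'Biotech', 'Optics'],
    '2': ['Bioinfo', 'Immunology', 'Algebra', np.nan, np.nan],
    '3': ['Immunology', np.nan, np.nan, np.nan, np.nan],
})

# DF2 from the question
df2 = pd.DataFrame({
    'Subj': ['Biotech', 'Zoology', 'Microbio', 'Physics', 'Math'],
    '1': ['Bioinfo', 'Immunology', np.nan, 'Optics', 'Trigonometry'],
    '2': ['Immunology', 'Botany', np.nan, 'Quantumphy', np.nan],
})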

2 Answers


You can take the columns the two dataframes have in common, merge on them, then set the subject column as the index and drop the all-NaN rows:

match = df2.columns.intersection(df1.columns).tolist()

(df2.merge(df1, on=match, how='left')
    .reindex(df2.columns, axis=1)
    .set_index('Subj')
    .dropna(how='all'))

which returns:

                    1           2
Subj                             
Biotech       Bioinfo  Immunology
Zoology    Immunology         NaN
Math     Trigonometry         NaN
Physics        Optics         NaN
  • I tried this, but it causes row elements of the second dataframe to show up in the result dataframe even when there is nothing in df1 to compare them with. – Mimikyu o_0 Nov 16 '22 at 06:03

Here is one way to do it.

Understanding: the number of columns varies, and the values in the two DFs are not necessarily under the same column.

# stack both DFs after setting Subj as the index
# this changes the wide format to a long format
# then concat the two stacked DFs to form a new DF

df3 = pd.concat(
    [df1.set_index('Subj').stack().reset_index().rename(columns={0: 'val'}),
     df2.set_index('Subj').stack().reset_index().rename(columns={0: 'val'})]
).reset_index()
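
With the sample data from the question, df3 would look roughly like this (a sketch; the outer positional index from the final reset_index becomes the 'index' column):

df3.head()
   index     Subj level_1           val
0      0  Biotech       1  Cell culture
1      1  Biotech       2       Bioinfo
2      2  Biotech       3    Immunology
3      3  Zoology       1  Cell culture
4      4  Zoology       2    Immunology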


# topics that a subject has in both DFs appear twice in the
# concatenated frame, so they show up as duplicate rows

# find the rows duplicated on Subj and topic (the val column),
# group them by Subj and aggregate into a comma-separated string,
# then split on the comma to create new columns

out = (df3[df3.duplicated(subset=['Subj', 'val'])]
       .groupby('Subj')['val']
       .agg(','.join)
       .str.split(',', expand=True)
       .reset_index())
out
      Subj             0           1
0  Biotech       Bioinfo  Immunology
1     Math  Trigonometry        None
2  Physics        Optics        None
3  Zoology    Immunology        None
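
If you want the result to match the layout shown in the question (value columns named 1, 2, ... instead of 0, 1, ...), a small rename on top of out should do it; this is just a sketch, and the None vs NaN difference is cosmetic:

# rename the split columns 0, 1, ... to '1', '2', ... to mirror the desired output
out = out.rename(columns={i: str(i + 1) for i in range(out.shape[1] - 1)})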
Naveed
  • I tried this, but it causes all the elements of both dfs to end up split across the columns of the result df. For example, Biotech comes up three times with all of its values: Bioinfo, Immunology and Cell culture :/ – Mimikyu o_0 Nov 16 '22 at 06:07
  • @Mimikyuo_0 So it worked on the provided sample but not on the actual data? Can you expand the example in the question so that the issue is reproduced? That helps in understanding what's missing in the solution. – Naveed Nov 16 '22 at 10:50
  • Okay, I'm sorry! It seems I got confused: I printed out df3 instead of 'out'! Thank you so much! This works!!! – Mimikyu o_0 Nov 16 '22 at 17:00