2

I have 2 non-indexed data frames, with the exact same data structure (same columns), for the sake of simplicity here's an example:

df1

0       1       2          3    4
JoeK    Joe     Kavanagh   120  joe.kavanagh@nomail.com
BarryD  Barry   Dempsy     11   barry.dempsy@nomail.com
OrlaF   Orla    Farrel     236  ofjk@nomail.com
SethB   Seth    Black      563  sblack@nomail.com
KateW   Kate    White      254  kw12@nomail.com

and the second one:

df2

0       1       2          3    4
JoeK    Joe     Kavanagh   110  jkavanagh@nomail.com
BarryD  Barry   Dempsy     11   barry.dempsy@nomail.com
JimmyS  Jimmy   Smith      250  j.Smith@nomail.com
SethB   Seth    Blake      563  sblack@nomail.com

I would like to have a resulting dataframes representing the rows in df1 that aren't in df2:

0       1       2          3    4
JoeK    Joe     Kavanagh   120  joe.kavanagh@nomail.com
OrlaF   Orla    Farrel     236  ofjk@nomail.com
SethB   Seth    Black      563  sblack@nomail.com
KateW   Kate    White      254  kw12@nomail.com

Notice that the first row (Joek, and SethB) exist in both dataframes, but the value of column 3 for JoeK and the value of column 2 for SethB have changed, that's why they're in the final result set.

Any help is much appreciated.

Riki Lyng
  • 317
  • 4
  • 19

1 Answers1

1

IIUC it looks like you're just trying to match on cols 0,1,2,3, you can perform a left merge and pass param indicator=True if you're using a recent version of pandas and then filter the df:

In [197]:
merged = df1.merge(df2, how='left', on=['0','1','2','3'],indicator=True)
merged[merged['_merge'] == 'left_only']

Out[197]:
       0     1         2    3                      4_x  4_y     _merge
0   JoeK   Joe  Kavanagh  120  joe.kavanagh@nomail.com  NaN  left_only
2  OrlaF  Orla    Farrel  236          ofjk@nomail.com  NaN  left_only
3  SethB  Seth     Black  563        sblack@nomail.com  NaN  left_only
4  KateW  Kate     White  254          kw12@nomail.com  NaN  left_only
EdChum
  • 376,765
  • 198
  • 813
  • 562
  • I'm trying to match on all columns here. that was a sample, the dataframe I have has 60+ columns. So I would use on = df1.columns.tolist() – Riki Lyng Feb 11 '16 at 16:24
  • By default it will match on all columns so in your case you shouldn't need to pass the `on` param values unless you have a different requirement – EdChum Feb 11 '16 at 16:25
  • This should work `merged = df1.merge(df2, how='left', indicator=True)` I think, also you can upvote – EdChum Feb 11 '16 at 16:43