2

I have 2 dataframes df1 and df2 which have the same number of rows but in a different ordering. The dataframes have an ID column which contains a unique identifier for each row, both dataframes have this ID column and this is the column for which I would like the order of the rows in df2 to match in df1. The reason I want the order to be the same is that the dataframes will be passed to sklearn train_test_split and I wish to split the dataframes in the exact same order. How can I do this?

Here is the first dataframe:

df1

Paper ID          Document       Label 
math.123.txt    df1 content 1      0
math.321.txt    df1 content 2      0
math.213.txt    df1 content 3      0

df2

Paper ID         Document        Label 
math.123.txt     df2 content 1         0
math.213.txt     df2 content 2         0
math.321.txt     df2 content 3         0

desired order of df2:

df2

Paper ID         Document            Label 
math.123.txt     df2 content 1         0
math.321.txt     df2 content 3         0
math.213.txt     df2 content 2         0

So essentially I just want to reorder the rows in df2 based on the order of rows in df1["Paper ID"]

Zein
  • 33
  • 7
  • `df2.loc[df1.index]` or `df2.reindex(df1.index)` if `df2` is missing some labels – Paul H Aug 06 '21 at 17:10
  • I tried both but when I try doing an equality check ```df1["Paper ID"].equals(df2["Paper ID"])``` I get false – Zein Aug 06 '21 at 17:13
  • I just checked df2 and there are rows with nan values in all columns after doing reindex, what is the reason for this? – Zein Aug 06 '21 at 17:19
  • is `Paper ID` your index? that's what I was assuming. Sounds like you need to use the `.sort_values` method – Paul H Aug 06 '21 at 17:32
  • The order of the rows in df1 based on Paper ID is what I am considering to be my index – Zein Aug 06 '21 at 17:38
  • but is it actually the index of the dataframe? in my original proposal, it needs to be – Paul H Aug 06 '21 at 17:57

2 Answers2

8

Since you want to order the dataframes according to the Paper ID, you should first set them as the index in both dataframes:

df1.set_index('Paper ID', inplace=True)
df2.set_index('Paper ID', inplace=True)

Now you can reindex df2 to match the order of df1:

df2 = df2.reindex(df1.index)

Finally, reset the indices to restore the default index:

df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
afsharov
  • 4,774
  • 2
  • 10
  • 27
3

You can use reindex using 'Paper ID' as index, here is a pipeline version:

df2 = (df2.set_index('Paper ID')
          .reindex(df1.set_index('Paper ID').index)
          .reset_index()
       )

Or, if your values are unique, you can try to perform a left join on a df1 with only the key column (untested):

pd.merge(df1[['Paper ID']], df2,
         how='left', on='Paper ID',
         sort=False
         )
mozway
  • 194,879
  • 13
  • 39
  • 75