Reordering rows in a dataframe to match order of rows in another dataframe

Question

I have 2 dataframes df1 and df2 which have the same number of rows but in a different ordering. The dataframes have an ID column which contains a unique identifier for each row, both dataframes have this ID column and this is the column for which I would like the order of the rows in df2 to match in df1. The reason I want the order to be the same is that the dataframes will be passed to sklearn train_test_split and I wish to split the dataframes in the exact same order. How can I do this?

Here is the first dataframe:

df1

Paper ID          Document       Label 
math.123.txt    df1 content 1      0
math.321.txt    df1 content 2      0
math.213.txt    df1 content 3      0

df2

Paper ID         Document        Label 
math.123.txt     df2 content 1         0
math.213.txt     df2 content 2         0
math.321.txt     df2 content 3         0

desired order of df2:

df2

Paper ID         Document            Label 
math.123.txt     df2 content 1         0
math.321.txt     df2 content 3         0
math.213.txt     df2 content 2         0

So essentially I just want to reorder the rows in df2 based on the order of rows in df1["Paper ID"]

`df2.loc[df1.index]` or `df2.reindex(df1.index)` if `df2` is missing some labels — Paul H, Aug 06 '21 at 17:10
I tried both but when I try doing an equality check ```df1["Paper ID"].equals(df2["Paper ID"])``` I get false — Zein, Aug 06 '21 at 17:13
I just checked df2 and there are rows with nan values in all columns after doing reindex, what is the reason for this? — Zein, Aug 06 '21 at 17:19
is `Paper ID` your index? that's what I was assuming. Sounds like you need to use the `.sort_values` method — Paul H, Aug 06 '21 at 17:32
The order of the rows in df1 based on Paper ID is what I am considering to be my index — Zein, Aug 06 '21 at 17:38
but is it actually the index of the dataframe? in my original proposal, it needs to be — Paul H, Aug 06 '21 at 17:57

score 8 · Accepted Answer · answered Aug 06 '21 at 23:13

8

Since you want to order the dataframes according to the Paper ID, you should first set them as the index in both dataframes:

df1.set_index('Paper ID', inplace=True)
df2.set_index('Paper ID', inplace=True)

Now you can reindex df2 to match the order of df1:

df2 = df2.reindex(df1.index)

Finally, reset the indices to restore the default index:

df1.reset_index(inplace=True)
df2.reset_index(inplace=True)

answered Aug 06 '21 at 23:13

afsharov

4,774
2
10
27

I agree, probably the simplest (+1), however you don't need to reset the index for both dataframes in place, see my answer. – mozway Aug 07 '21 at 03:05
Amazing, and elegant solution, thank you! – Zein Aug 07 '21 at 16:47

score 3 · Answer 2 · answered Aug 07 '21 at 03:02

3

You can use reindex using 'Paper ID' as index, here is a pipeline version:

df2 = (df2.set_index('Paper ID')
          .reindex(df1.set_index('Paper ID').index)
          .reset_index()
       )

Or, if your values are unique, you can try to perform a left join on a df1 with only the key column (untested):

pd.merge(df1[['Paper ID']], df2,
         how='left', on='Paper ID',
         sort=False
         )

answered Aug 07 '21 at 03:02

mozway

194,879
13
39
75

I see, the method chaining for reindexing actually accounts for a slicker solution. +1 – afsharov Aug 07 '21 at 07:36
1

Also a great solution, thank you both @mozway @afsharov! – Zein Aug 07 '21 at 16:48

Reordering rows in a dataframe to match order of rows in another dataframe

2 Answers2