Re-indexing dataframe on index with duplicate values

Question

DataFrame df1 and df2 are defined as follows:

>>> from datetime import datetime
>>> df1 = pd.DataFrame({'date':[datetime(2020,1,6), datetime(2020,1,8), datetime(2020,1,8), datetime(2020,1,8), datetime(2020,10,13), datetime(2021,5,27), datetime(2021,6,12), datetime(2021,7,1)], 'id':[1]*8})
>>> df2 = pd.DataFrame({'date':[datetime(2020,1,6), datetime(2020,1,8), datetime(2020,1,8), datetime(2020,10,13), datetime(2020,10,13), datetime(2020,10,13), datetime(2021,5,27)], 'id':[1]*7, 'qty':[100]*7, 'name':['abd']*7})
>>> df1
         date   id
0  2020-01-06    1
1  2020-01-08    1
2  2020-01-08    1
3  2020-01-08    1
4  2020-10-13    1
5  2021-05-27    1
6  2021-06-12    1
7  2021-07-01    1
>>> df2
        date  id  qty name
0 2020-01-06   1  100  abd
1 2020-01-08   1  100  abd
2 2020-01-08   1  100  abd
3 2020-10-13   1  100  abd
4 2020-10-13   1  100  abd
5 2020-10-13   1  100  abd
6 2021-05-27   1  100  abd

Although date is not unique, I want to reindex df2 on date column of df1. Any ideas how to do so? df2.set_index('date').reindex(df1.set_index('date').index) won't work as axis have duplicate values. In this case I desire following output:

        date  id  qty name
0 2020-01-06   1  100  abd
1 2020-01-08   1  100  abd
2 2020-01-08   1  100  abd
3 2020-01-08   1  nan  nan
4 2020-10-13   1  100  abd
5 2021-05-27   1  100  abd
6 2021-06-12   1  nan  nan
7 2021-07-01   1  nan  nan

It is not necessary that df1 and df2 are of same size or have same set of dates in date column. Note: Final output is same shape as df1.

Following could be one solution using a for loop but wondering if there is anything more vectorized for bigger DataFrames:

df1.loc[:, 'qty'] = np.nan
df1.loc[:, 'name'] = np.nan
for index, row in df1.iterrows():
    df_temp = df2[df2['date'] == row['date']].head(1)
    if not df_temp.empty:
       df1.loc[index, 'qty'] = df_temp['qty'].iloc[0]
       df1.loc[index, 'name'] = df_temp['name'].iloc[0]
       df2 = df2[df2.index != df_temp.index[0]]

score 0 · Answer 1 · edited Oct 28 '21 at 15:22

0

There doesn't seem to be a one liner for the result you want but I have a logic here that produces the intended result.

df2["new_date"] = df1["date"]
for index, row in df2.iterrows():
    if row["new_date"] != row["date"]:
        df2.loc[index, "qty"] = np.nan
        df2.loc[index, "name"] = np.nan
    
df2.drop(columns=["date"], inplace=True)
df2 = df2.rename(columns={"new_date":"date"})
df2 = df2[["date", "id", "qty", "name"]]
df2

*Update: Since the dataframes can have different lengths and dates

import pandas as pd
import numpy as np
from datetime import datetime
df1 = pd.DataFrame({'new_date':[datetime(2020,1,6), datetime(2020,1,8), datetime(2020,1,8), datetime(2020,1,8), datetime(2020,10,13), datetime(2021,5,27), datetime(2021,6,12), datetime(2021,7,1)], 'new_id':[1]*8})
df2 = pd.DataFrame({'date':[datetime(2020,1,6), datetime(2020,1,8), datetime(2020,1,8), datetime(2020,10,13), datetime(2020,10,13), datetime(2020,10,13), datetime(2021,5,27)], 'id':[1]*7, 'qty':[100]*7, 'name':['abd']*7})
df1

df2

df2 = pd.concat([df2, df1], axis=1)
df2

for index, row in df2.iterrows():
    if row["new_date"] != row["date"]:
        df2.loc[index, "qty"] = np.nan
        df2.loc[index, "name"] = np.nan
        df2.loc[index, "id"] = row["new_id"]
    
df2.drop(columns=["date"], inplace=True)
df2 = df2.rename(columns={"new_date":"date"})
df2 = df2[["date", "id", "qty", "name"]]
df2

edited Oct 28 '21 at 15:22

Nimantha

6,405
6
28
69

answered Mar 28 '21 at 20:29

Ananay Mital

1,395
1
11
16

Thanks @AnanayMital for your time. Apologies for not being so clear that `df1` and `df2` can be of different size and has different dates. please see the updated examples. – Gerry Mar 28 '21 at 20:48
In the updated question, are you sure that on 27th May 2021 in the final output the values won't be `nan`? Seems like it should be `nan` – Ananay Mital Mar 28 '21 at 21:05
yes. I have put together something using for loop. But wondering if its possible to do it in more optimal way. – Gerry Mar 28 '21 at 21:17
If the `qty` and `name` values on 27th May 2021 in the final output are not `nan`, then I am afraid I don't understand the logic of what you're trying to do. I am updating my answer to work with different lengths of dataframes. – Ananay Mital Mar 28 '21 at 21:25
the logic is simple, for given date in `df1` if an entry exist in `df2` use it to get `qty` and `name`. However, do no re-use any row in `df2` twice. Check the for-loop solution that I used - it gets the job done but too slow for large DataFrames. – Gerry Mar 28 '21 at 21:34

Re-indexing dataframe on index with duplicate values

1 Answers1