DataFrame df1
and df2
are defined as follows:
>>> from datetime import datetime
>>> df1 = pd.DataFrame({'date':[datetime(2020,1,6), datetime(2020,1,8), datetime(2020,1,8), datetime(2020,1,8), datetime(2020,10,13), datetime(2021,5,27), datetime(2021,6,12), datetime(2021,7,1)], 'id':[1]*8})
>>> df2 = pd.DataFrame({'date':[datetime(2020,1,6), datetime(2020,1,8), datetime(2020,1,8), datetime(2020,10,13), datetime(2020,10,13), datetime(2020,10,13), datetime(2021,5,27)], 'id':[1]*7, 'qty':[100]*7, 'name':['abd']*7})
>>> df1
date id
0 2020-01-06 1
1 2020-01-08 1
2 2020-01-08 1
3 2020-01-08 1
4 2020-10-13 1
5 2021-05-27 1
6 2021-06-12 1
7 2021-07-01 1
>>> df2
date id qty name
0 2020-01-06 1 100 abd
1 2020-01-08 1 100 abd
2 2020-01-08 1 100 abd
3 2020-10-13 1 100 abd
4 2020-10-13 1 100 abd
5 2020-10-13 1 100 abd
6 2021-05-27 1 100 abd
Although date
is not unique, I want to reindex df2
on date
column of df1
. Any ideas how to do so?
df2.set_index('date').reindex(df1.set_index('date').index)
won't work as axis have duplicate values. In this case I desire following output:
date id qty name
0 2020-01-06 1 100 abd
1 2020-01-08 1 100 abd
2 2020-01-08 1 100 abd
3 2020-01-08 1 nan nan
4 2020-10-13 1 100 abd
5 2021-05-27 1 100 abd
6 2021-06-12 1 nan nan
7 2021-07-01 1 nan nan
It is not necessary that df1
and df2
are of same size or have same set of dates in date
column. Note: Final output is same shape as df1
.
Following could be one solution using a for loop but wondering if there is anything more vectorized for bigger DataFrames:
df1.loc[:, 'qty'] = np.nan
df1.loc[:, 'name'] = np.nan
for index, row in df1.iterrows():
df_temp = df2[df2['date'] == row['date']].head(1)
if not df_temp.empty:
df1.loc[index, 'qty'] = df_temp['qty'].iloc[0]
df1.loc[index, 'name'] = df_temp['name'].iloc[0]
df2 = df2[df2.index != df_temp.index[0]]