4

I have two dataframes like this:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {
        'A': list('abdcde'),
        'B': ['s', np.nan, 'h', 'j', np.nan, 'g']
    }
)

df2 = pd.DataFrame(
    {
        'mapcol': list('abpppozl')
    }
)

   A    B
0  a    s
1  b  NaN
2  d    h
3  c    j
4  d  NaN
5  e    g

  mapcol
0      a
1      b
2      p
3      p
4      p
5      o
6      z
7      l

I would now like to fill B in df1 using the values of df2['mapcol'], however not using the actual index but - in this case - just the first two entries of df2['mapcol']. So, instead of b and p that correspond to index 1 and 4, respectively, I would like to use the values a and b.

One way of doing it would be to construct a dictionary with the correct indices and values:

df1['B_filled_incorrect'] = df1['B'].fillna(df2['mapcol'])

ind = df1[df1['B'].isna()].index

# reset_index is required as we might have a non-numerical index
val = df2.reset_index().loc[:len(ind-1), 'mapcol'].values

map_dict = dict(zip(ind, val))

df1['B_filled_correct'] = df1['B'].fillna(map_dict)

   A    B B_filled_incorrect B_filled_correct
0  a    s                  s                s
1  b  NaN                  b                a
2  d    h                  h                h
3  c    j                  j                j
4  d  NaN                  p                b
5  e    g                  g                g

which gives the desired output.

Is there a more straightforward way that avoids the creation of all these intermediate variables?

Cleb
  • 25,102
  • 20
  • 116
  • 151

1 Answers1

5

position fill you can assign the value via the loc and convert fill value to list

df1.loc[df1.B.isna(),'B']=df2.mapcol.iloc[:df1.B.isna().sum()].tolist()
df1
Out[232]: 
   A  B
0  a  s
1  b  a
2  d  h
3  c  j
4  d  b
5  e  g
BENY
  • 317,841
  • 20
  • 164
  • 234
  • 1
    @YOBEN_S Nice... I going down a rabbit hole of cumsum.. reindex... reset_index.. etc... etc.. etc.. +1 – Scott Boston Jul 28 '20 at 13:40
  • Mmmh, not part of the original question, but not sure whether it is worth a new one: If there are more `NaNs` in `df1` than `df2` can fill, it will fail. Do you see an easy way to take this into consideration? – Cleb Jul 28 '20 at 13:50
  • 1
    @Cleb `df3=df2.reindex(range(df1.B.isna().sum()))` :-) then `df1.loc[df1.B.isna(),'B']=df3.mapcol.tolist()` – BENY Jul 28 '20 at 14:00
  • 1
    @YOBEN_S: That seems fine; I had a rather ugly `df1.loc[np.where(df1['B'].isna())[0][:df2.mapcol.shape[0]], 'B']`... ;) I'll play a bit more and then accept in a bit. – Cleb Jul 28 '20 at 14:08