5

I have two dataframes like this:

import pandas as pd
import numpy as np


df1 = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1675987'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])

df2 = pd.DataFrame({
    'key1': list('ABCCADD'),
    'key2': list('1598787'),
    'prop1': [np.nan] * 7,
    'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])

          prop1 prop2
key1 key2            
A    1        x     m
B    6        y     n
A    7        z     b
     5        u     b
C    9        y     b
     8        n     a
A    7        b     s

           prop1  prop2
key1 key2              
A    1       NaN    NaN
B    5       NaN    NaN
C    9       NaN    NaN
     8       NaN    NaN
A    7       NaN    NaN
D    8       NaN    NaN
     7       NaN    NaN

and would now like to use df1 to fill df2 using

df2.fillna(df1)

however,I get

site-packages/pandas/core/generic.py in _where(self, cond, other, inplace, axis, level, errors, try_cast) 8694
other._get_axis(i).equals(ax) for i, ax in enumerate(self.axes)
8695 ): -> 8696 raise InvalidIndexError 8697 8698 # slice me out of the other

InvalidIndexError:

I used this approach successfully in the past and I do not really understand why that one fails. Any ideas how to make it work?

EDIT

Here is an example which is very similar and works perfectly fine:

filler1 = pd.DataFrame({
    'key': list('AAABCCDD'),
    'prop1': list('xyzuyasj'),
    'prop2': list('mnbbbqwo')
})

tobefilled1 = pd.DataFrame({
    'key': list('AAABBCACDF'),
    'keep_me': ['stuff'] * 10,
    'prop1': [np.nan] * 10,
    'prop2': [np.nan] * 10,
    
})

filler1['g'] = filler1.groupby('key').cumcount()
tobefilled1['g'] = tobefilled1.groupby('key').cumcount()

filler1 = filler1.set_index(['key', 'g'])
tobefilled1 = tobefilled1.set_index(['key', 'g'])

print(tobefilled1.fillna(filler1))

prints

key g                    
A   0   stuff     x     m
    1   stuff     y     n
    2   stuff     z     b
B   0   stuff     u     b
    1   stuff   NaN   NaN
C   0   stuff     y     b
A   3   stuff   NaN   NaN
C   1   stuff     a     q
D   0   stuff     s     w
F   0   stuff   NaN   NaN
Cleb
  • 25,102
  • 20
  • 116
  • 151

2 Answers2

5

The problem here is the duplicate index defined in df1:

df1 = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1675987'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])

Note: Key1=A Key2=7 appears twice, the index for df1 is not unique.

Let's change that second A7 to A9

df1 = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1675989'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])

df2 = pd.DataFrame({
    'key1': list('ABCCADD'),
    'key2': list('1598787'),
    'prop1': [np.nan] * 7,
    'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])

Thus creating unique indexing in df1, now try df.fillna:

df2.fillna(df1)

Output:

          prop1 prop2
key1 key2            
A    1        x     m
B    5      NaN   NaN
C    9        y     b
     8        n     a
A    7        z     b
D    8      NaN   NaN
     7      NaN   NaN

I got hint of this when I tried the reindex_like method, first with unique indexing:

df1 = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1675989'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])

df2 = pd.DataFrame({
    'key1': list('ABCCADD'),
    'key2': list('1598787'),
    'prop1': [np.nan] * 7,
    'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])
print(df1.reindex_like(df2))

Output:

          prop1 prop2
key1 key2            
A    1        x     m
B    5      NaN   NaN
C    9        y     b
     8        n     a
A    7        z     b
D    8      NaN   NaN
     7      NaN   NaN

Now, let's revert to the original dataframes in the post:

df1 = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1675987'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])

df2 = pd.DataFrame({
    'key1': list('ABCCADD'),
    'key2': list('1598787'),
    'prop1': [np.nan] * 7,
    'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])
print(df1.reindex_like(df2))

Output ValueError:

ValueError: cannot handle a non-unique multi-index!

Another work-around it to create unique indexing by adding another index level with cumcount.

df1 = pd.DataFrame({
    'key1': list('ABAACCA'),
    'key2': list('1675987'),
    'prop1': list('xyzuynb'),
    'prop2': list('mnbbbas')
}).set_index(['key1', 'key2'])

df2 = pd.DataFrame({
    'key1': list('ABCCADD'),
    'key2': list('1598787'),
    'prop1': [np.nan] * 7,
    'prop2': [np.nan] * 7
}).set_index(['key1', 'key2'])

df1 = df1.set_index(df1.groupby(df1.index).cumcount(), append=True)
df2 = df2.set_index(df2.groupby(df2.index).cumcount(), append=True)

df2.fillna(df1)

Output:

            prop1 prop2
key1 key2              
A    1    0     x     m
B    5    0   NaN   NaN
C    9    0     y     b
     8    0     n     a
A    7    0     z     b
D    8    0   NaN   NaN
     7    0   NaN   NaN

Then you can drop index level 2:

df2.fillna(df1).reset_index(level=2, drop=True)

Output:

          prop1 prop2
key1 key2            
A    1        x     m
B    5      NaN   NaN
C    9        y     b
     8        n     a
A    7        z     b
D    8      NaN   NaN
     7      NaN   NaN

However, I think pandas should have nicer error messaging for fillna non-unique MultiIndexes like it does for reindex_like.

Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • 2
    The indexing issue evidently explains this. Adding `verify_integrity=True` throws a ValueError in this line: `df1 = pd.DataFrame({...}).set_index(['key1', 'key2'], verify_integrity=True)` – jsmart Jul 28 '20 at 22:23
  • Ah, nice, did not notice this one; I ran into this when creating a minimal example for my actual problem where I use the `cumsum` approach indeed; so, in my actual use cases I always have unique indices, but in my toy example I screwed up... ;) Upvoted for now, will accept and reward the bounty later on... – Cleb Jul 29 '20 at 06:46
1

Here is problem some index values not match, for me working alternative solution with DataFrame.combine_first:

df = df2.combine_first(df1)
print (df)
          prop1 prop2
key1 key2            
A    1        x     m
     5        u     b
     7        z     b
     7        b     s
B    5      NaN   NaN
     6        y     n
C    8        n     a
     9        y     b
D    7      NaN   NaN
     8      NaN   NaN
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks. But still, do you know why the example from above fails and for example [this one](https://stackoverflow.com/a/57787848/1534017) (your own answer :) ) works? – Cleb Jul 08 '20 at 07:56
  • @Cleb - Honestly not idea, I guess difference is all values of index match in linked answer. – jezrael Jul 08 '20 at 07:59
  • Ok. That's what I thought as well, but I also have examples where I still have a lot of `NAN`s left, but it works fine; this one here is a bit of a mystery to me. – Cleb Jul 08 '20 at 08:05
  • @Cleb - Maybe bug, because error is very strange, and not clear. – jezrael Jul 08 '20 at 08:07
  • 1
    Let's see whether someone else has an idea; I added a working example to my question where one still has `NAN`s left. – Cleb Jul 08 '20 at 08:43