4

EDITED:

I have a (not very simple) dataframe:

df = pd.DataFrame([1, 2, np.nan, np.nan, np.nan, np.nan, 3, 4
    , np.nan, np.nan, np.nan, 5], columns=['att1'])

     att1
0  1.0000
1  2.0000
2     nan
3     nan
4     nan
5     nan
6  3.0000
7  4.0000
8     nan
9     nan
10    nan
11 5.0000

I want to fill the NaN values with the previous non-NaN value, except for the last NaN in each block, which should still be NaN after filling. How can I do that?

I want this result:

     att1
0  1.0000
1  2.0000
2  2.0000
3  2.0000
4  2.0000
5     nan
6  3.0000
7  4.0000
8  4.0000
9  4.0000
10    nan
11 5.0000

I tried this:

df = df.fillna(value='missing', method='bfill', limit=1)
df = df.fillna(method='ffill')

But the first line gives this error:

ValueError: cannot specify both a fill method and value

Why is there this limitation in pandas 0.17.1 / Python 3.5? Thank you!

ragesz
  • IIUC then doesn't this do what you want: `df.fillna(value='missing', limit=3)`? – EdChum Feb 03 '16 at 13:22
  • Thank you! I edited my question because first I used a simple `df` but I have several `NAN` blocks in my dataframe so I don't know the `forward` `limit`, only the `backward` `limit` that is `1`. – ragesz Feb 03 '16 at 13:42

3 Answers

4

You can count the NaN values in df['att1'], subtract 1, and pass the result as the limit parameter of fillna:

import pandas as pd
import numpy as np

df = pd.DataFrame([1, 2, np.nan, np.nan, np.nan, np.nan, 3] , columns=['att1'])
print(df)
   att1
0     1
1     2
2   NaN
3   NaN
4   NaN
5   NaN
6     3

s = df['att1'].isnull().sum() - 1
df['att1'] = df['att1'].fillna('missing', limit=s)
print(df)
      att1
0        1
1        2
2  missing
3  missing
4  missing
5      NaN
6        3

EDIT:

Now it is more complicated.

So first build a helper column count that labels each run of consecutive null or non-null values in column att1, using isnull, shift, astype and cumsum. Then groupby this count column and apply fillna to each group that consists only of NaN:

import pandas as pd
import numpy as np

df = pd.DataFrame([1, 2, np.nan, np.nan, np.nan, np.nan, 3, 4
    , np.nan, np.nan, np.nan, 5], columns=['att1'])
print(df)

df['count'] = (df['att1'].isnull() != df['att1'].isnull().shift()).astype(int).cumsum()
print(df)
    att1  count
0      1      1
1      2      1
2    NaN      2
3    NaN      2
4    NaN      2
5    NaN      2
6      3      3
7      4      3
8    NaN      4
9    NaN      4
10   NaN      4
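11     5      5

def f(x):
    # boolean mask of the NaN positions in this group
    att = x['att1'].isnull()
    # a run of a single NaN is skipped, because fillna rejects limit=0
    if att.all() and len(x) > 1:
        # all-NaN group: fill every NaN except the last one
        return x['att1'].fillna('missing', limit=att.sum() - 1)
    else:
        # group of real values (or a single NaN): return it unchanged
        return x['att1']

print(df.groupby(['count']).apply(f).reset_index(drop=True))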

0           1
1           2
2     missing
3     missing
4     missing
5         NaN
6           3
7           4
8     missing
9     missing
10        NaN
11          5
Name: att1, dtype: object

Explaining column count:

print (df['att1'].isnull() != df['att1'].isnull().shift())
0      True
1     False
2      True
3     False
4     False
5     False
6      True
7     False
8      True
9     False
10    False
11     True
Name: att1, dtype: bool
print((df['att1'].isnull() != df['att1'].isnull().shift()).astype(int))
0     1
1     0
2     1
3     0
4     0
5     0
6     1
7     0
8     1
9     0
10    0
11    1
Name: att1, dtype: int32
print((df['att1'].isnull() != df['att1'].isnull().shift()).astype(int).cumsum())
0     1
1     1
2     2
3     2
4     2
5     2
6     3
7     3
8     4
9     4
10    4
11    5
Name: att1, dtype: int32
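
The run-labelling idea above can also be wrapped in a small helper. This is only a sketch under a few assumptions: the function name fill_all_but_last and its arguments are made up for illustration, and it uses cumcount counted from the end of each run instead of the limit argument, so a run with a single NaN is simply left alone.

```python
import numpy as np
import pandas as pd


def fill_all_but_last(s, value):
    """Fill each run of NaN with `value`, leaving the last NaN of the run."""
    is_nan = s.isnull()
    # a new run starts wherever the NaN flag differs from the previous row
    run_id = (is_nan != is_nan.shift()).cumsum()
    # position counted from the end of each run; 0 marks the run's last row
    from_end = s.groupby(run_id).cumcount(ascending=False)
    # object dtype so a string fill value can sit next to the numbers
    out = s.astype(object)
    out[is_nan & (from_end > 0)] = value
    return out


s = pd.Series([1, 2, np.nan, np.nan, np.nan, np.nan, 3, 4,
               np.nan, np.nan, np.nan, 5], name='att1')
out = fill_all_but_last(s, 'missing')
print(out)
```

Rows 2 to 4 and 8 to 9 should come out as 'missing', while rows 5 and 10 stay NaN.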
jezrael
  • 1
    Thank you! I used a simple `df` example but of course I have several blocks of `NAN` values in my dataframe so when I calculate the `s` for `limit`, this solution will fail :( – ragesz Feb 03 '16 at 13:25
1

An alternative method that is maybe a little less complex is to create a list of the index points at which you expect there to be NaNs (the point just before each position where the value is not null but the previous value is null). Then you just forward fill your data and reinsert the NaNs using the list you created.

import pandas as pd
import numpy as np
from numpy import nan as NA
df = pd.DataFrame([1, 2, np.nan, np.nan, np.nan, np.nan, 3, 4
    , np.nan, np.nan, np.nan, 5], columns=['att1'])

#create list of index points where you want NaNs to be
#(range replaces the Python 2 xrange)
Nan_ind = [x - 1 for x in range(1, df.index[-1] + 1)
                if pd.notnull(df.loc[x, 'att1'])
                and pd.isnull(df.loc[x-1, 'att1'])]

#forward fillna
df['att1'] = df['att1'].ffill()

#reinsert NaNs using your list of index points
df.loc[Nan_ind, 'att1'] = NA
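
The same idea can probably be written without the Python loop. A minimal sketch, assuming the data is a single Series; the names s, last_nan and result are only illustrative:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, np.nan, np.nan, np.nan, 3, 4,
               np.nan, np.nan, np.nan, 5], name='att1')

# True where a value is NaN and the next value is not NaN,
# i.e. the last NaN of each block
last_nan = s.isnull() & s.shift(-1).notnull()

# forward fill everything, then put NaN back at the marked positions
result = s.ffill().mask(last_nan)
print(result)
```

Like the list comprehension, this only marks a NaN that is followed by a real value, so a NaN block at the very end of the data would be filled completely.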
Woody Pride
0

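Fill all NaN values with "missing", then replace the last "missing" with NaN.

# assumes numpy is imported as np
df['att1'] = df['att1'].fillna("missing")
df.iloc[-2] = df.iloc[-2].replace("missing", np.nan)

A negative index with iloc counts from the end, so -2 selects the second-to-last row of the dataframe. Note that this restores NaN only at that one position, so it handles the final NaN block but not earlier ones.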

Arthur Zennig