4

Suppose I have the following DataFrame df:

df = pd.DataFrame({"a" : [1,2,2,2,2,2,2,2,2,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5], "b" : [3,3,3,3,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,7,7], "c" : [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,1,2,2,2,2,2,2,2,2,3,3]})

I want to replace 4s that repeat more than 10 times in a row, in any column (there could be hundreds of columns), so that the first ten 4s are kept and the remainder become 5s.

So for example, 12 consecutive 4's would be replaced with ten 4's and two 5's.

How would I achieve this with Pandas?

I'd like to apply a lambda, but I don't know how to look back enough rows, and it would have to start from the end and work backwards, or the replacements would break the sequence of values. Each lookup would have to check whether the preceding 10 rows are all equal to 4 and, if so, set the current value to 5.

No idea how to go about it though!

jezrael
Chris

3 Answers

4

You can use:

import pandas as pd

# column a is changed so that it contains two groups of 4s
df = pd.DataFrame({
"a" : [4,4,4,4,4,4,4,4,4,4,4,4,4,4,7,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5], 
"b" : [3,3,3,3,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,7,7], 
"c" : [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,1,2,2,2,2,2,2,2,2,3,3]})

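The solution counts consecutive 4s. The cumulative count of 4s is reset at every non-4 value by subtracting a baseline, which is built from the NaNs created by where and then forward-filled. The resulting boolean mask is applied to the original df to replace 4 with 5: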

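# True where the value is 4
a = df == 4
# running count of 4s minus the count at the last non-4 row = length of the current run
mask = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0) > 10
# replace every 4 beyond the 10th in a run with 5
df1 = df.mask(mask, 5)
print(df1)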
    a  b  c
0   4  3  4
1   4  3  4
2   4  3  4
3   4  3  4
4   4  3  4
5   4  3  4
6   4  3  4
7   4  4  4
8   4  4  4
9   4  4  4
10  5  4  5
11  5  5  5
12  5  5  5
13  5  5  5
14  7  5  5
15  4  5  5
16  4  5  5
17  4  5  5
18  4  5  5
19  4  5  5
20  4  5  5
21  4  5  1
22  4  5  2
23  4  5  2
24  4  5  2
25  5  5  2
26  5  5  2
27  5  5  2
28  5  6  2
29  5  6  2
30  5  7  3
31  5  7  3
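
To see how the mask is built, here is a minimal sketch that walks through the intermediate steps on a single short Series. The names s, limit, is4, running, baseline and run_length are illustrative only, and the threshold is lowered to 2 so the effect is visible on eight values:

```python
import pandas as pd

s = pd.Series([4, 4, 4, 7, 4, 4, 4, 4])
limit = 2  # illustrative threshold; the answer above uses 10

is4 = s == 4

# Running total of 4s seen so far: 1, 2, 3, 3, 4, 5, 6, 7
running = is4.cumsum()

# Keep the running total only at non-4 rows, carry it forward,
# and use 0 before the first non-4 row: 0, 0, 0, 3, 3, 3, 3, 3
baseline = running.where(~is4).ffill().fillna(0)

# Length of the current run of 4s: 1, 2, 3, 0, 1, 2, 3, 4
run_length = running - baseline

# Replace every 4 beyond the allowed run length with 5
result = s.mask(run_length > limit, 5)
print(result.tolist())
```

Because the baseline is taken from the last non-4 row, the count restarts for the second group of 4s instead of continuing from the first.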

To compare the original and new values more easily, you can use concat:

print (pd.concat([df, df1], axis=1, keys=['orig','new']))
   orig       new      
      a  b  c   a  b  c
0     4  3  4   4  3  4
1     4  3  4   4  3  4
2     4  3  4   4  3  4
3     4  3  4   4  3  4
4     4  3  4   4  3  4
5     4  3  4   4  3  4
6     4  3  4   4  3  4
7     4  4  4   4  4  4
8     4  4  4   4  4  4
9     4  4  4   4  4  4
10    4  4  4   5  4  5
11    4  5  4   5  5  5
12    4  5  4   5  5  5
13    4  5  4   5  5  5
14    7  5  4   7  5  5
15    4  5  4   4  5  5
16    4  5  4   4  5  5
17    4  5  4   4  5  5
18    4  5  5   4  5  5
19    4  5  5   4  5  5
20    4  5  5   4  5  5
21    4  5  1   4  5  1
22    4  5  2   4  5  2
23    4  5  2   4  5  2
24    4  5  2   4  5  2
25    4  5  2   5  5  2
26    4  5  2   5  5  2
27    4  5  2   5  5  2
28    4  6  2   5  6  2
29    5  6  2   5  6  2
30    5  7  3   5  7  3
31    5  7  3   5  7  3
jezrael
  • Thank you for the answer :) – IanS Mar 22 '17 at 09:16
  • This is awesome. How does it work for any column- i.e. if I want to run it on columns a, b and c? Presumably I can stick it through a `for column in df.columns`? – Chris Mar 22 '17 at 09:18
  • It works on all columns at once, because all of these functions are implemented for DataFrames. Check columns `a` and `c`. – jezrael Mar 22 '17 at 09:19
  • I added `concat` to make the output easier to verify; please check the last edit. – jezrael Mar 22 '17 at 09:23
  • This is unbelievable. How on earth do you become such a kung-fu Panda? I need to spend a good 30 minutes understanding how this works, but it works beautifully. Thanks. – Chris Mar 22 '17 at 10:38
  • I think it takes many hours of coding. But these consecutive-value problems are really hard. Thank you for accepting! – jezrael Mar 22 '17 at 11:14
3

Remove all the 4s, fill the resulting NaNs back with 4s using limit=10, and fill the remaining NaNs with 5s. I find this method more explicit and closer to your intention:

df[df!=4].fillna(4, limit=10).fillna(5)

If needed, cast the result back to integers with astype(int) at the end, since introducing NaNs upcasts the DataFrame to floats.

Zeugma
  • It works only if there is a single group of `4` in a column. Check my sample: in the second group of `4` in column `a`, all values are replaced with 5 – jezrael Mar 22 '17 at 09:39
  • @jezrael BTW it's interesting I was misled by the argument docstring "If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled." – Zeugma Mar 22 '17 at 13:26
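
Given the docstring quoted above, a limit passed without a method caps the total number of fills along the axis rather than the fills per gap. A run-aware variant is sketched below as one possible workaround, not as the answerer's method. The helper name cap_runs and its parameters are hypothetical:

```python
import pandas as pd

def cap_runs(s, target=4, replacement=5, max_run=10):
    """Hypothetical helper: keep at most max_run consecutive targets per run."""
    is_target = s.eq(target)
    # A new run id starts whenever is_target changes between neighbouring rows
    run_id = is_target.astype(int).diff().ne(0).cumsum()
    # 1-based position of each row inside its run
    pos = is_target.groupby(run_id).cumcount() + 1
    # Replace targets that sit beyond the allowed run length
    return s.mask(is_target & (pos > max_run), replacement)

s = pd.Series([4, 4, 4, 7, 4, 4, 4, 4])
capped = cap_runs(s, max_run=2)
print(capped.tolist())
```

For a whole DataFrame the same helper could be applied per column with df.apply(cap_runs). Since no NaNs are introduced, the integer dtype is preserved.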
1

This should do the trick:

import pandas as pd

df = pd.DataFrame({"a" : [1,2,2,2,2,2,2,2,2,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5], "b" : [3,3,3,3,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,7,7], "c" : [4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,1,2,2,2,2,2,2,2,2,3,3]})

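def replacer(l, target_val=4, replace_val=5, repeat_max=10):
    # Count consecutive occurrences of target_val; reset on any other value
    counter = 0
    new_l = []
    for e in l:
        if e == target_val:
            counter += 1
        else:
            counter = 0

        # Past the allowed run length, emit the replacement instead
        if counter > repeat_max:
            new_l.append(replace_val)
        else:
            new_l.append(e)

    return new_l

# apply calls replacer once per column
df1 = df.apply(replacer)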

Output:

    a  b  c
0   1  3  4
1   2  3  4
2   2  3  4
3   2  3  4
4   2  3  4
5   2  3  4
6   2  3  4
7   2  4  4
8   2  4  4
9   3  4  4
10  3  4  5
11  4  5  5
12  4  5  5
13  4  5  5
14  4  5  5
15  4  5  5
16  4  5  5
17  4  5  5
18  4  5  5
19  4  5  5
20  4  5  5
21  5  5  1
22  5  5  2
23  5  5  2
24  5  5  2
25  5  5  2
26  5  5  2
27  5  5  2
28  5  6  2
29  5  6  2
30  5  7  3
31  5  7  3
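
As a cross-check, the loop can be compared with the cumsum-based mask from the first answer on a column that contains two groups of 4s. This is a sketch only: the replacer here is a condensed variant that returns a Series (an assumption made to keep the index explicit), and the names looped and vectorized are illustrative:

```python
import pandas as pd

def replacer(values, target_val=4, replace_val=5, repeat_max=10):
    # Condensed variant of the loop above, returning a Series
    counter = 0
    out = []
    for v in values:
        counter = counter + 1 if v == target_val else 0
        out.append(replace_val if counter > repeat_max else v)
    return pd.Series(out, index=values.index)

df = pd.DataFrame({
    "a": [4] * 14 + [7] + [4] * 14 + [5] * 3,             # two groups of 4s
    "c": [4] * 18 + [5] * 3 + [1] + [2] * 8 + [3] * 2,
})

looped = df.apply(replacer)

# Mask from the first answer
a = df == 4
mask = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0) > 10
vectorized = df.mask(mask, 5)

print((looped.values == vectorized.values).all())
```

The loop is easier to read, while the vectorized mask avoids a Python-level loop over each column, which may matter with hundreds of columns.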
Alex Fung