pandas replace only part of a column

Question

Here is my input:

import pandas as pd
import numpy as np

list1 = [10,79,6,38,4,557,12,220,46,22,45,22]
list2 = [4,3,23,6,234,47,312,2,426,42,435,23]

df = pd.DataFrame({'A' : list1, 'B' : list2}, columns = ['A', 'B'])
df['C'] = np.where (df['A'] > df['B'].shift(-2), 1, np.nan)
print (df)

that produces this output:

      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  NaN
3    38    6  NaN
4     4  234  NaN
5   557   47  1.0
6    12  312  NaN
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN

What I need to do is to change column 'C' to be a set of three 1's in a row, non-overlapping. The desired output is:

      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  1.0
3    38    6  1.0
4     4  234  NaN
5   557   47  1.0
6    12  312  1.0
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN

So, rows 2, 3, and 6 change from NaN to 1.0. Row 7 already has a 1.0 and it is ignored. Rows 8 and 9 need to stay NaN because row 7 is the last entry of the previous set.

I don't know if there is a better way to build column 'C' that would do this at creation.

I have tried several versions of fillna and ffill, none of them worked for me.

It seems very convoluted but I tried to isolate the row id's for each 1.0 with this line:

print (df.loc[df['C'] == 1])

Which correctly outputs this:

     A   B    C
1   79   3  1.0
5  557  47  1.0
7  220   2  1.0

Even though I know that information, I don't know how to proceed from there.

Thank you so much for your help in advance, David

What if you have a sequence as `1, NaN, 1,1,NaN,NaN`? what would you like the result to be? — Psidom, Feb 22 '17 at 03:16
What is the desired output if the sequence is:nan, 1, nan, 1, nan, 1, nan, 1, nan — b2002, Feb 22 '17 at 13:52
thanks for looking b2002 - that sequence would be: nan, 1, 1, 1, nan, 1, 1, 1, nan — bud fox, Feb 23 '17 at 00:42
in that case, I am curious if my answer works for your data! — b2002, Feb 23 '17 at 01:19
b2002 - i changed my list input to produce the 'every other' sequence you described above. your code works perfectly! thanks. — bud fox, Feb 23 '17 at 05:03

Khris · Accepted Answer · 2017-02-23T14:20:18.317

4

EDIT:

Faster version (thanks to b2002):

ii = df[pd.notnull(df.C)].index
dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj

for ci in jj:
    df.C.values[ci:ci+3] = 1.0

First get the indices of all your starting points, i.e. all your points that are 1.0 and have two NaN following, by looking at the differences between the points that are not null in the C column (first index is included by default), then iterate over those indices and use loc to change slices of your C column:

ii = df[pd.notnull(df.C)].index
dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj

for ci in jj:
    df.loc[ci:ci+2,'C'] = 1.0

Result:

      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  1.0
3    38    6  1.0
4     4  234  NaN
5   557   47  1.0
6    12  312  1.0
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN

edited Feb 23 '17 at 14:20

answered Feb 22 '17 at 07:28

Khris

3,132
3
34
54

very nice answer! Really like numpy diff on the indexes. – b2002 Feb 22 '17 at 13:06
after some timing, this can be sped up tremendously for larger dataframes if the assignment of ones is done to a numpy array of df.C rather than using df.loc which turns out to be really slow. – b2002 Feb 22 '17 at 16:00
Khris - thanks so much. your solution worked perfectly on the same df – bud fox Feb 23 '17 at 05:04

b2002 · Answer 2 · 2017-02-22T20:26:32.173

list1 = [10,79,6,38,4,557,12,220,46,22,45,22]
list2 = [4,3,23,6,234,47,312,2,426,42,435,23]

df = pd.DataFrame({'A' : list1, 'B' : list2}, columns = ['A', 'B'])
df['C'] = np.where (df['A'] > df['B'].shift(-2), 1, np.nan)

      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  NaN
3    38    6  NaN
4     4  234  NaN
5   557   47  1.0
6    12  312  NaN
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN

make an array from sequence:

a = np.array(df.C)

This function will test segments of an array for matching patterns and will replace segments which match with another pattern. Previously matched segments will not be considered for future matches (the filler numbers are greater than one).

def fill_segments(a, test_patterns, fill_patterns):
    # replace nans with zeros so fast numpy array_equal will work
    nan_idx = np.where(np.isnan(a))[0]
    np.put(a, nan_idx, 0.)
    col_index = list(np.arange(a.size))
    # loop forward through sequence comparing segment patterns
    for j in np.arange(len(test_patterns)):
        this_pattern = test_patterns[j]
        snip = len(this_pattern)
        rng = col_index[:-snip + 1]
        for i in rng:
            seg = a[col_index[i: i + snip]]
            if np.array_equal(seg, this_pattern):
                # when a match is found, replace values in array segment
                # with fill pattern
                pattern_indexes = col_index[i: i + snip]
                np.put(a, pattern_indexes, fill_patterns[j])
    # convert all fillers to ones
    np.put(a, np.where(a > 1.)[0], 1.)
    # convert zeros back to nans
    np.put(a, np.where(a == 0.)[0], np.nan)

    return a

Patterns to be replaced:

p1 = [1., 1., 1.]
p2 = [1., 0., 1.]
p3 = [1., 1., 0.]
p4 = [1., 0., 0.]

And corresponding fill patterns:

f1 = [5., 5., 5.]
f2 = [4., 4., 4.]
f3 = [3., 3., 3.]
f4 = [2., 2., 2.]

make test_patterns and fill_patterns inputs

patterns = [p1, p2, p3, p4]
fills = [f1, f2, f3, f4]

run function:

a = fill_segments(a, patterns, fills)

assign a to column C

df.C = a

df:

      A    B    C
0    10    4  NaN
1    79    3  1.0
2     6   23  1.0
3    38    6  1.0
4     4  234  NaN
5   557   47  1.0
6    12  312  1.0
7   220    2  1.0
8    46  426  NaN
9    22   42  NaN
10   45  435  NaN
11   22   23  NaN

The patterns and fills may need to be adjusted/added to depending on the way the input column is initially populated and the specific result sequence rules.

b2002 - thanks so much for your solution. it worked perfectly on the sample df — bud fox, Feb 23 '17 at 05:05
b2002 and Khris - thanks for your solutions, they both worked. I screwed up when i created the example above because I was actually trying to use this with a datetime index not an integer index. Unfortunately, the solutions don't seem to be working with that dtype. I posted a follow up question here, if you care to look at it. Thanks. — bud fox, Feb 23 '17 at 14:10
http://stackoverflow.com/questions/42418035/pandas-replace-only-part-of-a-column-with-datetime-index — bud fox, Feb 23 '17 at 14:10

pandas replace only part of a column

2 Answers2

Linked