1

This is a follow up question to this: pandas replace only part of a column

Here is my current input:

import pandas as pd
from pandas_datareader import data, wb
import numpy as np
from datetime import date

pd.set_option('expand_frame_repr', False)

df = data.DataReader('GE', 'yahoo', date (2000, 1, 1), date (2000, 2, 1))
df['x'] = np.where (df['Open'] > df['High'].shift(-2), 1, np.nan)
print (df.round(2))

# this section of code works perfectly for an integer based index.......
ii = df[pd.notnull(df['x'])].index
dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj

for ci in jj:
    df.loc[ci:ci+2,'x'] = 1.0
# end of section that works perfectly for an integer based index......

print (df.round(2))

Here is my current output:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.68  1.0 
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.49  1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.44  NaN
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.82  NaN
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.94  NaN
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.93  NaN
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.98  NaN
2000-01-12  151.06  153.25  150.56  152.00  18342300      30.08  NaN 
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.42  1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.88  1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.29  NaN
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.43  NaN
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.88  1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.52  1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.33  1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.41  NaN
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.99  NaN
2000-01-27  141.56  141.75  137.06  141.75  19243500      28.05  1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.52  1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.52  NaN
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.91  NaN
Traceback (most recent call last):
  File "C:\stocks\question4 for stack overflow.py", line 15, in <module>
    jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
  File "C:\stocks\question4 for stack overflow.py", line 15, in <listcomp>
    jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
TypeError: Cannot cast ufunc greater input from dtype('<m8[ns]') to dtype('<m8') with casting rule 'same_kind'

What I want to do is to change column 'x' to be a set of three 1's in a row, non-overlapping. The desired output is:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.68  1.0
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.49  1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.44  1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.82  NaN
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.94  NaN
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.93  NaN
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.98  NaN
2000-01-12  151.06  153.25  150.56  152.00  18342300      30.08  NaN
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.42  1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.88  1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.29  1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.43  NaN
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.88  1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.52  1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.33  1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.41  NaN
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.99  NaN
2000-01-27  141.56  141.75  137.06  141.75  19243500      28.05  1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.52  1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.52  1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.91  NaN

So, January 5, 18 and 31 change from NaN to 1.0.

As the comment above states, the second portion of the code works perfect for an integer based index. However it doesn't work when with the datetime index of dtype datetime64[ns]. I think I need just a tiny tweak to the second portion of the code to get this to work (hopefully).

Thanks in advance, David

--------------------------follow up section ------------------------------------

Thanks for hanging in there with me b2002. i am really trying to keep with the top solution due to it's brevity. when i run your code out of the box, here is the output:

original output with

...jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]...

... a[ci:ci+2] = 1.0...

              Open    High     Low   Close    Volume  Adj Close    x  ii  dd  jj  jj  desired
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.68  1.0  1
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.49  1.0  1
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.44  1.0  2          x    x
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.82  1.0  3   1  
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.94  NaN  4   1
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.93  NaN  5   1
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.98  NaN  6   1
2000-01-12  151.06  153.25  150.56  152.00  18342300      30.08  NaN  7   1
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.42  1.0  1
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.88  1.0  1
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.29  1.0  10  3   x  x    x
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.43  1.0  11  1
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.88  1.0  1
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.52  1.0  1
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.33  1.0  1
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.41  1.0  15  4   z  z
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.99  1.0  16  1
2000-01-27  141.56  141.75  137.06  141.75  19243500      28.05  1.0  1
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.52  1.0  1
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.52  1.0  19  3   x  x    x
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.91  1.0  20  1              

I am really trying to understand what is going on so i set up columns ii, dd, jj before, jj after, and desired. when i tweak the input to:

...jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]...

... a[ci:ci+1] = 1.0...

here is the output:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.45  1.0
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.27  1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.22  1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.60  NaN
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.70  NaN
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.69  NaN
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.74  NaN
2000-01-12  151.06  153.25  150.56  152.00  18342300      29.84  NaN
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.18  1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.64  1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.05  1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.19  NaN
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.65  1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.29  1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.12  1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.19  1.0
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.77  NaN
2000-01-27  141.56  141.75  137.06  141.75  19243500      27.83  1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.31  1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.31  1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.70  NaN

the only problem is with January 25th, where the np.diff is giving a value of 4. i just need the code to skip a value of 4 to leave the existing sets of three 1's alone. i tried to modify dd before it goes to jj with these two attempts which didn't work:

dd[dd == 4] = 1

dd = [3 if x==4 else x for x in dd]

also tried to modify the jj entry with this:

jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]

which gives this error message:

Traceback (most recent call last):
  File "C:\stocks\question4 for stack overflow.py", line 109, in <module>
    jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
  File "C:\stocks\question4 for stack overflow.py", line 109, in <listcomp>
    jj = [ii[i] for i in range(1,len(ii)) if ((dd == 4) or (dd[i-1] > 2))]
ValueError: The truth value of an array with more than one element is     ambiguous. Use a.any() or a.all()

anyone have any ideas?

Community
  • 1
  • 1
bud fox
  • 335
  • 3
  • 16
  • you may try using `ix` for mixed label/integer based access rather than `loc`, or reset_index and perform the transformation and set_index back to Date – Anzel Feb 23 '17 at 14:17
  • Can you explain the logic of your code? What are you trying to do? Why do you need three consecutive 1's at those rows? – Parfait Feb 24 '17 at 00:48
  • parfait - this is just an example. no specific reason behind it. – bud fox Feb 24 '17 at 04:59
  • if your data is not too large and/or you are not too concerned about super fast speed, the function I wrote can be imported from a separate file and executed with a one-liner. Concerning the shorter code, I should have said the code will run vs the code will work in the first line of my answer. I can help you set up the separate file and import if you wish. Don't have time to work on the other code now unfortunately. – b2002 Feb 24 '17 at 21:42

2 Answers2

1

The code will work if it doesn't depend on the index:

#mod version
a = np.array(df.x)
ii = np.where(np.isnan(a))[0]

dd = np.diff(ii)
jj = [ii[i] for i in range(1,len(ii)) if dd[i-1] > 2]
jj = [ii[0]] + jj

for ci in jj:
    a[ci:ci+2] = 1.0
df.x = a

I'm not sure that the results are exactly what you are looking for though...

The code below allows you to search for specific patterns and then replace those patterns with other defined patterns. Drawback is that is loops through the entire array several times depending on the number of search patterns, which may or may not matter depending on the size of your data.

The 'found' patterns are marked and are not included in subsequent search loops which avoids overlapping results. So, the searches are done in a priority fashion. Adjust the elements in the patterns and fills lists to change the rules.

I think the pattern rules below produce the output desired per your previous question but it has only been tested lightly...

# search patterns in original data (zeros represent nans)
p1 = [1., 1., 1.]
p2 = [1., 0., 1.]
p3 = [1., 1., 0.]
p4 = [1., 0., 0.]

# markers to 'set aside' found patterns (can be any list of floats > 1.0 
# for each, the same float for each fill makes it easy to see which
# replacements were done where for testing...)
f1 = [5., 5., 5.]
f2 = [4., 4., 4.]
f3 = [3., 3., 3.]
f4 = [2., 2., 2.]

patterns = [p1, p2, p3, p4]
fills = [f1, f2, f3, f4]

def fill_segments(a, test_patterns, fill_patterns):
    # replace nans with zeros so fast numpy array_equal will work
    nan_idx = np.where(np.isnan(a))[0]
    np.put(a, nan_idx, 0.)
    col_index = list(np.arange(a.size))
    # loop forward through sequence comparing segment patterns
    for j in np.arange(len(test_patterns)):
        this_pattern = test_patterns[j]
        snip = len(this_pattern)
        rng = col_index[:-snip + 1]
        for i in rng:
            seg = a[col_index[i: i + snip]]
            if np.array_equal(seg, this_pattern):
                # when a match is found, replace values in array segment
                # with fill pattern
                pattern_indexes = col_index[i: i + snip]
                np.put(a, pattern_indexes, fill_patterns[j])
    # convert all fillers to ones
    np.put(a, np.where(a > 1.)[0], 1.)
    # convert zeros back to nans
    np.put(a, np.where(a == 0.)[0], np.nan)

    return a

run function and assign to df.x column

df.x = fill_segments(np.array(df.x), patterns, fills)

Input:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800  29.68      1.0
2000-01-04  147.25  148.00  144.00  144.00  22121400  28.49      1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800  28.44     NaN 
2000-01-06  143.12  146.94  142.63  145.67  19873200  28.82     NaN 
2000-01-07  148.00  151.88  147.00  151.31  20141400  29.94     NaN 
2000-01-10  152.69  154.06  151.12  151.25  15226500  29.93     NaN 
2000-01-11  151.00  152.69  150.62  151.50  15123000  29.98     NaN 
2000-01-12  151.06  153.25  150.56  152.00  18342300  30.08     NaN 
2000-01-13  153.13  154.94  153.00  153.75  14953500  30.42      1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300  29.88      1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700  29.29     NaN 
2000-01-19  146.50  150.94  146.25  148.72  14849700  29.43     NaN 
2000-01-20  149.06  149.75  142.63  145.94  30759000  28.88      1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400  28.52      1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100  27.33      1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500  27.41     NaN 
2000-01-26  140.50  142.19  138.88  141.44  15856800  27.99     NaN 
2000-01-27  141.56  141.75  137.06  141.75  19243500  28.05      1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700  26.52      1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700  26.52     NaN 
2000-02-01  134.25  137.00  134.00  136.00  27339000  26.91     NaN 

Output:

              Open    High     Low   Close    Volume  Adj Close    x
Date                                                                
2000-01-03  153.00  153.69  149.19  150.00  22069800  29.68      1.0
2000-01-04  147.25  148.00  144.00  144.00  22121400  28.49      1.0
2000-01-05  143.75  147.00  142.56  143.75  27292800  28.44      1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200  28.82     NaN 
2000-01-07  148.00  151.88  147.00  151.31  20141400  29.94     NaN 
2000-01-10  152.69  154.06  151.12  151.25  15226500  29.93     NaN 
2000-01-11  151.00  152.69  150.62  151.50  15123000  29.98     NaN 
2000-01-12  151.06  153.25  150.56  152.00  18342300  30.08     NaN 
2000-01-13  153.13  154.94  153.00  153.75  14953500  30.42      1.0
2000-01-14  153.38  154.63  149.56  151.00  18480300  29.88      1.0
2000-01-18  149.62  149.62  146.75  148.00  18296700  29.29      1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700  29.43     NaN 
2000-01-20  149.06  149.75  142.63  145.94  30759000  28.88      1.0
2000-01-21  147.94  148.25  143.94  144.13  24005400  28.52      1.0
2000-01-24  145.31  145.94  136.44  138.13  27116100  27.33      1.0
2000-01-25  138.06  140.38  137.00  138.50  25387500  27.41     NaN 
2000-01-26  140.50  142.19  138.88  141.44  15856800  27.99     NaN 
2000-01-27  141.56  141.75  137.06  141.75  19243500  28.05      1.0
2000-01-28  140.31  140.50  133.63  134.00  29846700  26.52      1.0
2000-01-31  134.00  135.94  133.06  134.00  21782700  26.52      1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000  26.91     NaN 
Community
  • 1
  • 1
b2002
  • 914
  • 1
  • 6
  • 10
-1

--------------------- FINAL ANSWER / FINALLY SOLVED ----------- well, it's been a couple weeks of part time effort and several dozen hours but i finally got it! i know this code is a blunt instrument but it works. if anyone has suggestions on reducing the code or speeding it up, please let me know!

here is the final input:

import pandas as pd
from pandas_datareader import data, wb
import numpy as np
from datetime import date 

df = data.DataReader('GE', 'yahoo', date (2000, 1, 1), date (2000, 6, 1))
df['x'] = np.where (df['Open'] < df['High'].shift(-2), 1, np.nan)
df['x2'] = df['x']

test = 0

for i in np.nditer(df['x2'], op_flags=['readwrite']):

    if test == 4:
        test = 0

    if test == 3:
        i[...] = 3
        test = 4

    if test == 2:
        i[...] = 2
        test = 3

    if (test == 1) & (i[...] == 1):
        i[...] = 1
        test = 2

    if (test == 0) & (i[...] == 1):
        i[...] = 1
        test = 2

    if (test == 0) & (i[...] == np.nan):
        i[...] = np.nan
        test = 1

print (df.round(2))

here is a section the final output:

              Open    High     Low   Close    Volume  Adj Close    x   x2
Date                                                                     
2000-01-03  153.00  153.69  149.19  150.00  22069800      29.45  NaN  NaN
2000-01-04  147.25  148.00  144.00  144.00  22121400      28.27  NaN  NaN
2000-01-05  143.75  147.00  142.56  143.75  27292800      28.22  1.0  1.0
2000-01-06  143.12  146.94  142.63  145.67  19873200      28.60  1.0  2.0
2000-01-07  148.00  151.88  147.00  151.31  20141400      29.70  1.0  3.0
2000-01-10  152.69  154.06  151.12  151.25  15226500      29.69  1.0  1.0
2000-01-11  151.00  152.69  150.62  151.50  15123000      29.74  1.0  2.0
2000-01-12  151.06  153.25  150.56  152.00  18342300      29.84  1.0  3.0
2000-01-13  153.13  154.94  153.00  153.75  14953500      30.18  NaN  NaN
2000-01-14  153.38  154.63  149.56  151.00  18480300      29.64  NaN  NaN
2000-01-18  149.62  149.62  146.75  148.00  18296700      29.05  1.0  1.0
2000-01-19  146.50  150.94  146.25  148.72  14849700      29.19  1.0  2.0
2000-01-20  149.06  149.75  142.63  145.94  30759000      28.65  NaN  3.0
2000-01-21  147.94  148.25  143.94  144.13  24005400      28.29  NaN  NaN
2000-01-24  145.31  145.94  136.44  138.13  27116100      27.12  NaN  NaN
2000-01-25  138.06  140.38  137.00  138.50  25387500      27.19  1.0  1.0
2000-01-26  140.50  142.19  138.88  141.44  15856800      27.77  NaN  2.0
2000-01-27  141.56  141.75  137.06  141.75  19243500      27.83  NaN  3.0
2000-01-28  140.31  140.50  133.63  134.00  29846700      26.31  NaN  NaN
2000-01-31  134.00  135.94  133.06  134.00  21782700      26.31  1.0  1.0
2000-02-01  134.25  137.00  134.00  136.00  27339000      26.70  1.0  2.0
2000-02-02  137.12  137.62  134.06  134.06  21820200      26.32  1.0  3.0
2000-02-03  135.94  139.81  135.25  139.25  20232000      27.34  1.0  1.0
2000-02-04  141.00  143.12  140.50  141.56  18167100      27.79  NaN  2.0
2000-02-07  141.69  141.75  135.88  136.50  18285000      26.80  NaN  3.0

i altered the values in column x2 to display 1 - 3 instead of just 1 to see when a new series started at the end of an old series.

bud fox
  • 335
  • 3
  • 16