
Hello, I have a huge list of values. I want to find all n-value patterns like list[0:30], list[1:31], and for each value compute the percentage change relative to the first, like percentage_change(array[0], array[1]), percentage_change(array[0], array[2]), all the way to the end of the pattern. After this, I want to store all the 30-value patterns in an array of patterns to compare against other values in the future.

To do so I have to build a function. In this function, the 30 values can be changed to any number of my choice by changing the numberOfEntries variable. For each pattern, I take the mean of the next 10 outcomes and store it in an array of outcomes at the same index.

#end point is the end of array
#inputs (array, numberOfEntries)
#outPut(list of Patterns, list of outcomes)

y=0
condition= numberOfEntries+1
#each pattern list
pattern=[]
#list of patterns
Patterns=[] 
#outcomes array
outcomes=[]



while (y<len(array)):
    i=1
    while(i<condition):

        #this is the percentage change function, built inline for speed. try is used because of possible division by zero
        try:
            x = ((float(array[y-(numberOfEntries-i)])-array[y-numberOfEntries])/abs(array[y-numberOfEntries]))*100.00
            if x == 0.0:
                x=0.000000001
        except:
            x= 0.00000001
        i+=1
        pattern.append(x)
    #here are the outcomes
    outcomeRange = array[y+5:y+15]
    outcomes.append(outcomeRange)
    Patterns.append(pattern)
    #reset the pattern list
    pattern=[]
    y+=1
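As a point of comparison, the same windowed percent-change computation can be vectorized with NumPy. This is only a sketch: `build_patterns` is a made-up name, it needs NumPy ≥ 1.20 for `sliding_window_view`, and it only produces full windows, so the original loop's wrap-around (negative) indexing for small y is deliberately not reproduced.

```python
import numpy as np

def build_patterns(array, numberOfEntries=30):
    """Percent change of each window entry relative to the window's first value.

    Returns a 2-D array: one row per full window, numberOfEntries columns.
    """
    a = np.asarray(array, dtype=float)
    # each row holds numberOfEntries + 1 consecutive values: the base plus the entries
    win = np.lib.stride_tricks.sliding_window_view(a, numberOfEntries + 1)
    base = win[:, :1]  # first value of each window, kept 2-D for broadcasting
    return (win[:, 1:] - base) / np.abs(base) * 100.0
```

One call replaces both nested loops, which is the kind of vectorization the comments below suggest.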

Doing this to an 8559-value array, which is small compared to the quantity of data I have, took 229.6792 seconds.

Is there a way to adapt this to multithreading, or any other way to improve the speed?

EDIT:

To explain better, I have this ohlc data:

                     open      high       low     close      volume
TimeStamp                                                            
2016-08-20 15:50:00  0.003008  0.003008  0.002995  0.003000    6.351215
2016-08-20 15:55:00  0.003000  0.003008  0.003000  0.003008    6.692174
2016-08-20 16:00:00  0.003008  0.003009  0.002996  0.003001   10.813029
2016-08-20 16:05:00  0.003001  0.003000  0.002991  0.002991    4.368509
2016-08-20 16:10:00  0.002991  0.002993  0.002989  0.002990    6.662944
2016-08-20 16:15:00  0.002990  0.003015  0.002989  0.003015    8.495640

I extract this as

array=df['close'].values

Then I apply this array to the function and it returns a list full of lists like this for this particular set of values,

[0.26, 0.03, -0.03, -0.04, 0.005]

These are the percent changes from each row to the beginning of the sample, and this is what I call a pattern. I can choose how many entries a pattern has.
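For a single window, the same kind of pattern can be reproduced directly in pandas. A sketch, assuming the Series below holds the close column of the sample above (the exact numbers are computed from the data, not taken from the list shown):

```python
import pandas as pd

# close prices from the OHLC sample above
close = pd.Series([0.003000, 0.003008, 0.003001, 0.002991, 0.002990, 0.003015])

# percent change of every row relative to the first value of the sample
pattern = ((close / close.iloc[0]) - 1.0) * 100.0
print(pattern.iloc[1:].round(2).tolist())
```

Keeping the data as a Series (or NumPy array) rather than a Python list is what makes the vectorized approaches suggested in the comments possible.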

Hope I'm more clear now...

hopieman
  • Multithreading is a dead-end, don't pursue it. Potentially multiprocessing, but the ideal approach would be vectorization of your loops. – roganjosh Jan 10 '18 at 20:53
  • For that reason, I want to tag this as numpy but it looks like you're just using python lists (despite saying you have np arrays)? – roganjosh Jan 10 '18 at 20:55
  • actually at this point I'm not using numpy, just pandas that returns a list. @roganjosh how can I use vectorization of loops? – hopieman Jan 10 '18 at 20:58
  • Then the question is pretty confused. You could use a rolling window on your series, keeping it in pandas. Your example should be representative of what you're trying to do but, at a guess, you don't want to pull this data out as a python list. – roganjosh Jan 10 '18 at 21:03
  • "pandas that return a list", do you mean a Pandas Series? If so, that behaves very similarly to a numpy array – scnerd Jan 10 '18 at 21:03
  • Just to add some additional info here -- may help here or for a future problem with speed. If you can stick with numpy consider using numba for just in time compiling: https://numba.pydata.org/ – rmilletich Jan 10 '18 at 21:30
  • I did edit my idea, hope you can understand better now. Is there any way of doing this with the pandas library, like creating another column or something? – hopieman Jan 10 '18 at 22:11
  • I compiled it in Numba, just simple compile, and it improved by 5 seconds the script..224.94061994552612 seconds – hopieman Jan 10 '18 at 22:59
  • I am not sure if that code really does what it is intended for. For example your first calculation (y=0,numberOfEntries=30,i=1) means that you are accessing the element -29... Is this really intended? – max9111 Jan 11 '18 at 13:35
  • yes it is! Because I want to compare the current element[y] to the last 30 values – hopieman Jan 11 '18 at 17:10

1 Answer


First, I would turn the while loop into a for loop, since iterating with range increments i faster than the manual bookkeeping.

for i in range(1,condition):

Now, since y doesn't change within your inner loop, you can optimize your computation from:

x = ((float(array[y-(numberOfEntries-i)])-array[y-numberOfEntries])/abs(array[y-numberOfEntries]))*100.00

to:

x = (float(array[y-(numberOfEntries-i)])-array[y-numberOfEntries]) * z

where z is precomputed before the while/for loop as:

z = 100.00 / abs(array[y-numberOfEntries])

why?

  • first, z is precomputed, so there is no repeated abs computation or array access inside the loop
  • second, z is the inverse of the divisor, so you can multiply by it. Multiplication is much faster than division.
  • third: no division by zero is possible inside the loop, since you're no longer dividing there. The ZeroDivisionError can only occur when computing z outside the loop, and has to be handled accordingly (wrap the z computation plus the loop in try/except and fall back to the sentinel 0.00000001 when it occurs; it should be equivalent)

so your inner loop could be:

try:
    z = 100.00 / abs(array[y-numberOfEntries])
    for i in range(1,condition):
        x = (float(array[y-(numberOfEntries-i)])-array[y-numberOfEntries]) * z
        pattern.append(x)
except ZeroDivisionError:
    pattern = [0.00000001] * numberOfEntries
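Putting the advice together, the whole function might look like this. A sketch only: `build_patterns_fast` is a made-up name, the original's wrap-around (negative) indexing and sentinel values are kept, and the replacement of exact 0.0 results is left out for brevity.

```python
def build_patterns_fast(array, numberOfEntries):
    """All patterns plus outcome slices, with z precomputed once per window."""
    condition = numberOfEntries + 1
    Patterns, outcomes = [], []
    for y in range(len(array)):
        base = array[y - numberOfEntries]  # window's reference value (wraps for small y)
        pattern = []
        try:
            z = 100.00 / abs(base)  # reciprocal factor, computed once per window
            for i in range(1, condition):
                pattern.append((float(array[y - (numberOfEntries - i)]) - base) * z)
        except ZeroDivisionError:
            # base was zero: fill the whole pattern with the sentinel
            pattern = [0.00000001] * numberOfEntries
        Patterns.append(pattern)
        outcomes.append(array[y + 5:y + 15])  # same outcome slice as the question's code
    return Patterns, outcomes
```

This keeps the division and the try/except out of the inner loop, which is where the original spent most of its time.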
Jean-François Fabre