4

Let's say I have a 1d numpy array with some noisy data series in it.

I want to establish a threshold to check when the values are high and when low. However, since the data is noisy, it doesn't make sense to just do

is_high = data > threshold

I was trying to set a tolerance to this threshold, as many control systems do (e.g. most heating and air-con systems). The idea is that the state of the signal only changes from low to high when passes the threshold plus a tolerance. The same way, the signal will only change from high to low if it gets below a threshold minus a tolerance. In other words:

def tolerance_filter(data, threshold, tolerance):
    currently_high = False  # start low
    signal_state = np.empty_like(data, dtype=np.bool)
    for i in range(data.size):
        # if we were high and we are getting too low, become low
        if currently_high and data[i] < (threshold-tolerance):
            currently_high = False
        # if we were low and are getting too high, become high
        elif not currently_high and data[i] > (threshold+tolerance):
            currently_high = True
        signal_state[i] = currently_high
    return signal_state

This function gives the output that I'm expecting. However, I wonder if there is any way to do this using the speed of numpy or scipy instead of a raw python for loop.

Any ideas? :)


UPDATE:

Thanks to Joe Kington's comment that pointed me to the hysteresis term, I found this other SO question. I'm afraid it's quite similar (duplicate?), and there's one nice working solution there by Bas Swinckels, too.

Anyway, I tried to implement the speedups suggested by Joe Kington (No idea if I did it right) and compared his solution, Fergal's and Bas' against my naïve approach. Here are the results (code below):

Proposed function in my original question
10 loops, best of 3: 22.6 ms per loop

Proposed function by Fergal
1000 loops, best of 3: 995 µs per loop

Proposed function by Bas Swinckels in the hysteresis question
1000 loops, best of 3: 1.05 ms per loop

Proposed function by Joe Kington using Cython
Approximate time cost of compiling: 2.195411
1000 loops, best of 3: 1.35 ms per loop

All the approaches in answers perform similarly (though Fergal's would need some extra steps to get the boolean vector!). Any consideration to add here? Also, I'm surprised that the Cython approach is slower (only slightly, though). In any case, I have to admit it's probably the fastest one to code if you don't know all numpy functions by heart...

Here is the code I used to benchmark the different options. Audits and revisions are more than welcome! :P (the Cython code is in the middle to force SO to keep all the code in the same scrollable chunk. Of course I had it in a different file)

# Naive approach from the original question
def tolerance_filter1(data, threshold, tolerance):
    currently_high = False  # start low
    signal_state = np.empty_like(data, dtype=np.bool)
    for i in range(data.size):
        # if we were high and we are getting too low, become low
        if currently_high and data[i] < (threshold-tolerance):
            currently_high = False
        # if we were low and are getting too high, become high
        elif not currently_high and data[i] > (threshold+tolerance):
            currently_high = True
        signal_state[i] = currently_high
    return signal_state
        
# Numpythonic approach suggested by Fergal
def tolerance_filter2(data, threshold, tolerance):
    a = np.zeros_like(data)
    a[ data < threshold-tolerance] = -1
    a[ data > threshold+tolerance] = +1
    wh = np.where(a != 0)[0]
    idx= np.diff( a[wh]) == 2
    #This variable indexes the values of data where data crosses
    #from below threshold-tol to above threshold+tol
    crossesAboveThreshold = wh[idx]
    return crossesAboveThreshold
    
# Approach suggested by Bas Swinckels and borrowed
# from the hysteresis question
def tolerance_filter3(data, threshold, tolerance, initial=False):
    hi = data >= threshold+tolerance
    lo_or_hi = (data <= threshold-tolerance) | hi
    ind = np.nonzero(lo_or_hi)[0]
    if not ind.size: # prevent index error if ind is empty
        return np.zeros_like(x, dtype=bool) | initial
    cnt = np.cumsum(lo_or_hi) # from 0 to len(x)
    return np.where(cnt, hi[ind[cnt-1]], initial)
    
#########################################################
## IN A DIFFERENT FILE (tolerance_filter_cython.pyx)
## So that StackOverflow shows a single scrollable code block :)

import numpy as np
import cython

@cython.boundscheck(False)
def tolerance_filter(data, float threshold, float tolerance):
    cdef bint currently_high = 0  # start low
    signal_state = np.empty_like(data, dtype=int)
    cdef double[:] data_view = data
    cdef long[:] signal_state_view = signal_state
    cdef int i = 0
    cdef int l = len(data)
    low = np.empty_like(data, dtype=bool)
    high = np.empty_like(data, dtype=bool)
    low = data < (threshold-tolerance)
    high = data > (threshold+tolerance)
    
    for i in range(l):
        # if we were high and we are getting too low, become low
        if currently_high and low[i]:
            currently_high = False
        # if we were low and are getting too high, become high
        elif not currently_high and high[i]:
            currently_high = True
        signal_state_view[i] = currently_high
    return signal_state

##################################################################
# BACK TO THE PYTHON FILE

import numpy as np
from time import clock
from datetime import datetime
from IPython import get_ipython
ipython = get_ipython()
time = np.arange(0,1000,0.01)
data = np.sin(time*3) + np.cos(time/7)*8 + np.random.normal(size=time.shape)*2
threshold, tolerance = 0, 4

print "Proposed function in my original question"
ipython.magic("timeit tolerance_filter1(data, threshold, tolerance)")

print "\nProposed function by Fergal"
ipython.magic("timeit tolerance_filter2(data, threshold, tolerance)")

print "\nProposed function by Bas Swinckels in the hysteresis question"
ipython.magic("timeit tolerance_filter3(data, threshold, tolerance)")

print "\nProposed function by Joe Kington using Cython"
start = datetime.now()
import pyximport; pyximport.install()
import tolerance_filter_cython
print "Approximate time cost of compiling: {}".format((datetime.now()-start).total_seconds())
tolerance_filter4 = tolerance_filter_cython.tolerance_filter
ipython.magic("timeit tolerance_filter4(data, threshold, tolerance)")
Community
  • 1
  • 1
mgab
  • 3,147
  • 4
  • 19
  • 30

2 Answers2

2

I think it's sometimes surprising to see how easy and Python-like are cython extensions. Here is your code transformed into cython. It can be called from Python, but should give you C++ speeds.

def tolerance_filter(data, float threshold, float tolerance):
    cdef bint currently_high = 0  # start low
    signal_state = np.empty_like(data, dtype=int)
    cdef float[:] data_view = data
    cdef int[:] signal_state_view = signa_state
    cdef int i = 0
    cdef int l = len(data)
    for i in range(l):
        # if we were high and we are getting too low, become low
        if currently_high and data[i] < (threshold-tolerance):
            currently_high = False
        # if we were low and are getting too high, become high
        elif not currently_high and data[i] > (threshold+tolerance):
            currently_high = True
        signal_state_view[i] = currently_high

There are a few things to notice:

  • Note the use in the function's start of typed memory views

  • The function was deliberately kept as close as possible to your original code. It can be speeded up, however, by turning off range checking (refer to the Cython docs), and calculating the upper and lower effective thresholds outside the loop.

Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
  • Yeah! Cython is this thing I'm always thinking I have to find a moment to play with... but never get to it... nice! +1 for the lesson! :P. Btw, when you mention _turning off range checking_, do you mean the `@cython..boundscheck(False)` decorator? I'm not sure I succeeded in my search.... – mgab May 28 '15 at 11:00
  • Yup, that is absolutely the decorator. Good luck! – Ami Tavory May 28 '15 at 11:24
0

I'm not sure this is better than your solution, but is it more numpythonic.

a = np.zeros_like(data)
a[ data < threshold-tol] = -1
a[ data > threshold+tol] = +1
wh = np.where(a != 0)
idx= np.diff( a[wh]) == 2
#This variable indexes the values of data where data crosses
#from below threshold-tol to above threshold+tol
crossesAboveThreshold = wh[idx]
Fergal
  • 494
  • 3
  • 12