I have a large set of measurement data (datetime, temperature) that I need to downsample before plotting it with bokeh (to keep the user interface smooth).
Because there are irregular physical phenomena I want to see, I can't just resample the data or take one sample out of 4 (or 10). I need a smarter approach to decide whether a sample has to be kept.
My idea is to take a reference sample and drop the following samples as long as they stay close to it (inside a window around the reference sample's value). When a sample falls outside the window, it is kept and it becomes the new reference sample for the following samples. I will end up with a dataset without a fixed frequency, but I don't think that is an issue.
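To make the rule concrete, here is a toy illustration on made-up values (the numbers are arbitrary; the window is the same 0.5 I use below):

# Toy illustration of the keep/drop rule on made-up values
values = [10.0, 10.2, 10.4, 10.7, 10.8, 10.1, 10.0]
window = 0.5

kept = [0]                        # the first sample is always the reference
ref = values[0]
for i, v in enumerate(values):
    if abs(v - ref) > window:     # sample falls outside the window around the reference
        kept.append(i)            # keep it ...
        ref = v                   # ... and it becomes the new reference

print(kept)                       # [0, 3, 5]: 10.7 is kept (more than 0.5 away from 10.0),
                                  # then 10.1 is kept (more than 0.5 away from 10.7)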
The following code is an implementation of my custom / fuzzy downsampling, run on synthetic data (a sine wave plus noise) that reproduces the behaviour of my real data rather well.
import numpy as np
import pandas as pd
# DataFrame / pandas Series creation
size = 300000
index = pd.date_range('01/12/2017 08:15:49', periods=size, freq="3s")
s = 10*np.sin(np.arange(0, 2*np.pi, (2*np.pi/size)))
noise = np.random.random(size)
val = s + noise
serie = pd.Series(data=val, index=index)
# fuzzy downsampling
window = 0.5
def fuzz():
    i = serie.index[0]
    fuzzy_index = [i]                     # the first sample is always kept
    ref = serie.loc[i]                    # initial reference value
    for ind, val in serie.iteritems():
        if abs(val - ref) > window:       # sample falls outside the window
            fuzzy_index.append(ind)       # keep it ...
            ref = serie.loc[ind]          # ... and it becomes the new reference
    return serie.loc[fuzzy_index]
# compute downsampling
sub_serie = fuzz()
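A quick sanity check of how much the series shrinks (I am not quoting numbers because the count depends on the random noise):

# compare the number of samples before and after the fuzzy downsampling
print(len(serie), len(sub_serie), len(sub_serie) / len(serie))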
This code works, but it is slow:
%timeit fuzz()
1 loop, best of 3: 8.45 s per loop
I can't play much with the window because it is tied to the accuracy of the temperature measurement.
My sample size is currently 300000, but it could grow to a couple of million in the near future.
Do you have any idea how to optimize/speed up this code?
Or maybe you have another idea for downsampling in a way that makes physical sense?
Or maybe there is a solution directly with the bokeh server, ideally dependent on the user's zoom level?
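For reference, the same rule written against the raw NumPy array instead of the pandas Series (a minimal, unprofiled sketch; fuzz_np is just a name for the example, and it is still a Python-level loop):

def fuzz_np(series, window):
    # Same keep/drop rule, iterating over a plain NumPy array instead of the Series
    values = series.values
    keep = np.zeros(len(values), dtype=bool)
    keep[0] = True                    # the first sample is the initial reference
    ref = values[0]
    for i in range(1, len(values)):
        if abs(values[i] - ref) > window:
            keep[i] = True            # keep the sample ...
            ref = values[i]           # ... and make it the new reference
    return series[keep]

sub_serie_np = fuzz_np(serie, window)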