8

Summary of Question:

Are there any easy-to-implement algorithms for reducing the number of points needed to represent a time series without altering how it appears in a plot?

Motivating Problem:

I'm trying to interactively visualize 10 to 15 channels of data logged from an embedded system at ~20 kHz. Logs can cover upwards of an hour of time which means that I'm dealing with between 1e8 and 1e9 points. Further, I care about potentially small anomalies that last for very short periods of time (i.e. less than 1 ms) such that simple decimation isn't an option.

Not surprisingly, most plotting libraries get a little sad if you do the naive thing and try to hand them arrays of data larger than the dedicated GPU memory. It's actually a bit worse than this on my system; using a vector of random floats as a test case, I'm only getting about 5e7 points out of the stock Matlab plotting function and Python + matplotlib before my refresh rate drops below 1 FPS.

Existing Questions and Solutions:

This problem is somewhat similar to a number of existing questions such as:

but deals with larger data sets and/or is more stringent about fidelity at the cost of interactivity (it would be great to get 60 FPS silky smooth panning and zooming, but realistically, I would be happy with 1 FPS).

Clearly, some form of data reduction is needed. There are two paradigms that I have found while searching for existing tools that solve my problem:

  • Decimate but track outliers: A good example of this is Matlab + dsplot (i.e. the tool suggested in the accepted answer of the first question I linked above). dsplot decimates down to a fixed number of evenly spaced points, but then adds back in outliers identified using the standard deviation of a high pass FIR filter. While this is probably a viable solution for several classes of data, it can struggle when there is substantial frequency content past the filter cutoff frequency, and it may require tuning (a rough sketch of the idea follows this list).

  • Plot min and max: With this approach, you divide the time series up into intervals corresponding to each horizontal pixel and plot just the minimum and maximum values in each interval. Matlab + Plot (Big) is a good example of this, but it uses an O(n) calculation of min and max, making it a bit slow by the time you get to 1e8 or 1e9 points. A binary search tree in a mex function or Python would solve this problem, but is complicated to implement (a minimal sketch of the bucketing also follows this list).
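For concreteness, here are minimal sketches of the two paradigms. Both are my own illustrations, not the actual code of dsplot or Plot (Big), and every function name below is made up. First, decimate-but-track-outliers, using a moving-mean residual as a crude stand-in for a high pass FIR filter:

function idx = decimateWithOutliers(y, nTarget, nSigma)
    % Keep nTarget evenly spaced samples, then add back any sample whose
    % high-pass residual exceeds nSigma standard deviations.
    n = numel(y);
    base = round(linspace(1, n, nTarget));    % even decimation
    resid = y(:) - movmean(y(:), 32);         % crude high pass: subtract local mean
    outliers = find(abs(resid) > nSigma * std(resid));
    idx = unique([base(:); outliers]);        % merged, sorted indices to plot
end

Second, per-pixel min/max bucketing, assuming roughly uniform sampling:

function [xr, yr] = minMaxDecimate(x, y, nPixels)
    % One bucket per horizontal pixel; keep only the min and max of each
    % bucket, in time order, so no vertical excursion is lost in the plot.
    n = numel(y);
    edges = round(linspace(1, n + 1, nPixels + 1));
    xr = zeros(2 * nPixels, 1);
    yr = zeros(2 * nPixels, 1);
    for k = 1:nPixels
        seg = edges(k):edges(k + 1) - 1;
        [~, iMin] = min(y(seg));
        [~, iMax] = max(y(seg));
        pick = sort([iMin, iMax]) + edges(k) - 1;   % back to global indices
        xr(2 * k - 1:2 * k) = x(pick);
        yr(2 * k - 1:2 * k) = y(pick);
    end
end

Both run in O(n) per redraw, like the tools above; precomputing min/max over a hierarchy of intervals (e.g. a binary tree) would bring repeated redraws down to roughly O(pixels * log n), at the cost of the implementation complexity mentioned above.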

Are there any simpler solutions that do what I want?

Edit (2018-02-18): Question refactored to focus on algorithms instead of tools implementing algorithms.

user2465116
  • I'm not convinced this question is off-topic - I think the argument goes both ways. That aside, Camilo's answer is sensible. If you want to plot a billion points it will be slow, regardless of the tool you use. If you need it to be faster, then you will need an algorithm to remove some points. For time-series, you can make a fairly convincing argument for any algorithm that removes all observations that correspond to a jump less than some minimum bound. You might need an additional heuristic to limit the number of removals per thousand... more art than science :-) – Colin T Bowers Feb 18 '18 at 09:35
  • After looking over the guidelines again, I think the off topic flag was probably appropriate. I have heavily edited the question to focus on easy to implement algorithms instead of existing tools. It's still fairly open ended, but should hopefully be a little more on topic. – user2465116 Feb 19 '18 at 03:10
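To make the jump-threshold heuristic from the first comment concrete, here is a minimal sketch (the function name is mine, and minJump would need per-channel tuning):

function keep = jumpThresholdFilter(y, minJump)
    % Keep a point only if it jumps at least minJump from the last kept
    % point; the endpoints are always kept.
    n = numel(y);
    keep = false(n, 1);
    keep([1 n]) = true;
    lastY = y(1);
    for k = 2:n-1
        if abs(y(k) - lastY) >= minJump
            keep(k) = true;
            lastY = y(k);   % measure the next jump from this kept point
        end
    end
end

Plotting x(keep), y(keep) then drops the low-information samples while preserving every jump of at least minJump.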

1 Answer

6

I had the very same problem displaying pressure time series from hundreds of sensors, with samples every minute for several years. In some cases (like when cleaning the data) I wanted to see all the outliers; in others I was more interested in the trend. So I wrote a function that can reduce the number of data points using two methods: Visvalingam and Douglas-Peucker. The first tends to remove outliers, and the second keeps them. I've optimized the function to work over large datasets. I did that after realizing that most plotting methods couldn't handle that many points, and the ones that could were decimating the dataset in a way that I couldn't control. The function is the following:

function [X, Y, indices, relevance] = lineSimplificationI(X,Y,N,method,option)
%lineSimplificationI Reduce the number of points of the line described by X
%and Y to N, preserving the most relevant ones.
%   Uses adapted versions of the Visvalingam and Douglas-Peucker algorithms.
%   The number of points of the line is reduced iteratively until reaching
%   N non-NaN points. Repeated NaN points in original data are deleted but
%   non-repeated NaNs are preserved to keep line breaks.
%   The two available methods are
%
%   Visvalingam: The relevance of a point is proportional to the area of
%   the triangle defined by the point and its two neighbors.
%   
%   Douglas-Peucker: The relevance of a point is proportional to the
%   distance between it and the straight line defined by its two neighbors.
%   Note that the implementation here is iterative but NOT recursive as in 
%   the original algorithm. This allows better handling of large data sets.
%
%   DIFFERENCES: Visvalingam tends to remove outliers while Douglas-Peucker
%   keeps them.
%
%   INPUTS:
%         X: X coordinates of the line points
%         Y: Y coordinates of the line points
%         N: Target number of non-NaN points to keep in the simplified line
%    method: Either 'Visvalingam' or 'DouglasPeucker' (default)
%    option: Either 'silent' (default) or 'verbose' if additional outputs
%            of the calculations are desired.
%
% OUTPUTS:
%         X: X coordinates of the simplified line points
%         Y: Y coordinates of the simplified line points
%   indices: Indices to the positions of the points preserved in the
%            original X and Y. Therefore Output X is equal to the input
%            X(indices).
% relevance: Relevance of the returned points. It can be used to further
%            simplify the line dynamically by keeping only points with 
%            higher relevance. But this will produce bigger distortions of 
%            the line shape than calling again lineSimplification with a 
%            smaller value for N, as removing a point changes the relevance
%            of its neighbors.
%
% Implementation by Camilo Rada - camilo@rada.cl
%

    if nargin < 3
        error('Line points positions X, Y and target point count N MUST be specified');
    end
    if nargin < 4
        method='DouglasPeucker';
    end
    if nargin < 5
        option='silent';
    end

    doDisplay=strcmp(option,'verbose');

    X=double(X(:));
    Y=double(Y(:));
    indices=1:length(Y);

    if length(X)~=length(Y)
        error('Vectors X and Y MUST have the same number of elements');
    end

    if N>=length(Y)
        relevance=ones(length(Y),1);
        if doDisplay
            disp('N is greater than or equal to the number of points in the line. Original X,Y were returned. Relevances were not computed.')
        end
        return
    end
    % Removing repeated NaN from Y
    % We find all the NaNs with another NaN to the left
    repeatedNaNs= isnan(Y(2:end)) & isnan(Y(1:end-1));
    %We also treat the first element as a repeated NaN if it is NaN
    repeatedNaNs=[isnan(Y(1)); repeatedNaNs(:)];
    Y=Y(~repeatedNaNs);
    X=X(~repeatedNaNs);
    indices=indices(~repeatedNaNs);

    %Removing trailing NaN if any
    if isnan(Y(end))
        Y=Y(1:end-1);
        X=X(1:end-1);
        indices=indices(1:end-1);
    end

    pCount=length(X);

    if doDisplay
        disp(['Initial point count = ' num2str(pCount)])
        disp(['Non repeated NaN count in data = ' num2str(sum(isnan(Y)))])
    end

    iterCount=0;

    while pCount>N
        iterCount=iterCount+1;
        % If the vertices of a triangle are at the points (x1,y1), (x2,y2) and
        % (x3,y3), the area of such a triangle is
        % area = abs((x1*(y2-y3)+x2*(y3-y1)+x3*(y1-y2))/2)
        % now the areas of the triangles defined by each point of X,Y and its two
        % neighbors are

        twiceTriangleArea =abs((X(1:end-2).*(Y(2:end-1)-Y(3:end))+X(2:end-1).*(Y(3:end)-Y(1:end-2))+X(3:end).*(Y(1:end-2)-Y(2:end-1))));

        switch method
            case 'Visvalingam'
                % In this case the relevance is given by the area of the
                % triangle formed by each point and its two neighbors
                relevance=twiceTriangleArea/2;
            case 'DouglasPeucker'
                % In this case the relevance is given by the minimum distance
                % from the point to the line formed by its two neighbors
                neighborDistances=ppDistance([X(1:end-2) Y(1:end-2)],[X(3:end) Y(3:end)]);
                relevance=twiceTriangleArea./neighborDistances;
            otherwise
                error(['Unknown method: ' method]);
        end
        relevance=[Inf; relevance; Inf];
        %We remove the pCount-N least relevant points as long as they are not contiguous

        [srelevance, sortorder]= sort(relevance,'descend');
        firstFinite=find(isfinite(srelevance),1,'first');
        startPos=uint32(firstFinite+N+1);
        toRemove=sort(sortorder(startPos:end));
        if isempty(toRemove)
            break;
        end

        %Now we have to deal with contiguous elements, as removing one will
        %change the relevance of its neighbors. Therefore we have to
        %identify pairs of contiguous points and only remove the one with
        %lesser relevance

        %contiguousToKeep will be true for an element if the next or the
        %previous element is also flagged for removal
        contiguousToKeep=[diff(toRemove(:))==1; false] | [false; (toRemove(1:end-1)-toRemove(2:end))==-1];
        notContiguous=~contiguousToKeep;

        %And the relevances associated to the elements flagged for removal
        contRel=relevance(toRemove);

        % Now we rearrange contiguousToKeep into two rows; if both rows are
        % true in a given column, we have two contiguous points that are
        % both flagged for removal. This process depends on the
        % rearrangement, as contiguous elements can end up in different
        % columns, so it has to be done twice to make sure no contiguous
        % elements are removed
        nContiguous=length(contiguousToKeep);

        for paddingMode=1:2
            %The rearrangement is only possible if we have an even number of
            %elements, so we add a dummy element at the end if needed
            if paddingMode==1
                if mod(nContiguous,2)
                    pcontiguous=[contiguousToKeep; false];
                    pcontRel=[contRel; -Inf];
                else
                    pcontiguous=contiguousToKeep;
                    pcontRel=contRel;
                end
            else
                if mod(nContiguous,2)
                    pcontiguous=[false; contiguousToKeep];
                    pcontRel=[-Inf; contRel];
                else
                    pcontiguous=[false; contiguousToKeep(1:end-1)];
                    pcontRel=[-Inf; contRel(1:end-1)];                    
                end
            end

            contiguousPairs=reshape(pcontiguous,2,[]);
            pcontRel=reshape(pcontRel,2,[]);

            %finding columns with contiguous elements
            contCols=all(contiguousPairs);
            if ~any(contCols) && paddingMode==2
                break;
            end
            %finding the row of the more relevant element of each pair
            [~, lesserElementRow]=max(pcontRel);

            %The index in contiguousToKeep of the first element of each pair is
            if paddingMode==1
                firstElementIdx=((1:size(contiguousPairs,2))*2)-1;
            else
                firstElementIdx=((1:size(contiguousPairs,2))*2)-2;
            end            

            % and the index in contiguousToKeep of the more relevant element
            % of each pair is
            lesserElementIdx=firstElementIdx+lesserElementRow-1;

            %now we unflag the more relevant element of each pair, so it is
            %kept and only its less relevant neighbor gets removed
            contiguousToKeep(lesserElementIdx(contCols))=false;
        end
        %and now we drop the spared contiguous points from the toRemove
        %list
        toRemove=toRemove(contiguousToKeep | notContiguous);

        if any(diff(toRemove(:))==1) && doDisplay
            warning([num2str(sum(diff(toRemove(:))==1)) ' contiguous elements removed in one iteration.'])
        end
        toRemoveLogical=false(pCount,1);
        toRemoveLogical(toRemove)=true;

        X=X(~toRemoveLogical);
        Y=Y(~toRemoveLogical);
        indices=indices(~toRemoveLogical);
        %keep the relevance output aligned with the surviving points, as
        %documented in the header
        relevance=relevance(~toRemoveLogical);

        pCount=length(X);
        nRemoved=sum(toRemoveLogical);
        if doDisplay
            disp(['Iteration ' num2str(iterCount) ', Point count = ' num2str(pCount) ' (' num2str(nRemoved) ' removed)'])
        end
        if nRemoved==0
            break;
        end
    end
end

function d = ppDistance(p1,p2)
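    % Euclidean distance between corresponding rows of the two-column
    % point lists p1 and p2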
    d=sqrt((p1(:,1)-p2(:,1)).^2+(p1(:,2)-p2(:,2)).^2);
end
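
A hypothetical usage sketch (the variable names and the target of 5000 points are just examples), including the relevance-based refinement described in the header:

% Simplify one logged channel down to ~5000 points and plot it
[Xs, Ys, idx, rel] = lineSimplificationI(X, Y, 5000, 'DouglasPeucker', 'verbose');
plot(Xs, Ys);

% For a coarser view, reuse the relevance output instead of calling the
% function again (at the cost of some extra distortion of the line shape)
coarse = rel >= median(rel);
plot(Xs(coarse), Ys(coarse));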
Camilo Rada
  • This appears to be a fairly nice heuristic. It's catching blips that a couple other heuristics miss. The biggest downside I see is the O(n*log(n)) complexity from sorting which makes it pretty hard to add more detail when zooming until you get down to a point where you can just plot everything. – user2465116 Feb 21 '18 at 07:39
  • @user2465116 For that you can use the relevance rating. So as you zoom you can pick more points without calling the function again, just by picking more/fewer points based on the relevance rating of each. – Camilo Rada Feb 21 '18 at 15:24