
I have four n-by-1 column vectors; elements that share an index belong to the same timestamp. I want to remove every "row" that is identical to the "row" immediately before it, as if this were applied repeatedly until nothing changes, so that each run of consecutive identical rows collapses to a single row.

For example, suppose the 4 vectors are

C1=[1;1;3;3;1;1];
C2=[2;2;4;4;2;2];
C3=[0;0;0;0;0;0];
C4=[5;5;6;6;5;5];

The desired output is

ans=[1;3;5];

because [C1(ans),C2(ans),C3(ans),C4(ans)] is an array with no row identical to its preceding row. In the above example, the resulting vectors look like:

C1=[1;3;1];
C2=[2;4;2];
C3=[0;0;0];
C4=[5;6;5];

"Rows" as in the rows when looking at the vectors concatenated column-wise with [C1,C2,C3,C4].

The question:

  • I understand how to do it with a loop. How do you do that with native Matlab functions?

Some notes:

The reason I started with 4 separated column vectors is as follows:

  1. I have one other n-by-1 vector with unique elements, from which I will remove the same "rows", using the indices removed from the other 4 vectors;

  2. in my application, the data is retrieved from elsewhere and stored into a MATLAB data type element by element for further processing, and I see a performance advantage when storing into four N-by-1 doubles rather than one N-by-4 double. This N is in the hundreds of thousands or millions.

n is typically only a few thousand at a time, but I need each filtering pass to finish within 1 second, and to be as short as possible.

(I want to learn the methods using native functions and compare performance.)


Note on performance

It's a bit hard to demonstrate performance differences on this one, since neither random data nor overly specific data is suitable. (By hard, I mean it's hard to do quickly.)

But in case anyone is interested, with a table of ~164k rows and only ~1k "unique" rows (with "rows" in quotes as before), the results from timeit() are as follows.

  • Cris' diff or method: 0.0028s

  • Wolfie's unique method: 0.0142s

  • Wolfie's arrayfun method: 0.3912s

  • Thomas' diff*ones method: 0.0057s

  • Thomas' recursion method: Unable to complete. MATLAB's RAM usage grew to ~70 GB within a minute of execution under timeit() and the UI froze on my Windows 10 machine, even though the machine had plenty of unused CPU.

  • Loop (with varargin to handle any number of columns): 3.6313s

The test functions include the concatenation step wherever a method does not process the columns directly.
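The data and harness behind these numbers are not shown here. As a rough sketch of how one such measurement might be set up with timeit(), the snippet below builds made-up data of a similar shape (about 164k rows in 1k runs) and times the diff-and-OR method. The data and all variable names are assumptions for illustration only, so the timings it gives will not match the figures above.

```matlab
% Synthetic data (an assumption, not the data used for the figures above):
% 1000 distinct consecutive values, each repeated 164 times.
base = (1:1000).';
C1 = repelem(base, 164);
C2 = repelem(2*base, 164);
C3 = zeros(numel(C1), 1);
C4 = repelem(3*base, 164);

% timeit expects a function handle that takes no input arguments.
f = @() find([true; diff(C1) | diff(C2) | diff(C3) | diff(C4)]);
t = timeit(f);    % measured run time in seconds
idx = f();        % indices of the first row of each run
```

Each run here has the same length, which real data would not, so this only illustrates the mechanics of the measurement.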

The loop version is:

function varargout = accum(varargin)

    for i=1:numel(varargin)
        varargout{i}=varargin{i}(1);    % assuming single column
    end

    for i=2:numel(varargin{1})  % assuming equal length
        TF=false;
        for j=1:numel(varargin)
            TF=TF||varargin{j}(i)~=varargin{j}(i-1);
        end
        if TF
            for j=1:numel(varargin)
                varargout{j}=[varargout{j};varargin{j}(i)];
            end
        end
    end

end
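The loop above grows every output vector one element at a time. A possible variant, sketched below under the same assumptions (non-empty column vectors of equal length), marks the rows to keep in a preallocated logical mask and indexes each input once at the end. The name accum_mask is made up for illustration, and this variant was not part of the timings above.

```matlab
function varargout = accum_mask(varargin)
% Hypothetical variant of accum: mark rows to keep, then index once.
    n = numel(varargin{1});          % assuming non-empty, equal-length columns
    keep = false(n, 1);
    keep(1) = true;                  % the first row is always kept
    for i = 2:n
        for j = 1:numel(varargin)
            if varargin{j}(i) ~= varargin{j}(i-1)
                keep(i) = true;      % row i differs from row i-1
                break                % no need to check the other columns
            end
        end
    end
    varargout = cell(1, numel(varargin));
    for j = 1:numel(varargin)
        varargout{j} = varargin{j}(keep);
    end
end
```

For the example vectors, [a,b,c,d] = accum_mask(C1,C2,C3,C4) should return the same four filtered vectors as accum.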

If you are writing another answer and need sample data, let me know. Otherwise, I'll skip pasting it, seeing little use in doing so.

Argyll
  • Just a question, I guess that you have no control on the original data ? Because in such a case a realtime database (firebase, rethinkDB,...) that can push the data to the app, could greatly reduce the amount of computation needed. – obchardon Dec 21 '20 at 09:30
  • @obchardon: Ya I have no control over the data source. In time I'll just use a different vendor. But my current source works to some extent and it's just inexpensive. So in order to get some work going, I need to use the current vendor and in any case the vendor provides something valuable -- being cheap -- in a unique way. – Argyll Jan 02 '21 at 14:18

3 Answers


I think that the following gives the desired output (not tested):

find([1; diff(C1) | diff(C2) | diff(C3) | diff(C4)])

diff is non-zero where two consecutive elements differ. Combining the results with logical OR marks a position whenever at least one of the vectors changes there, i.e. whenever the "row" differs from the one before it. The first element is always part of the output, hence the leading 1. find returns the indices of the non-zero elements.
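A minimal sketch of this applied to the example from the question, assuming the vectors contain no NaN values (which `|` cannot convert to logical). The vector T stands in for the fifth vector mentioned in the question; its values are made up for illustration.

```matlab
% Example data from the question.
C1 = [1;1;3;3;1;1];
C2 = [2;2;4;4;2;2];
C3 = [0;0;0;0;0;0];
C4 = [5;5;6;6;5;5];
T  = (101:106).';   % hypothetical fifth vector with unique elements

% keep(i) is true when row i differs from row i-1 in at least one vector.
keep = [true; diff(C1) | diff(C2) | diff(C3) | diff(C4)];
idx  = find(keep);  % [1;3;5]

% The same mask selects the matching elements of every vector.
C1f = C1(keep);     % [1;3;1]
Tf  = T(keep);      % [101;103;105]
```

Indexing with the logical mask keep gives the same result as indexing with idx, so find is only needed when the indices themselves are wanted.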

Cris Luengo

Here is an option that uses logical indexing to select rows of the matrix C = [C1,C2,C3,C4]:

C([true; abs(C(2:end,:)-C(1:end-1,:))*ones(size(C,2),1)>0],:)

which gives

ans =

   1   2   0   5
   3   4   0   6
   1   2   0   5

If you don't mind using a user function method, below might be another option, where myfun recursively computes the "unique" rows

function y = myfun(x)
  if size(x,1)==1
    y = x;
  else
    v = x(end,:);
    y = myfun(x(1:(end-1),:));
    if ~all(y(end,:)==v)
      y = [y;v];
    end
  end
end

such that

>> z = myfun(C)
z =

   1   2   0   5
   3   4   0   6
   1   2   0   5

where C = [C1,C2,C3,C4]

ThomasIsCoding
  • Thanks for the answer. But `unique(_,'rows')` shouldn't work because the 1st and 3rd elements in the example may be identical because there may be intermediate different "rows". I'll edit the example to show it. – Argyll Dec 20 '20 at 20:24
  • @Argyll Thanks for your feedback. I updated my answer by defining a user function. Hope it makes sense – ThomasIsCoding Dec 20 '20 at 20:58
  • The code makes sense. Thank you for the solution. I'll post a note re benchmark later after more answers. – Argyll Dec 20 '20 at 21:40
  • @Argyll I guess the user function might be slow if you have many rows. I added another method which should be much faster. – ThomasIsCoding Dec 20 '20 at 23:43

You could use a similar approach to the answer given by Cris (find(diff(...))), but make it more generic using unique.

Setup:

C1=[1;1;3;3;1;1];
C2=[2;2;4;4;2;2];
C3=[0;0;0;0;0;0];
C4=[5;5;6;6;5;5];

C = [C1,C2,C3,C4];

Method one:

[~,~,iu] = unique( C, 'rows' );
idx = find( [1; diff(iu)] );
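To make the role of iu concrete, here is a short worked sketch on the example matrix. The third output of unique labels each row of C with the position of that row in the sorted list of unique rows, so a change in label between consecutive rows means the row itself changed.

```matlab
C = [1 2 0 5; 1 2 0 5; 3 4 0 6; 3 4 0 6; 1 2 0 5; 1 2 0 5];

% iu(k) is the position of row k of C within the sorted unique rows.
[~, ~, iu] = unique(C, 'rows');   % iu = [1;1;2;2;1;1]

% A non-zero difference in labels marks a row that differs from the one before.
idx = find([1; diff(iu)]);        % idx = [1;3;5]
```

Rows 1 and 5 share the label 1 but are both kept, because only consecutive labels are compared.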

Alternatively, you could loop over the rows (written compactly with arrayfun) to find those where any element differs from the previous row.

Method two:

idx = find( [1, arrayfun( @(ii) any(C(ii,:) ~= C(ii-1,:)), 2:size(C,1) )] )
Wolfie