How to remove repeating rows in a double array so that no row is identical to its preceding row

Question

I have 4 n-by-1 column vectors, where elements sharing the same index belong to the same timestamp. I want to remove every "row" that is identical to its immediately preceding "row", as if this were applied repeatedly until nothing changes.

For example, suppose the 4 vectors are

C1=[1;1;3;3;1;1];
C2=[2;2;4;4;2;2];
C3=[0;0;0;0;0;0];
C4=[5;5;6;6;5;5];

The desired output is

ans=[1;3;5];

because [C1(ans),C2(ans),C3(ans),C4(ans)] is an array with no row identical to its preceding row. In the above example, the resulting vectors look like:

C1=[1;3;1];
C2=[2;4;2];
C3=[0;0;0];
C4=[5;6;5];

By "rows" I mean the rows of the column-wise concatenation [C1,C2,C3,C4].

The question:

I understand how to do it with a loop. How do you do that with native Matlab functions?

Some notes:

The reason I started with 4 separated column vectors is as follows:

I have one other n-by-1 vector with unique elements, from which I will remove the same "rows" using the indices removed from the other 4 vectors;
in my application, the data is retrieved from elsewhere and stored into a Matlab data type element by element for further processing, and I see a performance advantage when storing into 4 N-by-1 doubles rather than 1 N-by-4 double. This N is in the hundreds of thousands or millions.

n is typically only several thousand at a time, but I need each filtering pass to finish well within 1 second and to take as little time as possible.

(I want to learn the methods using native functions and compare performance.)

Note on performance

It's a bit hard to demonstrate performance differences on this one, since random data is not suitable and overly specific data is not suitable either. (By hard, I mean it's hard to do quickly.)

But in case anyone is interested, with a table of ~164k "rows" of which only ~1k are "unique", the results from timeit() are as follows.

Cris' diff or method: 0.0028s
Wolfie's unique method: 0.0142s
Wolfie's arrayfun method: 0.3912s
Thomas' diff*ones method: 0.0057s
Thomas' recursion method: Unable to complete. This blew up Matlab's RAM request to ~70GB within a minute of execution under timeit() and caused a UI freeze on my Win 10 machine, despite the machine having lots of unused CPU.
Loop (with varargin for the number of columns): 3.6313s

The timed functions include the concatenation step for the methods that do not process the columns directly.
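For anyone wanting to reproduce this kind of comparison, here is a rough sketch of how two of the vectorized methods might be timed with timeit(). The synthetic data (1000 random rows, each repeated 164 times with repelem) and the names `base`, `fDiff`, `fMat` are assumptions for illustration only. This is not the data or the test harness behind the numbers above.

```matlab
% Synthetic data with long runs of identical rows (assumed, not the real data):
% 1000 random rows, each repeated 164 times -> 164000-by-4
base = randi(10, 1000, 4);
C    = repelem(base, 164, 1);
C1 = C(:,1); C2 = C(:,2); C3 = C(:,3); C4 = C(:,4);

% timeit expects a function handle that takes no inputs
fDiff = @() find([1; diff(C1) | diff(C2) | diff(C3) | diff(C4)]);
fMat  = @() find([true; abs(C(2:end,:)-C(1:end-1,:))*ones(size(C,2),1) > 0]);

% Check that both methods agree before comparing their speed
sameResult = isequal(fDiff(), fMat());

tDiff = timeit(fDiff);
tMat  = timeit(fMat);
fprintf('diff/or:   %.4f s\n', tDiff);
fprintf('diff*ones: %.4f s\n', tMat);
```

Neighbouring rows of `base` can coincide by chance, so the number of kept rows may come out slightly below 1000.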

The loop version is:

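function varargout = accum(varargin)

    % The first element of every column is always kept
    for i=1:numel(varargin)
        varargout{i}=varargin{i}(1);    % assuming single column
    end

    for i=2:numel(varargin{1})  % assuming equal length
        % TF becomes true if any column differs from its previous element
        TF=false;
        for j=1:numel(varargin)
            TF=TF||varargin{j}(i)~=varargin{j}(i-1);
        end
        if TF
            % Append "row" i to every output (the outputs grow on each hit)
            for j=1:numel(varargin)
                varargout{j}=[varargout{j};varargin{j}(i)];
            end
        end
    end

end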

If you are writing another answer and need sample data, let me know. Otherwise, I'll skip pasting it, seeing little use in doing so.

Just a question, I guess that you have no control on the original data ? Because in such a case a realtime database (firebase, rethinkDB,...) that can push the data to the app, could greatly reduce the amount of computation needed. — obchardon, Dec 21 '20 at 09:30
@obchardon: Ya I have no control over the data source. In time I'll just use a different vendor. But my current source works to some extent and it's just inexpensive. So in order to get some work going, I need to use the current vendor and in any case the vendor provides something valuable -- being cheap -- in a unique way. — Argyll, Jan 02 '21 at 14:18

score 4 · Accepted Answer · answered Dec 20 '20 at 20:37

I think that the following gives the desired output (not tested):

find([1; diff(C1) | diff(C2) | diff(C3) | diff(C4)])

diff is non-zero where two consecutive elements are different. With logical OR, a position is flagged as soon as at least one of the vectors changes there. The first element is always part of the output. find returns the indices of the non-zero elements.
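As a minimal sketch of how the result might then be applied, the snippet below builds the same mask as a logical vector and uses it to filter all four vectors plus a fifth one. The names `keep` and `T` (standing in for the extra n-by-1 vector mentioned in the question) are assumptions for illustration. Comparing each diff against zero explicitly is a small variation on the one-liner above, not part of the original answer.

```matlab
% Example data from the question
C1 = [1;1;3;3;1;1];
C2 = [2;2;4;4;2;2];
C3 = [0;0;0;0;0;0];
C4 = [5;5;6;6;5;5];
T  = (101:106).';   % hypothetical fifth vector, e.g. timestamps

% true where a "row" differs from its predecessor; the first row is always kept
keep = [true; diff(C1)~=0 | diff(C2)~=0 | diff(C3)~=0 | diff(C4)~=0];

idx = find(keep);   % indices of the rows that survive

% Apply the same mask to every vector
C1 = C1(keep);
C2 = C2(keep);
C3 = C3(keep);
C4 = C4(keep);
T  = T(keep);
```

Indexing with the logical mask directly avoids the find call when only the filtered vectors are needed.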

ThomasIsCoding · Answer 2 · 2020-12-20T23:42:58.500

2

Here is an option that uses a logical index to subset the rows of the matrix C

C([true; abs(C(2:end,:)-C(1:end-1,:))*ones(size(C,2),1)>0],:)

which gives

ans =

   1   2   0   5
   3   4   0   6
   1   2   0   5
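The same row mask could also be sketched with a direct comparison and any(...,2) instead of the abs/ones product. This is a swapped-in variant rather than the code from this answer, and the names `changed`, `keep` and `R` are assumptions for illustration.

```matlab
% Example matrix from the question, C = [C1,C2,C3,C4]
C = [1 2 0 5
     1 2 0 5
     3 4 0 6
     3 4 0 6
     1 2 0 5
     1 2 0 5];

% changed(k) is true when row k+1 differs from row k in at least one column
changed = any(C(2:end,:) ~= C(1:end-1,:), 2);

% The first row is always kept
keep = [true; changed];

R = C(keep,:);
```

For this example R holds the same three rows as shown above.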

If you don't mind using a user function method, below might be another option, where myfun recursively computes the "unique" rows

function y = myfun(x)
  if size(x,1)==1
    y = x;
  else
    v = x(end,:);
    y = myfun(x(1:(end-1),:));
    if ~all(y(end,:)==v)
      y = [y;v];
    end
  end
end

such that

>> z = myfun(C)
z =

   1   2   0   5
   3   4   0   6
   1   2   0   5

where C = [C1,C2,C3,C4]

edited Dec 20 '20 at 23:42

answered Dec 20 '20 at 20:20

ThomasIsCoding


Thanks for the answer. But `unique(_,'rows')` shouldn't work because the 1st and 3rd elements in the example may be identical because there may be intermediate different "rows". I'll edit the example to show it. – Argyll Dec 20 '20 at 20:24
@Argyll Thanks for your feedback. I updated my answer by defining a user function. Hope it makes sense – ThomasIsCoding Dec 20 '20 at 20:58
The code makes sense. Thank you for the solution. I'll post a note re benchmark later after more answers. – Argyll Dec 20 '20 at 21:40
@Argyll I guess the user function might be slow if you have many rows. I added another method which should be much faster. – ThomasIsCoding Dec 20 '20 at 23:43

Wolfie · Answer 3 · 2020-12-21T10:18:49.880

You could use a similar approach to the answer given by Cris (find(diff(...))), but make it more generic using unique.

Setup:

C1=[1;1;3;3;1;1];
C2=[2;2;4;4;2;2];
C3=[0;0;0;0;0;0];
C4=[5;5;6;6;5;5];

C = [C1,C2,C3,C4];

Method one:

[~,~,iu] = unique( C, 'rows' );
idx = find( [1; diff(iu)] );
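To see why this works, it may help to look at the intermediate values. The sketch below repeats method one on the example data and also requests the first output of unique, here named `U`, which is an addition for illustration.

```matlab
C = [1 2 0 5
     1 2 0 5
     3 4 0 6
     3 4 0 6
     1 2 0 5
     1 2 0 5];

% U holds the distinct rows in sorted order; iu gives, for each row of C,
% the index of its matching row in U, so that C equals U(iu,:)
[U,~,iu] = unique( C, 'rows' );
% U  = [1 2 0 5; 3 4 0 6]
% iu = [1;1;2;2;1;1]

% Consecutive identical rows share the same label, so diff(iu) is zero there
idx = find( [1; diff(iu)] );
% idx = [1;3;5]
```

Because each row collapses to a single label, the comparison runs on one vector regardless of how many columns C has.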

Alternatively, you could loop over the rows (written compactly with arrayfun) to find those where any element differs from the previous row

Method two:

idx = find( [1, arrayfun( @(ii) any(C(ii,:) ~= C(ii-1,:)), 2:size(C,1) )] )
