Find unique rows of a cell array considering all possible permutations on each row

Question

I have cell array A of dimension m * k.

I want to keep the rows of A unique up to an order of the k cells.

The "tricky" part is "up to an order of the k cells": consider the k cells in the ith row of A, A(i,:); there could be a row j of A, A(j,:), that is equivalent to A(i,:) up to a re-ordering of its k cells, meaning that, for example, if k=4 it could be that:

A{i,1}=A{j,2}
A{i,2}=A{j,3}
A{i,3}=A{j,1}
A{i,4}=A{j,4}
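
To make the equivalence concrete, here is a minimal sketch with k=3 and made-up cell contents (rowI, rowJ, rowK and isEquivalent are illustrative names, not part of the code further down). It brute-forces every ordering with perms, so it only illustrates the definition and is not meant as a fast way to deduplicate A:

```matlab
% Toy rows with k = 3 cells each (values are made up for illustration).
rowI = {[1 2; 3 4], [5 6], [7; 8]};
rowJ = {[7; 8], [1 2; 3 4], [5 6]};   % same cells as rowI, different order
rowK = {[1 2; 3 4], [5 6], [9; 9]};   % differs from rowI in one cell

% Brute force: b is equivalent to a if some re-ordering of b equals a.
P = perms(1:numel(rowI));             % all k! orderings of the cell positions
isEquivalent = @(a, b) any(arrayfun(@(r) isequal(a, b(P(r,:))), 1:size(P,1)));

disp(isEquivalent(rowI, rowJ))   % rowI vs. the re-ordered copy
disp(isEquivalent(rowI, rowK))   % rowI vs. the row with a different cell
```

Since the number of orderings grows as k!, this check is only practical for very small k.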

What I am doing at the moment is:

G=[0 -1 1; 0 -1 2; 0 -1 3; 0 -1 4; 0 -1 5; 1 -1 6; 1 0 6; 1 1 6; 2 -1 6; 2 0 6; 2 1 6; 3 -1 6; 3 0 6; 3 1 6]; 
h=7;
M=reshape(G(nchoosek(1:size(G,1),h),:),[],h,size(G,2));
A=cell(size(M,1),2);
for p=1:size(M,1)
    A{p,1}=squeeze(M(p,:,:)); 
    left=~ismember(G, A{p,1}, 'rows');
    A{p,2}=G(left,:); 
end

%To find equivalent rows up to order I use a double loop (VERY slow).
indices=[]; 
for j=1:size(A,1)
    if ismember(j,indices)==0 %if we have not already identified j as a duplicate
        for i=1:size(A,1)
            if i~=j
               if (isequal(A{j,1},A{i,1}) || isequal(A{j,1},A{i,2}))...
                  &&...
                  (isequal(A{j,2},A{i,1}) || isequal(A{j,2},A{i,2}))...
                  indices=[indices;i]; 
               end
            end
        end
    end
end
A(indices,:)=[];

It works but it is too slow. I am hoping that there is something quicker that I can use.

Hi! The question is unfinished. You added "What I am doing at the moment is:", it lacks the part of "and it doesn't work because:" — Ander Biguri, Oct 10 '16 at 10:50
Can you be a bit more descriptive? I don't know what * up to an order of the k sub-cells* means and I can not induce it from the code. — Ander Biguri, Oct 10 '16 at 12:44
In your example for all `p`, the size of `A{p,1}` and `A{p,2}` are equal. Is it gonna be the case always? In other words, is `G` always going to be divided to half between left and right cells? — Erfan, Oct 10 '16 at 13:38
@erfan no, it is not always the case. The sub-cells can have different measures. — TEX, Oct 10 '16 at 15:23
could you give us an idea on how those dimensions relate? how big is A and how many dimensions have to be considered (like 4 or 90%)? how many duplicates do you expect in the matrix. How are the subcell dimensions and how different are they? Those 4 equality check will take time if the subcells are big. Is there maybe a special index that could be checked? — Finn, Oct 11 '16 at 08:55
@Finn: A can have at most around 40,000 rows, k can be at most 5, each subcell of A can have at most 7 rows, each subcell of A has always 3 columns. — TEX, Oct 11 '16 at 10:39

score 6 · Accepted Answer · edited May 23 '17 at 12:18

I'd like to propose another idea, which has some conceptual resemblance to erfan's. My idea uses hash functions, and specifically, the GetMD5 FEX submission.

The main task is how to "reduce" each row in A to a single representative value (such as a character vector) and then find unique entries of this vector.
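
As a rough sketch of that reduction without any MEX dependency, the snippet below uses mat2str as a stand-in for the hash. This is an assumption made purely for illustration: it is not the GetMD5 approach benchmarked further down, and the toy A, keys and rowKey are made-up names and values.

```matlab
% Toy input: rows 1 and 2 hold the same cells in a different order.
A = {[0 -1 1; 0 -1 2], [1 0 6];
     [1 0 6],          [0 -1 1; 0 -1 2];
     [2 1 6],          [3 0 6]};

% One text key per cell; mat2str stands in for the hash function here.
keys = cellfun(@mat2str, A, 'UniformOutput', false);

% Sort the keys within each row so that cell order no longer matters,
% then join them into a single representative string per row.
rowKey = cell(size(A,1), 1);
for r = 1:size(A,1)
    rowKey{r} = strjoin(sort(keys(r,:)), '|');
end

% Keep the first occurrence of each representative string.
[~, ia] = unique(rowKey, 'stable');
out = A(ia, :);
```

Swapping mat2str for a real hash function changes only the line that builds keys; the sort-then-unique part stays the same.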

Judging by the benchmark against the other suggestions, my answer doesn't perform as well as one of the alternatives, but I think its raison d'être lies in the fact that it is completely data-type agnostic (within the limitations of GetMD5¹), that the algorithm is very straightforward to understand, that it is a drop-in replacement as it operates on A, and that the resulting array is exactly equal to the one obtained by the original method. Of course, this requires a compiler to get working (GetMD5 is a MEX file) and carries a risk of hash collisions (which might affect the result in VERY VERY rare cases).

Here are the results from a typical run on my computer, followed by the code:

Original method timing:     8.764601s
Dev-iL's method timing:     0.053672s
erfan's method timing:      0.481716s
rahnema1's method timing:   0.009771s

function q39955559
G=[0 -1 1; 0 -1 2; 0 -1 3; 0 -1 4; 0 -1 5; 1 -1 6; 1 0 6; 1 1 6; 2 -1 6; 2 0 6; 2 1 6; 3 -1 6; 3 0 6; 3 1 6]; 
h=7;
M=reshape(G(nchoosek(1:size(G,1),h),:),[],h,size(G,2));
A=cell(size(M,1),2);
for p=1:size(M,1)
    A{p,1}=squeeze(M(p,:,:)); 
    left=~ismember(G, A{p,1}, 'rows');
    A{p,2}=G(left,:); 
end

%% Benchmark:
tic
A1 = orig_sort(A);
fprintf(1,'Original method timing:\t\t%fs\n',toc);

tic
A2 = hash_sort(A);
fprintf(1,'Dev-iL''s method timing:\t\t%fs\n',toc);

tic
A3 = erfan_sort(A);
fprintf(1,'erfan''s method timing:\t\t%fs\n',toc);

tic
A4 = rahnema1_sort(G,h);
fprintf(1,'rahnema1''s method timing:\t%fs\n',toc);

assert(isequal(A1,A2))
assert(isequal(A1,A3))
assert(isequal(numel(A1),numel(A4)))  % This is the best test I could come up with...

function out = hash_sort(A)
% Hash the contents:
A_hashed = cellfun(@GetMD5,A,'UniformOutput',false);
% Sort hashes of each row:
A_hashed_sorted = A_hashed;
for ind1 = 1:size(A_hashed,1)
  A_hashed_sorted(ind1,:) = sort(A_hashed(ind1,:));
end
A_hashed_sorted = cellstr(cell2mat(A_hashed_sorted));
% Find unique rows:
[~,ia,~] = unique(A_hashed_sorted,'stable');
% Extract relevant rows of A:
out = A(ia,:);

function A = orig_sort(A)
%To find equivalent rows up to order I use a double loop (VERY slow).
indices=[]; 
for j=1:size(A,1)
    if ismember(j,indices)==0 %if we have not already identified j as a duplicate
        for i=1:size(A,1)
            if i~=j
               if (isequal(A{j,1},A{i,1}) || isequal(A{j,1},A{i,2}))...
                  &&...
                  (isequal(A{j,2},A{i,1}) || isequal(A{j,2},A{i,2}))...
                  indices=[indices;i]; 
               end
            end
        end
    end
end
A(indices,:)=[];

function C = erfan_sort(A)
STR = cellfun(@(x) num2str((x(:)).'), A, 'UniformOutput', false);
[~, ~, id] = unique(STR);
IC = sort(reshape(id, [], size(STR, 2)), 2);
[~, col] = unique(IC, 'rows');
C = A(sort(col), :); % 'sort' makes the outputs exactly the same.

function A1 = rahnema1_sort(G,h)
idx = nchoosek(1:size(G,1),h);
%concatenate complements
M = [G(idx(1:size(idx,1)/2,:).',:), G(idx(end:-1:size(idx,1)/2+1,:).',:)];
%convert to cell so A1 is unique rows of A
A1 = mat2cell(M,repmat(h,size(idx,1)/2,1),repmat(size(G,2),2,1));

¹ - If more complicated data types need to be hashed, one can use the DataHash FEX submission instead, which is somewhat slower.

Nice! Actually, to be most efficient while covering the general case, one should use your `GetMD5` idea with my method of sorting! — Erfan, Oct 19 '16 at 10:18

Erfan · Answer 2 · 2016-10-14T11:13:47.683

Stating the problem: The ideal choice for identifying unique rows in an array is C = unique(A,'rows'). But two major problems prevent us from using this function here. The first is that you want to take into account all possible permutations of each row when comparing it to other rows. If A has 5 columns, that means checking 120 different re-arrangements per row! Sounds impossible.

The second issue is related to unique itself: it does not accept cell arrays other than cell arrays of character vectors. So you cannot simply pass A to unique and get what you expect.

Why look for an alternative? As you know, because the current approach is very slow:

With nested loop method:
------------------- Create the data (first loop):
Elapsed time is 0.979059 seconds.
------------------- Make it unique (second loop):
Elapsed time is 14.218691 seconds.

My solution:

1. Generate another cell array containing the same cells, but converted to strings (STR).
2. Find the index of all unique elements there (id).
3. Generate the associated matrix with the unique indices and sort each row (IC).
4. Find the unique rows (col).
5. Collect the corresponding rows of A (C).

And this is the code:

disp('------------------- Create the data:')
tic
G = [0 -1 1; 0 -1 2; 0 -1 3; 0 -1 4; 0 -1 5; 1 -1 6; 1 0 6; ...
    1 1 6; 2 -1 6; 2 0 6; 2 1 6; 3 -1 6; 3 0 6; 3 1 6];
h = 7;
M = reshape(G(nchoosek(1:size(G,1),h),:),[],h,size(G,2));
A = cell(size(M,1),2);
for p = 1:size(M,1)
    A{p, 1} = squeeze(M(p,:,:));
    left = ~ismember(G, A{p,1}, 'rows');
    A{p,2} = G(left,:);
end
STR = cellfun(@(x) num2str((x(:)).'), A, 'UniformOutput', false);
toc

disp('------------------- Make it unique (vectorized):')
tic
[~, ~, id] = unique(STR);
IC = sort(reshape(id, [], size(STR, 2)), 2);
[~, col] = unique(IC, 'rows');
C = A(sort(col), :); % 'sort' makes the outputs exactly the same.
toc

Performance check:

------------------- Create the data:
Elapsed time is 1.664119 seconds.
------------------- Make it unique (vectorized):
Elapsed time is 0.017063 seconds.

Although initialization needs a bit more time and memory, this method is much faster at finding unique rows while accounting for all permutations. Execution time is almost insensitive to the number of columns in A.
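
To see what the intermediate variables hold, the same pipeline can be traced on a tiny made-up input (the 3-by-2 cell array below is an illustrative assumption, not the data from the question):

```matlab
% Toy input: row 2 is row 1 with its two cells swapped.
A = {[1 2], [3 4];
     [3 4], [1 2];
     [5 6], [1 2]};

% Step 1: one string per cell.
STR = cellfun(@(x) num2str((x(:)).'), A, 'UniformOutput', false);

% Step 2: replace each string by the index of its unique value.
[~, ~, id] = unique(STR);

% Step 3: one row of indices per row of A, sorted so order is ignored.
IC = sort(reshape(id, [], size(STR, 2)), 2);   % IC = [1 2; 1 2; 1 3]

% Steps 4 and 5: unique index rows, then the matching rows of A.
[~, col] = unique(IC, 'rows');
C = A(sort(col), :);
```

Rows 1 and 2 of IC are identical after sorting, so only one of them survives and C has two rows.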

score 3 · Answer 3 · edited May 23 '17 at 10:27

It seems that G is a misleading point. Here is the result of nchoosek for a small input:

idx=nchoosek(1:4,2)
idx =

   1   2
   1   3
   1   4
   2   3
   2   4
   3   4

The first row is the complement of the last row.

The second row is the complement of the second-to-last row.

.....

So if we extract rows {1, 2} from G, the complement will be rows {3, 4}, and so on. In other words, if we assume G has 4 rows, then G(idx(1,:),:) is the complement of G(idx(end,:),:).

Since the rows of G are all unique (and, in this example, h is half of size(G,1)), all A{m,n} have the same size.

A{p,1} and A{p,2} are complements of each other, and the number of unique rows of A is size(idx,1)/2.
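
The mirror property can be checked numerically. The sketch below uses small stand-in values (n = 6 and h = 3 are assumptions chosen for illustration) and relies on n being exactly 2*h, as in the question where G has 14 rows and h = 7:

```matlab
n = 6; h = 3;                      % small stand-ins for size(G,1) and h
idx = nchoosek(1:n, h);            % one combination per row
nComb = size(idx, 1);

isComplement = false(nComb, 1);
for r = 1:nComb
    mirror = idx(nComb - r + 1, :);   % the row at the same distance from the end
    % Row r and its mirror are complements if together they cover 1:n exactly.
    isComplement(r) = isequal(sort([idx(r,:), mirror]), 1:n);
end
disp(all(isComplement))
```

If n differs from 2*h, a combination and its complement have different lengths, so this pairing no longer applies.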

So there is no need for any loop or further comparison:

h=7;
G = [0 -1 1; 0 -1 2; 0 -1 3; 0 -1 4; 0 -1 5; 1 -1 6; 1 0 6; ...
    1 1 6; 2 -1 6; 2 0 6; 2 1 6; 3 -1 6; 3 0 6; 3 1 6];
idx = nchoosek(1:size(G,1),h);
%concatenate complements
M = [G(idx(1:size(idx,1)/2,:).',:), G(idx(end:-1:size(idx,1)/2+1,:).',:)];
%convert to cell so A1 is unique rows of A
A1 = mat2cell(M,repmat(h,size(idx,1)/2,1),repmat(size(G,2),2,1));

Update: The above method works best. However, if the idea is to get A1 from A rather than from G, I suggest the following method based on erfan's. Instead of converting the arrays to strings, we can work with the arrays directly:

At = A.';   % A.'{:} is Octave-only syntax; MATLAB needs this intermediate step
STR=reshape([At{:}],numel(A{1,1}),numel(A)).';
[~, ~, id] = unique(STR,'rows');

IC = sort(reshape(id, size(A, 2),[]), 1).';
[~, col] = unique(IC, 'rows');
C1 = A(sort(col), :);

Since I use Octave, I currently cannot run MEX files, so I cannot test Dev-iL's method.

Result:

erfan method (string):  4.54718 seconds.
rahnema1 method (array): 0.012639 seconds.

Online Demo

answered Oct 16 '16 at 07:12 · rahnema1

Maybe I got it wrong, but I think the idea is not to create `A1` from `G`, but from `A`. So `G` and `M` are just for creating arbitrary data for the question. Assume you have `A` as input, what would you do? – EBH Oct 16 '16 at 12:22
@EBH I don't necessarily agree with you - it is a question of what the "real" input data is. The whole beauty of describing the bigger picture is that somebody might suggest a completely different approach which works much better. I like exactly this aspect of the above answer. @ rahnema - How do you make sure that your `A1` is equivalent to the OP's `A`? Since the order of rows is different, I think you should include some validation code. – Dev-iL Oct 16 '16 at 12:37
@EBH If your assumption is correct , I possibly do what erfan has done! – rahnema1 Oct 16 '16 at 12:37
@Dev-iL simple nested loop `[r c] = size(A);for m = 1:r;for n = 1:c; if ~isequal(A{m,n} ,A1{m,n});disp('something is wrong!');break;end;end;end;` – rahnema1 Oct 16 '16 at 12:54
@rahnema1 When the order of elements is the same, you can just `isequal(A,A1)` - there's no need for loops. Since `isequal` (for the complete `A` array) didn't work with your `A1` but your `A1` was the correct size, I assumed it was because of a different ordering of elements. The loop you suggested is doomed because of the aforementioned ordering of elements. – Dev-iL Oct 16 '16 at 13:18
@dev-il Well, my chrome nany is angry now, so I'll get into it after chag, but the main question is if it solves the OP's problem. – EBH Oct 16 '16 at 13:26
@EBH what is the OP's problem? – rahnema1 Oct 16 '16 at 13:34
@EBH I'm looking forward to your solution :) Happy סוכות! – Dev-iL Oct 16 '16 at 13:36
I think erfan's answer starts from `STR = cellfun...` – EBH Oct 16 '16 at 13:37
@rahnema1 I like your last edit, that the kind of solution I had in mind and I think it's the best solution so far (for getting `A1` from `A`...), you got my +1. – EBH Oct 17 '16 at 18:11
@EBH Thanks for your useful comments and motivating **upvotes**! – rahnema1 Oct 17 '16 at 19:18
The only problem is that [_"The sub-cells can have different measures"_](http://stackoverflow.com/questions/39955559/find-unique-rows-of-a-cell-array-considering-all-possible-permutations-on-each-r/40067837?noredirect=1#comment67202931_39955559) – EBH Oct 18 '16 at 05:14
@EBH `string` and `hash (mex)` are general-purpose solutions that fit general cases but are not necessarily the most efficient ones. The question should specify precisely how the data are generated so that more specific solutions can be provided – rahnema1 Oct 18 '16 at 07:25
@rahnema1 Nice work! my thoughts had the same direction when I started. One point, that in MATLAB `A.'{:}` syntax does not work. It should be done in two steps. The only drawback of your method is being limited to the example. – Erfan Oct 19 '16 at 10:13
