Count unique rows in a cell full of vectors

Question

I have a cell in MATLAB where each element contains a vector of a different length

e.g.

C = {[1 2 3], [2 4 5 6], [1 2 3], [6 4], [7 6 4 3], [4 6], [6 4]}

As you can see, some of the vectors are repeated; others are unique.

I want to count the number of times each vector occurs and return the counts so that I can populate a table in a GUI, where each row is a unique combination and the data shows how many times each combination occurs.

e.g.

            Count
"[1 2 3]"     2
"[6 4]"       2
"[2 4 5 6]"   1
"[7 6 4 3]"   1
"[4 6]"       1

I should say that the order of the numbers in each vector is important i.e. [6 4] is not the same as [4 6].

Any thoughts how I can do this fairly efficiently?

Thanks to people who have commented so far. As @Divakar kindly pointed out, I forgot to mention that the values in the vector can be more than one digit long, e.g. [46, 36 28]. My original code would concatenate the vector [1 2 3 4] into 1234 and then use hist to do the counting. Of course, this falls apart when you go above single digits, as you can't tell the difference between [1, 2, 3, 4] and [12, 34].

When you said "efficiently", do you mean runtime efficiency or something else? — Divakar, Sep 26 '14 at 18:57
Yep - fast run time. Reason being I'm going to have to apply this to very large cells and hence lots of vectors. :-) — Mark, Sep 26 '14 at 19:09

Divakar · Accepted Answer · 2014-09-27T00:07:07.793

6

You can convert all the entries to char, then to a 2D numeric array, and finally use unique(...,'rows') to get a label for each unique row. The labels then give you the counts.

C = {[46, 36 28], [2 4 5 6], [46, 36 28], [6 4], [7 6 4 3], [4 6], [6 4]} %// Input

char_array1 = char(C{:})-0; %// pad the vectors into a char array, then convert it back to a 2D numeric array
[~,unqlabels,entry_labels] = unique(char_array1,'rows'); %// get unique rows
count = histc(entry_labels,1:max(entry_labels)); %// counts of each unique row

For the purpose of presenting the output in a format as asked in the question, you can use this -

out = [C(unqlabels)' num2cell(count)];

Output -

out = 
    [1x4 double]    [1]
    [1x2 double]    [1]
    [1x2 double]    [2]
    [1x4 double]    [1]
    [1x3 double]    [2]

and display the unique rows with celldisp -

ans{1} =
     2     4     5     6
ans{2} =
     4     6
ans{3} =
     6     4
ans{4} =
     7     6     4     3
ans{5} =
    46    36    28
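As a small illustration of what the conversion step above produces (the two vectors are taken from the example input, and the variable name padded is used only in this sketch): char pads the shorter rows with spaces, and converting back to numbers turns each space into 32.

```matlab
% char pads the shorter vector with spaces so that both rows have equal length;
% double(...) does the same job as the "-0" above and turns characters back into numbers
padded = double(char([6 4], [7 6 4 3]));
% padded is [6 4 32 32; 7 6 4 3]
```

This is why the approach assumes non-negative integer values, and why a vector that genuinely ends in 32 would look the same as a shorter vector that was padded.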

Edit: If you have negative numbers in there, you need to do a little more work to set up char_array1 as shown here; the rest of the code stays the same -

lens = cellfun(@numel,C);
mat1(max(lens),numel(lens))=0;
mat1(bsxfun(@ge,lens,[1:max(lens)]')) = horzcat(C{:});
char_array1 = mat1';
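Both padding schemes (spaces, i.e. 32, in the first version and zeros in this edit) cannot tell a vector apart from the same vector followed by the pad value. The following is a minimal sketch of one possible workaround, not part of the original answer: prepend each vector's length as an extra key column before calling unique. The input and the names padded, keyed, rep_idx and labels are illustrative, and the sketch assumes the vectors contain no NaN values.

```matlab
C = {[1 2 3 0], [2 4 5 6], [1 2 3], [6 4], [6 4]};   % illustrative input

lens = cellfun(@numel, C);               % length of each vector
padded = zeros(numel(C), max(lens));     % zero-padded matrix, one row per vector
for k = 1:numel(C)
    padded(k, 1:lens(k)) = C{k};
end

% With the length in the first column, [1 2 3 0] and [1 2 3] become different rows
keyed = [lens(:) padded];
[~, rep_idx, labels] = unique(keyed, 'rows');
counts = accumarray(labels(:), 1);       % occurrences of each unique row

out = [C(rep_idx)' num2cell(counts)];    % one representative vector per row, with its count
```

The loop is slower than the bsxfun indexing above; it is used here only to keep the sketch easy to read.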

edited Sep 27 '14 at 00:07

answered Sep 26 '14 at 18:30

Divakar

+1 Good to know that behaviour of `char` applied on a cell array. It automatically pads columns! – Luis Mendo Sep 26 '14 at 20:56
@LuisMendo Yeah. Well, for efficiency one has to quickly get off cell arrays and onto numeric arrays if possible, and do further operations there! I just don't trust cell arrays for efficient implementations. – Divakar Sep 26 '14 at 20:57
nice trick with the `char`, but what do you do if one of the cell elements has a negative value? for example `C={[46, -36, 28],...` ? – bla Sep 26 '14 at 23:28
@natan Edited for that case too. – Divakar Sep 27 '14 at 00:07
Clever idea for fast cell2mat conversion with padding, +1. BTW, the problem with zero padding is that (in your general case) it behaves strangely when an actual vector has a zero at the end, because `unique` can't tell padded values from actual values. Here is an example: if `C{1} = [1 2 3 0]` and `C{3}` remains `[1 2 3]`, the program gives the same output as before, instead of a count of one for each. A `NaN` pad won't work either, as each `NaN` is treated as unique. Is there any other way to make this general? It seems like @bla's solution works for any possible case. – Santhan Salai May 31 '15 at 15:04
@SanthanSalai Well I am padding with ascii equivalent of "space" i.e. `32`, so it won't work when there is a `32` in one of the cells. Having zeros in the cells should still work with it. – Divakar May 31 '15 at 15:20
Yeah, 32 in your 1st case and 0 in your 2nd case (generalized for +ve and -ve). It works if they are in between, but not when they are at the end. – Santhan Salai May 31 '15 at 15:23
@SanthanSalai Ah yeah, that's right! Well yeah, it has some limitations for sure :) – Divakar May 31 '15 at 15:24
Still a great solution. Learnt lot from this & also other solutions.. thanks :) – Santhan Salai May 31 '15 at 15:26

bla · Answer 2 · 2014-09-26T18:40:30.863

One way I can think of is to convert each vector to a string and then use unique:

Cs = cellfun(@(x)(mat2str(x)),C,'uniformoutput',false);
[Cu,idx_u,idx] = unique(Cs);

Now you can count the number of occurrences with idx, for instance using

fv=tabulate(idx)

So fv already has all the info you need, but for display purposes I'll add:

[Cu' , num2cell(fv(:,2))]

ans = 

'[1 2 3]'      [2]
'[2 4 5 6]'    [1]
'[4 6]'        [1]
'[6 4]'        [2]
'[7 6 4 3]'    [1]
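tabulate is part of the Statistics Toolbox. If that toolbox is not available, a minimal sketch of the same counting with accumarray could look like the following; it also sorts by descending count, as in the table in the question. The names counts, order and result are illustrative and not from the answer above.

```matlab
C = {[1 2 3], [2 4 5 6], [1 2 3], [6 4], [7 6 4 3], [4 6], [6 4]};

% One string key per vector, as in the answer above
Cs = cellfun(@mat2str, C, 'UniformOutput', false);
[Cu, ~, idx] = unique(Cs);
Cu = Cu(:);                                   % force a column, whatever shape unique returns

% accumarray adds 1 to the bin of each label, so counts(k) is how often Cu{k} occurs
counts = accumarray(idx(:), 1);

% Most frequent vectors first
[counts_sorted, order] = sort(counts, 'descend');
result = [Cu(order) num2cell(counts_sorted)]; % N-by-2 cell: string key, count
```

A cell array of this shape should be usable as the Data of a uitable, which is one way to get the GUI table described in the question.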

score 3 · Answer 3 · edited Sep 26 '14 at 19:04

Another suggestion I can think of is to convert each array into a concatenation of numbers, then do a histogram to count how many values you have per entry. We would need to figure out how many unique numbers we have first, which would serve as the histogram edges through unique.

One thing I will need to note is that we are assuming that each element in your array for each cell is a single digit. This obviously won't work if there are numbers that are two digits or more.

In other words:

%// Convert each array of numbers into a single number
numbers = cellfun(@(x) sum(x.*10.^(numel(x)-1:-1:0)), C);
%// Find unique numbers
uniNumbers = unique(numbers);

%// Get histogram
out = histc(numbers, uniNumbers);

%// Display counts
disp([uniNumbers; out]);

out would contain the counts per unique number in your cell array. We get:

      46          64         123        2456        7643
       1           2           2           1           1

The trick with the first line of code is that I'm using the base-10 decomposition of a number: every number can be uniquely written as a sum of its digits multiplied by powers of 10. As such, 4587 can be represented as:

4000 + 500 + 80 + 7 ==> 4*10^3 + 5*10^2 + 8*10^1 + 7*10^0

I took each number in our array, and used those as coefficients for each decreasing power of 10, then summed them all together. As such, in your cell arrays, [1 2 3], is converted to 123, and so on. With your example, this is the output of numbers, which is doing what I talked about above:

numbers =

  Columns 1 through 6

         123        2456         123          64        7643          46

  Column 7

          64

Compare this with your actual cell array in C:

celldisp(C)

C{1} = 
     1     2     3     
C{2} =
     2     4     5     6
C{3} =
     1     2     3 
C{4} =
     6     4
C{5} =
     7     6     4     3
C{6} =
     4     6
C{7} =
     6     4
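To make the single-digit assumption noted above concrete, here is a small check using the same formula wrapped in a helper handle (to_key is a name chosen for this sketch). Two different vectors end up with the same key as soon as an entry has more than one digit:

```matlab
% Same conversion as above, as an anonymous function
to_key = @(x) sum(x .* 10.^(numel(x)-1:-1:0));

k1 = to_key([1 2 3 4]);   % 1*1000 + 2*100 + 3*10 + 4 = 1234
k2 = to_key([12 34]);     % 12*10 + 34 = 154
k3 = to_key([1 5 4]);     % 1*100 + 5*10 + 4 = 154, the same key as [12 34]

isequal(k2, k3)           % true: the two vectors would be counted as one
```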

Interesting move with that single number conversion, but do mention that this assumes single-digit entries. Whoa, it just struck me, I am assuming it too! — Divakar, Sep 26 '14 at 18:35
@Divakar - Oh that's right! I'll amend that :) Thank you very much. — rayryeng, Sep 26 '14 at 18:36
Oh! Well pointed out, @Divakar. My initial example was a little poor, as the numbers can be two digits, e.g. [49, 64, 25]. My initial code assumed single-digit cases, but when I went above 9 it all fell apart! — Mark, Sep 26 '14 at 18:39
@Mark - In that case, natan's code will work as it creates a string representation out of each number. My method will not work then. Should I delete? — rayryeng, Sep 26 '14 at 18:41
@Mark Okay seems like my code should withstand double or more digits too! — Divakar, Sep 26 '14 at 18:45
