6

I have a (large) cell array, with various data types. For example,

 myCell = { 1, 2, 3, 'test',  1 , 'abc';
            4, 5, 6, 'foob', 'a', 'def' };

This can include more obscure types like java.awt.Color objects.

I want to ensure that the data in each column is of the same type, since I want to perform table-like operations on it. However, this process seems very slow!

My current method is to use cellfun to get the classes, and strcmp to check them:

% Get class of every cell element
types = cellfun( @class, myCell, 'uni', false );
% Check that they are consistent for each column
typesOK = all( strcmp(repmat(types(1,:), size(types,1), 1), types), 1 );
% Output the types (mixed type columns can be handled using typesOK)
types = types(1, :);

% Output for the above example: 
% >> typesOK = [1 1 1 1 0 1]
% >> types = {'double', 'double', 'double', 'char', 'double', 'char'}

I had thought to use cell2table, since it does type checking for the same reason. However, it doesn't give me the desired result (which columns are which types, strictly).

Is there a quicker way to check type consistency within a cell array's columns?


Edit: I've just done some profiling...

It appears the types = cellfun( @class, ...) line takes over 90% of the processing time. If your method is only subtly different from mine, that is the line which needs to change; the strcmp is pretty quick.


Edit: I was fortunate to have many suggestions for this problem, and I have compiled them all into a benchmarking answer for performance tests.

Wolfie
  • A conventional loop instead of `cellfun` might save a little bit of time (although the *funs have got faster in newer versions). – Sardar Usama Jan 18 '18 at 10:19
  • Can't you use `isequal` instead of `strcmp`? – rahnema1 Jan 18 '18 at 10:32
  • @rahnema1 `isequal` is indeed a bit quicker, but that's because it doesn't give a column-wise output, it is simply `true` or `false` for the entire array. – Wolfie Jan 18 '18 at 11:07

5 Answers

3

This still needs testing to see whether it is faster for very large arrays, but maybe something like this:

function [b] = IsTypeConsistentColumns(myCell)
%[
    b = true;
    try
        for ci = 1:size(myCell, 2)
           cell2mat(myCell(:, ci));
        end
    catch err
        if (strcmpi(err.identifier, 'MATLAB:cell2mat:MixedDataTypes'))
            b = false;
        else
            rethrow(err);
        end
    end
%]
end

Whether this helps depends on how fast cell2mat is compared to your string comparison (even though the result of cell2mat is not used here).

Note that cell2mat throws an error if the types are not consistent (identifier: 'MATLAB:cell2mat:MixedDataTypes', message: 'All contents of the input cell array must be of the same data type.').

EDIT: limiting this to the cellfun('isclass', c, cellclass) test

This version uses only the type consistency check that the cell2mat routine performs internally:

function [consistences, types] = IsTypeConsistentColumns(myCell)
%[
    ncols = size(myCell, 2);
    consistences = false(1, ncols);
    types = cell(1, ncols);
    for ci = 1:ncols
        cellclass = class(myCell{1, ci});
        ciscellclass = cellfun('isclass', myCell(:, ci), cellclass);

        consistences(ci) = all(ciscellclass);
        types{ci} = cellclass; 
    end    
%]
end

With your test case myCell = repmat( { 1, 2, 3, 'test', 1 , 'abc'; 4, 5, 6, 'foob', 'a', 'def' }, 10000, 5 );

it takes about 0.0123 seconds on my computer with R2015b. It could be even faster if you only need to fail on the first inconsistent column (here I'm testing them all).
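The early-exit idea could look something like the sketch below. The function name firstMixedColumn and its return convention (0 when every column is consistent) are assumptions made for illustration, not part of the original code, and the sketch assumes the cell array has at least one row.

```matlab
function badCol = firstMixedColumn(myCell)
    % Hypothetical helper: index of the first column whose elements are not
    % all of the same class as its first element, or 0 if there is none.
    % Assumes myCell has at least one row.
    badCol = 0;
    for ci = 1:size(myCell, 2)
        cellclass = class(myCell{1, ci});
        if ~all(cellfun('isclass', myCell(:, ci), cellclass))
            badCol = ci;
            return  % stop at the first inconsistent column
        end
    end
end
```

For the small example in the question, firstMixedColumn(myCell) returns 5, since the fifth column mixes a double with a char.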

CitizenInsane
  • All elements of a `cell2mat` argument must be the same data type, which is clearly not the case from my example and description, where the types aren't even necessarily numerical (as needed for a matrix). – Wolfie Jan 18 '18 at 09:56
  • @Wolfie Looking at the code of `cell2mat`, the type consistency check is performed at the very beginning of the routine with `cellfun('isclass', myCell, cellclass)`, where `cellclass` is the class of the first element (nb: I work column by column in my code). So this should not be limited to numerical types only. Anyway, I don't know if the `isclass` test will be any faster than string comparison. – CitizenInsane Jan 18 '18 at 10:29
  • I was unclear, they have to be consistent size, or numerical. For instance `cell2mat({'abc'; 'de'})` will fail, since you can't have a character array `[abc;de]`, the dimensions are not consistent – Wolfie Jan 18 '18 at 10:37
  • Yes but again type consistency is performed before conversion to matrix + I'm testing for error identifier ... I will test with just `isclass` from your benchmark code – CitizenInsane Jan 18 '18 at 10:40
  • @Wolfie I illustrated my purpose with some edit to my post – CitizenInsane Jan 18 '18 at 11:09
3

This is a collection of the different suggestions with a benchmarking script to compare timings...

function benchie    
    % Create a large, mixed type cell array
    myCell = repmat( { 1, 2, 3, 'test',  1 , 'abc';
                       4, 5, 6, 'foob', 'a', 'def' }, 10000, 5 );

    % Create anonymous functions for TIMEIT               
    f1 = @() usingStrcmp(myCell);
    f2 = @() usingUnique(myCell);
    f3 = @() usingLoops(myCell);
    f4 = @() usingISA(myCell);
    f5 = @() usingIsClass(myCell);
    % Timing of different methods
    timeit(f1)
    timeit(f2)
    timeit(f3)    
    timeit(f4)
    timeit(f5)
end

function usingStrcmp(myCell)
    % The original method
    types = cellfun( @class, myCell, 'uni', false );
    typesOK = all( strcmp(repmat(types(1,:), size(types,1), 1), types), 1 );
    types = types(1, :);
end

function usingUnique(myCell)
    % Using UNIQUE instead of STRCMP, as suggested by rahnema1 
    types = cellfun( @class, myCell, 'uni', false );
    [type,~,idx]=unique(types);
    u = unique(reshape(idx,size(types)),'rows');
    if size(u,1) == 1
        % consistent
    else
        % not-consistent
    end
end

function usingLoops(myCell)
    % Using loops instead of CELLFUN. Move onto the next column if a type
    % difference is found, otherwise continue looping down the rows
    types = cellfun( @class, myCell(1,:), 'uni', false );
    typesOK = true(size(types));
    for c = 1:size(myCell,2)
        for r = 1:size(myCell,1)
            if ~strcmp( class(myCell{r,c}), types{c} )
                typesOK(c) = false;
                break  % stop scanning this column at the first mismatch
            end
        end
    end
end

function usingISA(myCell)
    % Using ISA instead of converting all types to strings. Suggested by Sam
    types = cellfun( @class, myCell(1,:), 'uni', false );
    for ii = 1:numel(types)
       typesOK(ii) = all(cellfun(@(x)isa(x,types{ii}), myCell(:,ii)));
    end
end

function usingIsClass(myCell)
    % using the same method as found in CELL2MAT. Suggested by CitizenInsane 
    ncols = size(myCell, 2);
    typesOK = false(1, ncols);
    types = cell(1, ncols);
    for ci = 1:ncols
        cellclass = class(myCell{1, ci});
        ciscellclass = cellfun('isclass', myCell(:, ci), cellclass);
        typesOK(ci) = all(ciscellclass);
        types{ci} = cellclass; 
    end  
end

Outputs:

Tested on R2015b

usingStrcmp:  0.8523 secs
usingUnique:  1.2976 secs
usingLoops:   1.4796 secs
usingISA:    10.2670 secs 
usingIsClass: 0.0131 secs % RAPID!

Tested on R2017b

usingStrcmp:  0.8282 secs
usingUnique:  1.2128 secs
usingLoops:   0.4763 secs % ZOOOOM! (Relative to R2015b)
usingISA:     9.6516 secs
usingIsClass: 0.0093 secs % RAPID!

The looping method will depend heavily on where the type discrepancy occurs, since it could loop over every row of every column or just 2 rows of every column.

With the same inputs (as shown), though, looping has been massively optimised in the newer version of MATLAB (R2017b): it saves over 65% of the time relative to R2015b and is roughly 40% quicker than the original method!


Conclusions:

  • Among the methods that call class on every element, the original method gives consistently quick times regardless of input.
  • On newer MATLAB releases, the looping method may be quicker than the original, depending on where the type discrepancies occur.

  • Update: The method proposed by CitizenInsane is extremely quick compared to the other versions, and is likely hard to beat since it uses the same methodology found in MATLAB's own cell2mat.

    Recommendation: use the above usingIsClass function.
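As written for the benchmark, usingIsClass returns nothing, so to act on the result the same loop has to hand back typesOK and types. A minimal sketch on the small example from the question follows; the variable names mixedCols and cleanCell are illustrative assumptions.

```matlab
% Small example from the question
myCell = { 1, 2, 3, 'test',  1 , 'abc';
           4, 5, 6, 'foob', 'a', 'def' };

ncols   = size(myCell, 2);
typesOK = false(1, ncols);
types   = cell(1, ncols);
for ci = 1:ncols
    % Class of the first element, then test the whole column against it
    types{ci}   = class(myCell{1, ci});
    typesOK(ci) = all(cellfun('isclass', myCell(:, ci), types{ci}));
end

mixedCols = find(~typesOK);      % indices of mixed-type columns (5 here)
cleanCell = myCell(:, typesOK);  % keep only the type-consistent columns
```

This reproduces the typesOK and types outputs listed in the question and leaves cleanCell as a 2-by-5 cell array.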

Wolfie
1

You can use unique:

myCell = { 1, 2, 3, 'test',  1 , 'abc';
            4, 5, 6, 'foob', 'a', 'def' };

types = cellfun( @class, myCell, 'uni', false );
[type,~,idx]=unique(types);
u = unique(reshape(idx,size(types)),'rows');
if size(u,1) == 1
    disp('consistent')
else
    disp('non-consistent')
end
rahnema1
  • In a quick benchmark, this appears to be ~30% slower than my method, although it's neat. I just did some profiling, and it's the `types = cellfun (@class, ...)` line which takes over 90% of the time. Unfortunately our solutions both have that line in common! I'll add this useful information to the question – Wolfie Jan 18 '18 at 10:12
1

How about this:

>>  myCell = { 1, 2, 3, 'test',  1 , 'abc';
               4, 5, 6, 'foob', 'a', 'def' }
myCell =
  2×6 cell array
    [1]    [2]    [3]    'test'    [1]    'abc'
    [4]    [5]    [6]    'foob'    'a'    'def'

>> firstRowTypes = cellfun(@class, myCell(1,:), 'uni', false)
firstRowTypes =
  1×6 cell array
    'double'    'double'    'double'    'char'    'double'    'char'

>> for i = 1:numel(firstRowTypes)
       typesOK(i) = all(cellfun(@(x)isa(x,firstRowTypes{i}), myCell(:,i)));
   end

>> typesOK
typesOK =
  1×6 logical array
   1   1   1   1   0   1

I haven't done extensive timings, but I think that should speed things up (at least for large cell arrays), as

  1. you only convert the first row's types into strings
  2. you're making the type comparisons directly using isa, rather than converting all the types to strings and then comparing strings.
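One caveat worth keeping in mind: isa is a looser test than comparing class names, because it also returns true for subclasses and for category names such as 'integer' or 'numeric'. With class names taken from the first row this only matters if a later row holds a subclass instance, but the difference can be seen in a small sketch (the variable names are illustrative):

```matlab
x = int8(5);

exactMatch    = isa(x, 'int8');                % true: exact class
categoryMatch = isa(x, 'integer');             % true: category match, although class(x) is 'int8'
strictMatch   = strcmp(class(x), 'integer');   % false: strict comparison of class names
```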
Sam Roberts
  • Good thought Sam, but a quick [benchmark](https://stackoverflow.com/a/48319049/3978545) shows that this is considerably slower than the original (~10x). I assume `isa` has a lot more internal checks than just my crude class grabbing and string comparison. – Wolfie Jan 18 '18 at 10:35
  • Oh well, worth a try. Sorry I didn't do the timings myself :) – Sam Roberts Jan 18 '18 at 10:42
  • No worries, that's why I put the benchie script/answer together. It's always good to see other approaches. – Wolfie Jan 18 '18 at 10:43
0

You could try validateattributes, but you have to state the class explicitly and specify which columns to check. But you could easily get that from the first row of your cell array, for example.

foo = { 'abc' 5 'def'; ...
        'foo' 6 'bar' };

cellfun(@(x) validateattributes(x, {'char'}, {}), foo(:, [1 3]))
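Since validateattributes throws an error on a mismatch, one way to get a logical result instead is to wrap it in try/catch. The helper name hasClass below is hypothetical, and treating every error as "not valid" is a simplifying assumption of this sketch.

```matlab
function tf = hasClass(x, className)
    % Hypothetical wrapper: true if validateattributes accepts x as
    % className, false if it throws. Any error is treated as "not valid".
    try
        validateattributes(x, {className}, {});
        tf = true;
    catch
        tf = false;
    end
end
```

A column could then be checked with all(cellfun(@(x) hasClass(x, 'char'), foo(:, 1))). This still visits every element through an anonymous function, so it is not expected to be fast.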
  • **1.** I think you have a typo, you haven't closed the brace in the second `validateattributes` argument. **2.** This still uses `cellfun` to loop over every element, and just seems *more* complicated because I have to get all unique classes and their column indices to repeatedly call your main code line and check types. That just replaces my `strcmp` with lots of looping... **3.** `validateattributes` will cause an error if not valid, which is undesirable, I would rather just be able to state the column is mixed type – Wolfie Jan 18 '18 at 09:49