1

My organization is usually a few years behind the most recent Matlab version. I am finding that splitapply is extremely slow when there are many groups (two numerical grouping variables), in sharp contrast to my experience with SQL. I suspect that it loops through all the groups. Whatever the cause, for long term planning purposes, I was wondering if anyone can comment on whether this is a problem in post-2015b Matlab versions?

Here is some benchmarking code. The results aren't as extreme as for my real problem, but it still shows the disparity in execution time.

clear all

% This data is more like mine and takes splitapply 3+ minutes to run
%-------------------------------------------------------------------
% tDat = array2table( ...
%     floor([ rand(2e6,1) 20e3*rand(2e6,1) 50*rand(2e6,1) ]) , ...
%     'VariableNames' , {'data2add','groupVar1','groupVar2'} );

% This data runs way faster than mine, but still illustrates the problem
%------------------------------------------------------------------------
tDat = array2table( floor(100*rand(2e6,3)) , ...
                    'VariableNames' , ...
                    {'data2add','groupVar1','groupVar2'} );

[ G , tRollup ] = findgroups( tDat(:,{'groupVar1','groupVar2'}) );

tic
tRollup.total_sa = splitapply( @sum, tDat.data2add, G );
disp('Done splitapply')
toc
fprintf('\n')

tic
tRollup.total_aa1 = accumarray( G, tDat.data2add );
disp('Done accumarray #1')
toc
fprintf('\n')

tic
tRollup.total_aa2 = accumarray( G, tDat.data2add, [], @sum );
disp('Done accumarray #2')
toc
fprintf('\n')

% Confirm that results are equivalent

if isequal( tRollup.total_sa , tRollup.total_aa1 )
   disp('tRollup.total_sa == tRollup.total_aa1')
else
   disp('tRollup.total_sa ~= tRollup.total_aa1')
end

if isequal( tRollup.total_aa1 , tRollup.total_aa2 )
   disp('tRollup.total_aa1 == tRollup.total_aa2')
else
   disp('tRollup.total_aa1 ~= tRollup.total_aa2')
end

The output is:

Done splitapply
Elapsed time is 2.550241 seconds.

Done accumarray #1
Elapsed time is 0.021673 seconds.

Done accumarray #2
Elapsed time is 0.020397 seconds.

tRollup.total_sa == tRollup.total_aa1
tRollup.total_aa1 == tRollup.total_aa2
user36800
  • 2,019
  • 2
  • 19
  • 34
  • Can you use `accumarray` instead of `splitapply`? I think they are pretty similar – Luis Mendo May 14 '18 at 20:32
  • Thanks, Luis. I can use `accumarray` for simple functions to be applied to the data. What took `splitapply` 2.5 minutes was instantaneous of `accumarray`. But `accumarray` is only for applying simple functions to data. `splitapply` can apply functions that take multiple inputs and generates multiple outputs. – user36800 May 15 '18 at 12:01
  • For `accumarray`, I'd have to perform all the operations between variables separately, the multiple aggregations separately, and the operations between the individual aggregation results separately. It will still be faster compared to `splitapply`, but the code starts to bloat (compared to SQL). It's still a practical option for those cases where there are many groups, but I'm hoping to hear that Matlab 2018 will have addressed this problem....otherwise, it may still be worthwhile going through the extra code complexity to connect with Access, which doesn't thrill me. – user36800 May 15 '18 at 12:02
  • I see. So that’s the added functionality of `splitapply` compared to `accumarray`: multiple outputs. Can’t you pass `accumarray` a function handle to your complicated function and have this function return a cell array packing all results? – Luis Mendo May 15 '18 at 12:50
  • I can, but I need extra code to associate the results with the different columns/variables/fields in the output table/struct. `splitapply` also allows for packing multiple results into an array, but I've managed to avoid explicitly unpacking (e.g., http://stackoverflow.com/questions/50167001/matlab-explicit-or-named-association-of-splitapply-arguments-with-table-varia). As an alternative to SQL, I'm trying to execute operations with as little code bloat as possible. Basically, I'm use to doing SQL operations with little thought, and the Matlab alternative seems to a monster of cognitive load. – user36800 May 15 '18 at 14:06
  • I studied the `accumarray` help in detail. It isn't designed to handle multiple columns of input data. So operations between corresponding elements in different input columns have to be done before `accumarray`; operations done on the results of applying `accumarray` to different columns must be done afterward. Where speed is not a problem, `splitapply` seems perferrable. Where it is, it's a toss-up whether `accumarray` is better than transferring the data to Access. For now, I prefer sticking with one environment for all operations. Two environments, twice the risk of something going wrong. – user36800 May 15 '18 at 20:55
  • If you want an answer regarding whether or not "this is still a problem", I would suggest adding some self-contained benchmarking code to your question (including the result of a run on your computer) so that people can run it on their machines (presumably with newer MATLABs) and compare. This point is important because it is possible that you have some edge case for which `splitapply` is very inefficient, which the rest of us are extremely likely to miss if we start guessing our own test data. – Dev-iL May 22 '18 at 17:29
  • Thanks, Dev-iL. I've added the test code. – user36800 May 23 '18 at 02:02

0 Answers0