How to pre-allocate a table with non-scalar sized variables?

Question

I was playing around with tables as a replacement for regular numerical arrays for various reasons, when I came across the following challenge: how to (pre-)allocate a table with non-scalar variables?

Given a loop like so:

function A = myfun(...)
N = large number
A = zeros(N,4);

for i = 1:N
   do stuff
   A(i,:) = [scalar, vector];
end

I want to instead return a table with named variables.

I could simply rewrite it to say:

function T = myfun2(...)
N = large number
A = zeros(N,4);

for i = 1:N
   do stuff
   A(i,:) = [scalar, vector];
end
T = table(A(:,1), A(:,2:end),'VariableNames',{'scalar','vector'});

which obviously yields a table with the format:

T =

  N×2 table

    scalar      vector   
    ______    ___________

      0       0    0    0
      0       0    0    0
      0       0    0    0
     ...          ...

Now, if I instead wanted to pre-allocate the output table and update it for every iteration I would try something along the lines of:

function T = myfun3(...)
N = large number
T = table('Size',[N,2],...
       'VariableTypes',{'double','double'},...
       'VariableNames',{'scalar', 'vector'});

for i = 1:N
   do stuff
   T(i,:) = {scalar, vector};
end

The problem with myfun3 is that the format of T is:

T =

  N×2 table

    scalar    vector
    ______    ______

      0         0   
      0         0   
      0         0

So the variable 'vector' is now a plain scalar column instead of holding a row vector in each row. Judging from the table documentation, the 'Size' pre-allocation syntax does not seem to accept per-variable array sizes.

Q1: How does one go about pre-allocating a table with non-scalar variables?

Q2: If A in myfun2 is large, is the overhead bad or is this an acceptable solution?

I am concerned that the overhead of indexing into and out of a table is so much larger than for a numeric array that it will adversely affect performance-critical code.

======= EDIT =======

I contacted MathWorks and they confirmed that, as of MATLAB R2019b, there is no way of achieving Q1 with the 'Size' parameter.
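
One possible workaround, which is not something MathWorks suggested, is to create the table with 'Size' and then overwrite the variable that should be non-scalar. The sketch below assumes a vector width of 3 and a small N purely for illustration:

```matlab
N = 5;  % small illustrative size; the question uses a large N

% Pre-allocate with the 'Size' syntax; both variables start as N-by-1 doubles.
T = table('Size', [N, 2], ...
          'VariableTypes', {'double', 'double'}, ...
          'VariableNames', {'scalar', 'vector'});

% Overwrite the second variable with an N-by-3 array (3 is an assumed width).
T.vector = zeros(N, 3);

% Rows can now be filled one variable at a time.
T.scalar(2)    = 7;
T.vector(2, :) = [1, 2, 3];

disp(size(T.vector))   % size of the nested variable
```

Note that the original N-by-1 column is discarded when it is overwritten, so for that variable the 'Size' call saves nothing compared to passing zeros(N,3) to table directly.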

As indicated in the answer by @Bentoy13, although you can create a table and loop over it, that can be very slow (possibly orders of magnitude slower than populating an array and then creating the table at the end). I'm a big fan of tables, but they have to be used with care. - Phil Goddard, Sep 28 '19 at 14:17

Bentoy13 · Accepted Answer · 2019-09-27T11:58:01.063

You can create the table before the for-loop, then access it by column names:

function T = myfun2(...)
N = large number
A = zeros(N,4);
T = table(A(:,1), A(:,2:end),'VariableNames',{'scalar','vector'});
for i = 1:N
   do stuff
   T.scalar(i,:) = scalar_i;
   T.vector(i,:) = vector_i;
   % or in one line: T(i,:) = table(scalar_i, vector_i);
end

I am not sure that creating a small table on each iteration is efficient, so it is probably better to assign to one variable (column) at a time.
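
For reference, here is a runnable version of the per-variable assignment. It is only a sketch: N is kept small, and scalar_i and vector_i are hypothetical stand-ins for whatever "do stuff" computes.

```matlab
N = 4;                       % small illustrative size
T = table(zeros(N, 1), zeros(N, 3), ...
          'VariableNames', {'scalar', 'vector'});

for i = 1:N
    % Hypothetical stand-ins for the values computed by "do stuff".
    scalar_i = i;
    vector_i = i * [1, 10, 100];

    % Assign into each variable separately; no temporary table is built.
    T.scalar(i)    = scalar_i;
    T.vector(i, :) = vector_i;
end

disp(T)
```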

NOTE

As Juhl pointed out in the comments, creating a table from temporary arrays might allocate the data twice, whereas with the 'Size' argument one would expect only a single block of data to be allocated.

So let's check this. On my computer, running MATLAB R2019a, memory reports:

>> memory
Maximum possible array:       56239 MB (5.897e+10 bytes) *

So I can allocate 5.897e10 / 8 ≈ 7.37e9 elements in a single array (doubles take 8 bytes each). Let's say that I want to create a table with one column holding more than half of this (about 3.69e9 elements), for example 4e9:

>> T = table(zeros(4e9,1));
>> memory
Maximum possible array:       33644 MB (3.528e+10 bytes)

It takes a long time to allocate, but it finishes. With 'Size', the result is practically the same:

>> T = table('Size', [4e9, 1], 'VariableTypes', {'double'});
>> memory
Maximum possible array:       33677 MB (3.531e+10 bytes) *

So it appears that we don't have double allocation.

One fun fact: T takes less memory than expected (4e9 doubles should need 3.2e10 bytes). Once I modify the last element of the table, memory consumption rises to roughly the expected size:

>> T.Var1(end) = 1;
>> memory
Maximum possible array:       27574 MB (2.891e+10 bytes)

DISCLOSURE

Please note that modifying this kind of table takes time:

>> tic; T.Var1(end) = 1; toc
Elapsed time is 33.286967 seconds.

So my conclusion is: work with normal arrays, it is A LOT faster:

>> tic; T = table('Size', [4e9, 1], 'VariableTypes',{'double'}); toc
Elapsed time is 15.997680 seconds.
>> tic; T.Var1(end) = 1; toc
Elapsed time is 33.286967 seconds.
>> clear T;

>> tic; A = zeros(4e9,1); toc
Elapsed time is 0.043366 seconds.
>> tic; A(end) = 1; toc
Elapsed time is 0.002430 seconds.
>> clear A;
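
The same comparison can be tried at a size that fits on most machines. This is a minimal sketch, assuming N = 1e4 and placeholder per-row values; timings depend on the machine and MATLAB release, so no numbers are quoted here.

```matlab
N = 1e4;   % assumed size, far smaller than in the measurements above

% Variant 1: fill a numeric array, convert to a table once at the end.
tic;
A = zeros(N, 4);
for i = 1:N
    A(i, :) = [i, i, 2*i, 3*i];   % placeholder for real per-row work
end
T1 = table(A(:, 1), A(:, 2:end), 'VariableNames', {'scalar', 'vector'});
tArray = toc;

% Variant 2: pre-allocate the table and write into it on every iteration.
tic;
T2 = table(zeros(N, 1), zeros(N, 3), 'VariableNames', {'scalar', 'vector'});
for i = 1:N
    T2.scalar(i)    = i;
    T2.vector(i, :) = [i, 2*i, 3*i];
end
tTable = toc;

fprintf('array then table: %.4f s\n', tArray);
fprintf('table in loop:    %.4f s\n', tTable);
```

Both variants produce the same table, so the only difference being measured is where the per-row writes happen.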

Yes, that will work, but then you are not using the 'Size' argument of `table`. I'm assuming that the performance hit isn't too big, i.e. it won't allocate both blocks of data? - Juhl, Sep 27 '19 at 07:54
Thanks for the performance update. Modifying such a table is much worse than I thought! Keeping a memory graph (Windows: Resource Monitor) open while allocating and modifying shows that the memory usage fluctuates, which is probably where the poor performance comes from. - Juhl, Oct 03 '19 at 11:54
