
Is there a modeling convention for storing an evolving dataset at each timestep in either rows or columns? For example, if I have a dataset of 10 spatial points, each with a value x at every timestep t, and 20 timesteps total, should I store the values of x across the columns of row t (making a 20x10 matrix), or down the rows of column t (making a 10x20 matrix)?

I recognize that this doesn't change anything fundamentally, but I want to be consistent and figured I would see what the convention is, if there is one at all. What are the pros/cons of each approach?

    This is entirely personal preference. Some MATLAB internals (such as `plot`) assume that each column is a time-series but this is by no means standard. – Suever May 31 '16 at 17:22
  • I do a lot of M&S and the standard usage is to put time in the first column, and list the data points in additional columns. It is a personal preference, but if you are generating data that lots of other people are going to use, you are wise to consider standard usages. But, by far the most important consideration is to CLEARLY LABEL everything. – gariepy May 31 '16 at 17:47

1 Answer


Assuming your concern is performance, it depends on how you access your data. It is faster to successively access elements that are contiguous in memory. MATLAB stores matrices in column-major order, so if you need to iterate over the time dimension, it is more efficient to iterate over the rows of a particular column than over the columns of a particular row.
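As a quick illustration of what column-major order means (a minimal sketch): linearizing a matrix with `A(:)` walks down each column in turn, because that is the order the elements sit in memory.

```matlab
% Column-major order: A(:) walks down column 1, then column 2, ...
% so the elements of a single column are adjacent in memory.
A = [1 2; 3 4];
A(:)   % returns [1; 3; 2; 4]
```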

There's a nice article on this topic on the MathWorks website, Programming Patterns: Maximizing Code Performance by Optimizing Memory Access:

Your code achieves maximum cache efficiency when it traverses monotonically increasing memory locations. Because MATLAB stores matrix columns in monotonically increasing memory locations, processing data column-wise results in maximum cache efficiency.

Consider this example. First, try successively accessing data from different columns (which would be scattered across different blocks of memory):

N = 2e4;
X = randn(N,N);
tic;
for i = 1:N
   for j = 1:N
       if X(i,j) >= 0
           X(i,j) = X(i,j) + 1;
       end
   end
end
toc;

>> Elapsed time is 29.200216 seconds.

Then the other way around: the outer loop over columns, the inner loop over rows:

N = 2e4;
X = randn(N,N);
tic;
for j = 1:N
   for i = 1:N
       if X(i,j) >= 0
           X(i,j) = X(i,j) + 1;
       end
   end
end
toc;

>> Elapsed time is 8.084906 seconds.

A striking 3.6x speed-up. The exact ratio will obviously vary with the MATLAB release and your PC, but the pattern is quite clear.

For the same reason, extracting a column vector from a matrix is faster than extracting a row vector. Some built-in functions may also work marginally faster with columns, but you need to profile each case separately.
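A sketch of the column-versus-row extraction difference (timings here are illustrative and will vary by machine):

```matlab
% Extracting a column copies one contiguous block of memory;
% extracting a row gathers elements strided N doubles apart.
N = 1e4;
X = randn(N, N);
tic; c = X(:, 1); toc   % contiguous copy
tic; r = X(1, :); toc   % strided copy - typically slower
```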

So it really depends on your actual code and how you deal with your timeseries. You may try both options and profile them to see which variant performs better. But in general you can use the following rule of thumb when working with timeseries data:

  • If you tend to iterate over time (which is generally the case), store timeseries in separate columns.
  • If you tend to iterate in a spatial / cross-sectional manner, store timeseries in separate rows.
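Applied to the question's example, the first rule of thumb looks like this (a sketch; the variable names are illustrative):

```matlab
% Time runs down the rows, one spatial point per column:
% a 20x10 matrix for 20 timesteps and 10 points.
nT = 20; nPts = 10;
x = randn(nT, nPts);
for p = 1:nPts
    series = x(:, p);   % one point's full timeseries: a contiguous column
    % ... process series ...
end
```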