Prologue
This answer is based on the original post and on the clarifications ( both ) provided by the author during the past week.
It confronts the question of the adverse performance hit(s) introduced by low-level, physical-media-dependent "fragmentation" ( caused by both the file-system and the file-access layers ) with the real-use problems of such an approach, both in terms of TimeDOMAIN magnitudes and of ComputingDOMAIN repetitiveness.
Finally, a state-of-the-art, principally fastest possible solution to the given task is proposed, so as to minimise the damage from both wasted effort and mis-interpretation errors stemming from idealised or otherwise invalid assumptions, such as the assumption that the risk of "serious file fragmentation is low" because the whole file would be written in one session ( which is simply not possible in principle, given the many multi-core / multi-process operations of a contemporary O/S acting in real-time over the time-of-creation and over a sequence of extensive modifications ( ref. the MATLAB size limits ) of TB-sized BLOB file-objects inside contemporary COTS FileSystems ).
One may hate the facts; however, the facts remain true out there until a faster & better method moves in.
First, before considering performance, realise the gaps in the concept
The real adverse performance hit is not caused by HDD-IO and is not related to the file fragmentation

RAM is not an alternative for the semi-permanent storage of the .mat file

- Additional operating system limits and interventions + additional driver- and hardware-based abstractions were ignored in the assumptions about un-avoidable overheads

- The said computational scheme was omitted from the review of what will have the biggest impact / influence on the resulting performance
Given:
The whole processing is intended to be run just once, no optimisation / iterations, no continuous processing
Data have 1E6 double float-values x 1E5 columns = about 0.8 TB ( + HDF5 overhead )
Contrary to the original post, there is no random IO associated with the processing
Data acquisition phase communicates with .NET to receive DataELEMENTs into MATLAB
That means that, since v7.4, there is

- a 1.6 GB limit on the MATLAB WorkSpace in a 32bit Win ( 2.7 GB with a 3GB switch ),

- a 1.1 GB limit on MATLAB's biggest Matrix in wXP / 1.4 GB in wV / 1.5 GB,

- a bit "released" 2.6 GB limit on the MATLAB WorkSpace + a 2.3 GB limit on the biggest Matrix in a 32bit Linux O/S.
Having a 64bit O/S will not help any kind of 32bit MATLAB 7.4 implementation, which will still fail to work due to yet another limit, the maximum number of cells in an array, which does not cover the 1E12 elements requested here.

The only chance is to have both a 64bit O/S and a 64bit MATLAB ( a back-of-the-envelope size check is sketched after this list ).
Data storage phase assumes block-writes of row-ordered data ( a collection of row-ordered data blocks ) into a MAT-file on an HDD-device
Data processing phase assumes re-processing of the data in that MAT-file on an HDD-device, after all inputs have been acquired and marshalled to the file-based off-RAM storage, but in a column-ordered manner; just column-wise mean()-s / max()-es need to be calculated ( nothing more complex )
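To put the figures above into proportion, here is a back-of-the-envelope size check ( a minimal sketch only; the 1E6 x 1E5 shape and the 8 B double size are the figures given above, the GB ceilings are the 32bit limits listed above ):

    % Minimal size sanity-check for the figures given above
    nROWs        = 1E6;                  % rows delivered by the .NET reader
    nCOLs        = 1E5;                  % columns per row
    sizeOfDOUBLE = 8;                    % [B] per double float-value

    rawBYTEs = nROWs * nCOLs * sizeOfDOUBLE;                          % = 8E11 B
    fprintf( 'Raw data size ~ %.2f TB\n', rawBYTEs / 1E12 )           % ~ 0.80 TB ( + HDF5 overhead )

    % Compare against the 32bit WorkSpace ceilings listed above:
    fprintf( 'vs. 1.6 GB 32bit Win   WorkSpace limit: %.0fx larger\n', rawBYTEs / ( 1.6 * 1E9 ) )
    fprintf( 'vs. 2.6 GB 32bit Linux WorkSpace limit: %.0fx larger\n', rawBYTEs / ( 2.6 * 1E9 ) )

No 32bit configuration comes anywhere close, hence the need for both a 64bit O/S and a 64bit MATLAB noted above.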
Facts:
- MATLAB uses a "restricted" implementation of the HDF5 file-structure for binary files. Review performance measurements on real data & real hardware ( HDD + SSD ) to get a feeling for the scale of its un-avoidable weaknesses.
The Hierarchical Data Format ( HDF ) was born in 1987 at the National Center for Supercomputing Applications ( NCSA ), more than 20 years ago. Yes, that old. The goal was to develop a file format that combines flexibility and efficiency to deal with extremely large datasets. Somehow the HDF format was not used in the mainstream, as just a few industries were indeed able to really make use of its terrifying capacities or simply did not need them.
FLEXIBILITY means that the file-structure bears some overhead one does not need if the content of the array is not changing ( you pay the cost without consuming any benefit of using it ), and the assumption that HDF5's limits on the overall size of the data it can contain somehow help and save the MATLAB side of the problem is not correct. One can inspect this HDF5-based structure directly, as sketched below.
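If one wants to see that overhead with one's own eyes, a v7.3 MAT-file can be opened as a plain HDF5 container ( a minimal sketch; the file name demo_v73.mat is just an illustrative placeholder, h5disp / h5info / whos are standard MATLAB helpers in newer releases ):

    % Save a small array as a v7.3 ( HDF5-based ) MAT-file and inspect its internal structure.
    % The file name is a hypothetical placeholder used only for this illustration.
    A = rand( 1000, 1000 );
    save( 'demo_v73.mat', 'A', '-v7.3' );    % -v7.3 forces the HDF5-based MAT-file format

    h5disp( 'demo_v73.mat' )                 % dump the HDF5 groups / datasets / attributes
    info = h5info( 'demo_v73.mat' );         % or walk the same structure programmatically
    whos( '-file', 'demo_v73.mat' )          % MATLAB's own view of the stored variables

Even for this small array, the group / dataset / attribute listing makes the structural overhead of the "restricted" HDF5 layout visible.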
MAT-files are good in principle, as they avoid an otherwise persistent need to load a whole file into RAM to be able to work with it.

Nevertheless, MAT-files do not serve well the simple task as it was defined and clarified here. An attempt to use them will result in nothing but poor performance, and the HDD-IO file-fragmentation ( adding a few tens of milliseconds during write-throughs and something less than that on read-aheads during the calculations ) will not help at all in judging the core reason for the overall poor performance.
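For contrast, a MAT-file-centric pass of the kind the original post was heading towards would look roughly like this ( a hedged sketch only, assuming a newer MATLAB with the matfile() partial-IO accessor and a hypothetical, already fully written huge_data.mat holding the 1E6 x 1E5 matrix as a variable M ); every single column read below is one more round-trip through the very HDD-IO chain criticised above:

    % Hypothetical MAT-file-centric pass ( the approach argued against above ).
    % 'huge_data.mat' and the variable name 'M' are illustrative placeholders.
    mf         = matfile( 'huge_data.mat' ); % partial-IO handle, no full load into RAM
    [ nR, nC ] = size( mf, 'M' );            % 1E6 rows x 1E5 columns as given above

    colMEAN = zeros( 1, nC );
    colMAX  = -Inf(  1, nC );

    for c = 1:nC                             % 1E5 column-wise reads from the HDD-device
        aCol       = mf.M( :, c );           % each read pays the full HDD-IO / HDF5 cost
        colMEAN(c) = mean( aCol );
        colMAX(c)  = max(  aCol );
    end

That is 1E5 column-wise HDD reads of about 8 MB each, just to compute two trivial statistics, which is exactly the cost the pipe-lined approach below avoids.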
A professional solution approach
Rather than moving the whole gigantic set of 1E12 DataELEMENTs into a MATLAB in-memory proxy data array, which is then just scheduled for a next coming sequenced stream of HDF5 / MAT-file HDD-device IO-s ( write-throughs and O/S vs. hardware-device-chain conflicting / sub-optimised read-aheads ), only so as to have all that immense work "just [married] ready" for a few trivially simple calls of the mean() / max() MATLAB functions ( which will do their best to revamp each of the 1E12 DataELEMENTs in just another order ( and even TWICE -- yes -- another circus right after the first job-processing nightmare has gone all the way down, through all the HDD-IO bottlenecks ) back into MATLAB in-RAM objects ), do redesign this very step into a pipe-lined BigDATA processing from the very beginning:
    % accumulators are initialised once, before the .NET reading loop starts
    aRowCOUNT                = 0;                     % row counter ( for mean() )
    anIncrementalSumInCOLUMN = zeros( 1, 1E5 );       % per-column running sums   ( 1E5 columns as Given )
    aMaxInCOLUMN             = -Inf(  1, 1E5 );       % per-column running maxima ( first value always updates )

    while true                                        % ref. comment Simon W Oct 1 at 11:29
       [ isStillProcessingDotNET, ...                 % a FLAG from the .NET reader function
         aDotNET_RowOfVALUEs ...                      % a ROW  from the .NET reader function
         ] = GetDataFromDotNET( aDtPT );              % .NET reader
       if ( isStillProcessingDotNET )                 % Yes, more rows are still to come ...
          aRowCOUNT = aRowCOUNT + 1;                  % keep .INC for aRowCOUNT ( mean() )
          for i = 1:size( aDotNET_RowOfVALUEs, 2 )    % stepping across each column
              aValue = aDotNET_RowOfVALUEs(i);
              anIncrementalSumInCOLUMN(i) = ...
              anIncrementalSumInCOLUMN(i) + aValue;   % keep .SUM for each column ( mean() )
              if ( aMaxInCOLUMN(i) < aValue )         % retest for a "max.update()"
                   aMaxInCOLUMN(i) = aValue;          % .STO a just found "new" max
              end
          end
       else
          break                                       % no more rows, leave the reading loop
       end
    end
%-------------------------------------------------------------------------------------------
% FINALLY:
% all results are pre-calculated right at the end of .NET reading phase:
%
% -------------------------------
% BILL OF ALL COMPUTATIONAL COSTS ( for given scales of 1E5 columns x 1E6 rows ):
% -------------------------------
% HDD.IO: **ZERO**
% IN-RAM STORAGE:
% Attr Name Size Bytes Class
% ==== ==== ==== ===== =====
%         aMaxInCOLUMN                1x100000      800000  double
%         anIncrementalSumInCOLUMN    1x100000      800000  double
% aRowCOUNT 1x1 8 double
%
% DATA PROCESSING:
%
% 1.000.000x .NET row-oriented reads ( same for both the OP and this, smarter BigDATA approach )
% 1x INT in aRowCOUNT, %% 1E6 .INC-s
% 100.000x FLOATs in aMaxInCOLUMN[] %% 1E5 * 1E6 .CMP-s
% 100.000x FLOATs in anIncrementalSumInCOLUMN[] %% 1E5 * 1E6 .ADD-s
% -----------------
% about 15 sec per COLUMN of 1E6 rows
% -----------------
% --> mean()s are anIncrementalSumInCOLUMN./aRowCOUNT
%-------------------------------------------------------------------------------------------
% PIPE-LINE-d processing takes in TimeDOMAIN "nothing" more than the .NET-reader process
%-------------------------------------------------------------------------------------------
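For completeness, once the reader loop has finished, the requested results fall straight out of the accumulators with zero further HDD-IO ( names are those of the loop above; aMeanInCOLUMN is just the result vector chosen here ):

    % Finalisation step, executed once after the .NET reading loop has ended.
    aMeanInCOLUMN = anIncrementalSumInCOLUMN ./ aRowCOUNT;   % column-wise mean()-s, 1x100000
    % aMaxInCOLUMN already holds the column-wise max()-es -- nothing else left to compute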
Your pipe-lined BigDATA computation strategy will in a smart way principally avoid interim storage buffering in MATLAB, as it progressively calculates the results in not more than about 2 x 1E5 ADD/CMP-registers ( plus one row counter ), all with a static layout, avoids proxy-storage into an HDF5 / MAT-file, absolutely avoids all HDD-IO related bottlenecks and the low BigDATA sustained-read speeds ( not speaking at all about the interim BigDATA sustained-writes... ), and will also avoid ill-performing memory-mapped use just for counting mean-s and max-es.
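As a side note on the design, if the .NET side could hand over whole blocks of rows instead of single rows, the very same accumulators update vectorised, keeping the pipe-line intact while cutting the per-row call overhead ( a hedged sketch only; GetBlockFromDotNET is a hypothetical block-reader counterpart of the per-row GetDataFromDotNET used above, and the accumulators are initialised exactly as in the loop above ):

    % Hypothetical block-wise variant of the same pipe-lined accumulation.
    while true
        [ isStillProcessingDotNET, ...                 % a FLAG from the assumed block-reader
          aBlockOfROWs ...                             % an nRowsInBlock x 1E5 block of rows
          ] = GetBlockFromDotNET( aDtPT );             % hypothetical .NET block-reader
        if ( ~isStillProcessingDotNET )
             break                                     % no more blocks, leave the reading loop
        end
        aRowCOUNT                = aRowCOUNT + size( aBlockOfROWs, 1 );               % .INC by block height
        anIncrementalSumInCOLUMN = anIncrementalSumInCOLUMN + sum( aBlockOfROWs, 1 ); % vectorised .SUM-s
        aMaxInCOLUMN             = max( aMaxInCOLUMN, max( aBlockOfROWs, [], 1 ) );   % vectorised .CMP/.STO
    end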
Epilogue
The pipeline processing is nothing new under the Sun.
It re-uses what speed-oriented HPC solutions have already been using for decades
[ generations before the BigDATA tag was "invented" in Marketing Dept's. ]
Forget about zillions of HDD-IO blocking operations & go into a pipelined distributed process-to-process solution.
There is nothing faster than this.
If there were, all the FX business and HFT Hedge Fund Monsters would already be there...