
This is a cross-post from here: Link to post in the Mathworks community

Currently I'm working with large data sets, saved as MATLAB files; the two biggest files are 9.5 GB and 5.9 GB. Each file contains a 1x8 cell array (I chose cells for addressability and to avoid mixing up data from the 8 sets; I specifically wanted to avoid eval). Each cell contains a 3D double matrix: 1001x2002x201 in one file and 2003x1001x201 in the other (when processing the latter I chop off one row at the end to get it to 2002).

I'm already running my script on a server (64 cores and plenty of RAM; MATLAB crashed on my laptop, as I need more than 12 GB of RAM on Windows). Nonetheless, the script still takes several hours to finish, and I still need to do some extra operations on the data, which is why I'm asking for advice.

For some of the large cell arrays, I need to find the maximum value over the entire set of all 8 cells. Normally I would run a for loop to get the maximum of each cell, store each value in a temporary numeric array, and then call max again on that array. This will certainly work; I'm just wondering if there's a better, more efficient way.
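For what it's worth, the per-cell maxima can be collected in one line with `cellfun` instead of a hand-written loop. A minimal sketch, assuming `C` stands in for one of the 1x8 cell arrays (the real variable names differ):

```matlab
cellMaxima = cellfun(@(x) max(x(:)), C);  % one maximum per cell, 1x8 vector
globalMax  = max(cellMaxima);             % maximum over all 8 cells
```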

After I find the maximum I need to do a manipulation over all this data as well, normally I would do something like this for an array:

B=A./maxvaluefound;
A(B > a) = A(B > a)*constant;

Now I could put this in a for loop, address each cell and run it, but I'm not sure how efficient that would be. Do you think there's a better way than a for loop that's not extremely complicated/difficult to implement?
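In case it helps, a plain loop version that computes the logical mask only once per cell might look like this (a sketch; `C`, `globalMax`, `a`, and `constant` are stand-in names for the actual variables):

```matlab
for ee = 1:8
    mask = C{ee}./globalMax > a;          % compute the logical index once
    C{ee}(mask) = C{ee}(mask)*constant;   % scale only the masked elements
end
```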

There's one more thing I need to do which is really important. As I said before, each cell holds a stack of slices (consider them time steps), and each slice holds the values for a 3D matrix/plot. I need to interpolate the data in time so that I get more slices, because I need extra slices/frames to create a movie/GIF. I'm planning to plot the 3D data using scatter3, with the data represented by color, and I plan to use alpha values to make the points see-through so that one can actually see the intensity in this 3D plot. I understand how to use griddata, but apparently it's quite slow, and some of the other methods were hard to understand. So what would be the best way to interpolate these (time) slices efficiently over the different cells in the cell array? Please explain it if you can, preferably with an example.
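One possibility for the time interpolation is interp1 along the third dimension, which avoids calling griddata per frame since the spatial grid is already uniform after the gridding step. A sketch, assuming `V` stands in for one cell's 3D matrix (rows x cols x slices); all names here are placeholders:

```matlab
nSlices = size(V, 3);
tOld    = 1:nSlices;
tNew    = 1:0.25:nSlices;                 % e.g. 4x as many time slices
Vp      = permute(V, [3 1 2]);            % bring time to the first dimension
Vq      = interp1(tOld, Vp(:,:), tNew, 'linear');  % interpolate every pixel at once
Vnew    = ipermute(reshape(Vq, numel(tNew), size(V,1), size(V,2)), [3 1 2]);
```

interp1 applied to a matrix interpolates each column independently, so reshaping the time dimension to the first axis lets one call handle all pixels of a cell at once.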

I've added a picture of the Linux server I'm running it on below; note that I cannot update the MATLAB version unfortunately, it's R2016a: Specs_server

I've also attached part of my code to give a better idea of what I'm doing:

if (or(L03==2,L04==2)) % check if this section needs to be executed based on parameters set at top of file
    load('../loadfilewithpathnameonmypc.mat')
    E_field_650nm_intAll=cell(1,8); %create empty cell array
    parfor ee=1:8 %run for loop for cell array, changed this to a parfor to increase speed by approximately 8x
        E_field_650nm_intAll{ee}=nan(szxit(1),szxit(2),xres); %create nan-filled matrix in cell 1-8
        for qq=1:2:xres
            tt=(qq+1)/2; %consecutive number instead of spacing 2
            T1=griddata(Xsall{ee},Ysall{ee},EfieldsAll{ee}(:,:,qq)',XIT,ZIT,'natural'); %change data on non-uniform grid to uniform gridded data
            E_field_650nm_intAll{ee}(:,:,tt)=T1; %fill up each cell with uniform data
        end
    end
    clear T1
    clear qq tt
    clear ee
    save('../savelargefile.mat', 'E_field_650nm_intAll', '-v7.3')
end


if (L05==2) % check if this section needs to be executed based on parameters set at top of file
    if ~exist('E_field_650nm_intAll','var') % if variable not in workspace load it
        load('../loadanotherfilewithpathnameonmypc.mat');
    end

    parfor tt=1:8 %run for loop for cell array, changed this to a parfor to increase speed by approximately 8x
        CFxLight{tt}=nan(szxit(1),szxit(2),xres); %create nan-filled matrix in cells 1 to 8
        for qq=1:xres
            CFs=Cafluo3D{tt}(1:lxq2,:,qq)'; %get matrix slice and transpose matrix for point-wise multiplication
            CFxLight{tt}(:,:,qq)=CFs.*E_field_650nm_intAll{tt}(:,:,qq); %point-wise multiple the two large matrices for each cell and put in new cell array
        end
    end
    clear CFs
    clear qq tt
    save('../saveanotherlargefile.mat', 'CFxLight', '-v7.3')
end
Bob van de Voort
  • you shouldn't worry too much about those 8 cells (whether you use a `for` loop or `cellfun` doesn't make much difference here). Think twice about how you store the data (maybe cast it to single precision?) and which operations are necessary. Reading/writing slows you down, and so does `clear`, though it helps keep the memory clean (better to call functions)... Can you shrink the 3D matrices? ... It's really hard to follow your code as it is ugly and uses very specific names (and is not reproducible). Sorry, we can just guess here – max Mar 26 '20 at 20:55
  • I don't think I can shrink the 3D matrices (at least not in any way I know of). What do you mean by read/write specifically — saving and loading, or something else? The code itself is also not critical, but let me try to beautify it a bit. – Bob van de Voort Mar 26 '20 at 21:27
  • `parfor` means that each MATLAB process gets its own copy of the data. It might not be worthwhile doing the loop in parallel. On the other hand, you could consider loading only one big array at a time. Each of them is 3 GB in memory, so it would be worthwhile to figure out how not to need them all at once. – Cris Luengo Mar 27 '20 at 00:29
  • So regarding slow loading/saving (which I assume max meant), I've put a few things in place to try to help with that. I set the warning that a file cannot be saved to an error and use try/catch to first try to save it as v7 and then v7.3. I've also noticed that v7.3 is causing part of the issues here: much slower loading/saving and MUCH bigger file sizes. I haven't found any good workarounds, as HDF5 is like Chinese to me — I don't understand it. – Bob van de Voort Mar 27 '20 at 02:08
  • yes, reading/writing are the more commonly used universal terms for loading/saving ;) Quick questions & comments: 1) do you really need uniformly gridded data? 2) clearing loop-control variables (size 1x1) is not worth it. 3) introduce a couple of `tic`-`toc`s to see where your problem really is. 4) since you only save/load once, this won't save you much time. Honestly, I don't see much heavy computing here. 5) @CrisLuengo is right, get rid of `parfor`. 6) your naming is odd (`tt`, `qq`, `E_field_650nm_intAll`, really? Not helpful for others...) and do we need to see the second `if`-clause? – max Mar 27 '20 at 06:46
  • Hey max thank you very much. It seems the problem is mostly with loading and saving files that need to be saved with 'v7.3'. I'm currently looking into this. I'll probably give a more extensive response tomorrow. – Bob van de Voort Mar 27 '20 at 17:42

0 Answers