
I have a dataset which includes approximately 2000 digital images. I am using MATLAB to perform some digital image processing to extract trees from the imagery. The script is currently configured to process the images in a parfor loop on n cores.

The challenge:
I have access to processing time on a University-managed supercomputer with approximately 10,000 compute cores. If I submit the entire job for processing, I get put so far back in the tasking queue that a desktop computer could finish the job before processing even starts on the supercomputer. I have been told by support staff that partitioning the 2000-file dataset into ~100-file jobs will significantly decrease the queue time. What method can I use to perform the tasks in parallel using the parfor loop, while submitting 100 files (of 2000) at a time?

My script is structured in the following way:

datadir = 'C:\path\to\input\files';
files = dir(fullfile(datadir, '*.tif'));
fileIndex = find(~[files.isdir]);

parfor ix = 1:length(fileIndex) 
     % Perform the processing on each file;
end
Borealis
    In all probability I misunderstood your question. Just in case: what about `parfor i = 1:20, for j = 1:100, ..., end, end`? The data handling should be trivial... – matheburg May 28 '14 at 21:03
  • @matheburg I edited the post to clarify the structure of my current script. You are right on, I am looking for advice on restructuring the `parfor` loop. – Borealis May 28 '14 at 22:28

2 Answers


Similar to my comment, I would spontaneously suggest something like:

datadir = 'C:\path\to\input\files';
files = dir(fullfile(datadir, '*.tif'));
files = files(~[files.isdir]);

% split up the data
N = length(files); % e.g. 2000
jobSize = 100;
splits = [jobSize*ones(1, floor(N/jobSize)), mod(N, jobSize)];
splits = splits(splits > 0); % drop the empty tail when N is an exact multiple of jobSize
jobFiles = mat2cell(files, splits);
jobNum = length(jobFiles);

% Provide each job to a worker
parfor jobIdx = 1:jobNum
    thisJob = jobFiles{jobIdx}; % this indexing lets MATLAB transfer
                                % only the relevant file data to each worker

    for fIdx = 1:length(thisJob)
        thisFile = thisJob(fIdx);
        % Perform the processing on each file;
        thisFile.name % e.g. display which file this worker is handling
    end
end
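
Note that the parfor above still processes every batch inside a single submission. If the goal from the question is to submit each ~100-file batch as its own cluster job, here is a minimal sketch of one way to do it. It assumes a SLURM-style job array that passes a 1-based batch index through the SLURM_ARRAY_TASK_ID environment variable and a batch size of 100; the variable name and indexing are assumptions to adapt to your scheduler.

% Minimal sketch: each cluster job processes only its own ~100-file slice.
% Assumptions: SLURM-style job array, 1-based array indices, batchSize = 100.
batchIdx  = str2double(getenv('SLURM_ARRAY_TASK_ID'));
batchSize = 100;

datadir = 'C:\path\to\input\files';
files   = dir(fullfile(datadir, '*.tif'));
files   = files(~[files.isdir]);

firstFile = (batchIdx - 1)*batchSize + 1;
lastFile  = min(batchIdx*batchSize, numel(files));

parfor ix = firstFile:lastFile
    thisFile = files(ix);
    % Perform the processing on each file, e.g.
    % img = imread(fullfile(datadir, thisFile.name));
end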
matheburg

Let me try to answer the higher level question of job partitioning to optimize for supercomputer queues. I find that a good rule of thumb is to submit jobs of size sqrt(p) on a machine with p processors, if the goal is to maximize throughput. Of course, this assumes a relatively balanced queue policy, which is not implemented at all sites. But most universities don't prioritize large jobs the way DOE facilities do, so this rule should work in your case.
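
As a rough worked example (the numbers come from the question, not from this rule itself): with p ≈ 10,000 cores, sqrt(p) ≈ 100 cores per job, which happens to line up with the support staff's suggestion of ~100-file batches if each file occupies roughly one core.

% Back-of-the-envelope sizing with the question's numbers (illustrative only)
p        = 10000;                    % total cores on the machine
jobCores = round(sqrt(p));           % ~100 cores per job by the sqrt(p) rule
nFiles   = 2000;                     % images to process
nJobs    = ceil(nFiles / jobCores);  % ~20 submissions of ~100 files each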

I don't have a mathematical theory behind my rule of thumb, but I've been a large DOE supercomputer user over the past 8 years (100M+ hours personally, allocation owner for 500M+) and I was on staff at one of the DOE sites until recently (albeit one that has a queue policy that breaks my rule).

Jeff Hammond
  • Hi Jeff, I have no experience with supercomputers, but as I stumbled over your answer, I have to ask: 1. So you assign fewer jobs than processors? How can this be useful? 2. I am not exactly sure how it relates to the question or my answer. Would you advise setting `jobSize = N / sqrt(numberOfProcessors)`? Probably I am getting something wrong :) – matheburg Apr 13 '20 at 18:26
  • What I was trying to say is: if you are using a supercomputer with 900 nodes, jobs using 30 nodes are likely to make it through the queue relatively quickly, because they are small enough not to require draining (when the machine idles nodes in order to free up a large block), yet larger than the small jobs that are submitted in very large quantities by users whose codes do not scale. – Jeff Hammond Apr 13 '20 at 18:46
  • Thanks for the fast response. So do I get it right that in our case this could mean restricting the number of workers that we register to `30` (to stay with your example)? – matheburg Apr 13 '20 at 18:48
  • If you are going for throughput, it is better to submit 30 jobs of 30 nodes than 1 job of 900 nodes or 900 jobs of 1 node. Not all workloads are arbitrarily decomposable, of course, but some are. – Jeff Hammond Apr 13 '20 at 20:17