
I am looking for a way to run simple parallel processes (one function run multiple times with different arguments, no communication between processes) across multiple nodes in a PBS cluster.

Currently I am able to run it on a single node by setting the number of threads with an environment variable in the PBS script and using a for loop with Threads.@threads.

I have found references to ClusterManagers.jl, but no clear working example of how to use it with PBS. For example: does addprocs_pbs, called inside the Julia file, also take care of the job-script part, or do I still need to submit a PBS script as usual and call this function from the Julia file it runs?

This is the code structure I am using now. Ideally, it would stay more or less the same, but the parallel processes would run across multiple nodes.

using CSV, DataFrames, JLD, Random
include("path/to/library/with/function.jl")

seed = 342
n = 18  # number of simulations

changing_parameter = [1, 2, 3, 4]

input_file = "some file"
CSV.read(string(input_files_folder, input_file), DataFrame)

# I should also parallelise this external for loop
# it currently runs 18 simulations per run, and saves the results each time
for P in changing_parameter

    Random.seed!(seed)
    seeds = rand(1:100000, n)

    # preallocate and write by index: push! on a shared array from
    # several threads is a data race
    results = Vector{Any}(undef, n)
    Threads.@threads for i = 1:n
        # my_function is a placeholder for the function from the included file
        results[i] = my_function(some_fixed_parameters, P=P, seed=seeds[i])
    end

    # get the results
    # save the results
    JLD.save(filename, to_save, compress=true)

end
tidus95

1 Answer


For distributed computing you normally need multiprocessing rather than multi-threading, because threads share the memory of a single process and so cannot span nodes (although it is fine for each parallel process to be multi-threaded itself if you need that).

Hence, what you need is the ClusterManagers.jl package, which uses the cluster's job manager to allocate the worker processes for your Julia cluster.

I have been using Julia on Cray clusters with SLURM, so not exactly PBS. However, since your question remains unanswered, here is my working code. You would use addprocs_pbs, which looks to have a very similar structure.

using ClusterManagers
addprocs_slurm(36,job_name="jobname", account="some_acc_name", time="01:00:00", exename="/lustre/tetyda/home/pszufe/julia/usr/bin/julia")

Once you have added the worker processes, all that remains is to use the Distributed package to orchestrate your workload.
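As a rough sketch of that last step, assuming each simulation is independent: the snippet below spreads every (parameter, seed) pair over the workers with pmap, so the outer loop from the question is parallelised as well. The function run_simulation is a hypothetical placeholder for the real simulation, and addprocs(2) starts two local workers only so that the sketch runs on a laptop; on the cluster that line would be replaced by the ClusterManagers call.

```julia
using Distributed

# Local stand-in so the sketch runs anywhere. On the cluster, replace this
# line with the ClusterManagers call (addprocs_slurm or addprocs_pbs).
addprocs(2)

# Anything the workers call must be defined on every process.
@everywhere begin
    using Random
    # Hypothetical placeholder for the real simulation function.
    function run_simulation(P, seed)
        rng = MersenneTwister(seed)
        return P * rand(rng)
    end
end

changing_parameter = [1, 2, 3, 4]
n = 18  # simulations per parameter value

# Same seeds for every parameter value, as in the question's loop.
Random.seed!(342)
seeds = rand(1:100000, n)

# One job per (parameter, seed) pair; pmap returns results in job order.
jobs = [(P, s) for P in changing_parameter for s in seeds]
results = pmap(job -> run_simulation(job...), jobs)

# Regroup into one vector of n results per parameter value.
by_parameter = Dict(P => results[(k - 1) * n + 1 : k * n]
                    for (k, P) in enumerate(changing_parameter))

rmprocs(workers())  # release the workers when done
```

Because pmap hands out jobs dynamically, simulations of uneven length are balanced across the workers without extra code. Saving the results can stay on the master process, one file per entry of by_parameter.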

Przemyslaw Szufel
  • 1
    Thank you! My understanding is that `addprocs_slurm` does the work of a PBS/SLURM script, right? Does this mean I can simply launch the Julia script from the terminal, or do I still need to prepare a PBS script reserving the right number of nodes, processes, etc.? – tidus95 Dec 16 '20 at 10:38
  • 1
    In the case of `addprocs_slurm`, it did the full work of Slurm, so I did not need to do anything else. I expect the same for PBS. Just note that on some clusters the worker nodes have different hardware than the access node, so in my case the most tedious part was compiling Julia on the worker nodes with their ancient library configuration. I had to point to that Julia build via the `exename` parameter, but I did not need to write any of the SLURM scripts. – Przemyslaw Szufel Dec 16 '20 at 13:02