Greetings,
I apologize beforehand for the lengthy post.
My question: How do you modify the loop between the * * * markers in errFunction, the runABQfile function (subprocess.call), and the bash script below so that I can run a PSO optimization on a cluster?
The Background: I am calibrating a model using Particle Swarm Optimization (PSO) written in Python and ABAQUS with a VUMAT (user material). A Python script updates the input files of N different ABAQUS models (which correspond to N different experiments) at each iteration and should run each of the N models until the global error between experiments and models is minimized. I am running this optimization on a cluster where I do not have admin privileges.
Assume I have a working main script main.py that imports the necessary modules, initializes variables, and reads the experimental data before calling the PSO function in PSO.py using
XOpt, FOpt = pso(errFunction, lb, ub, f_ieqcons=mycons, args=args)
The target function errFunction to be minimized runs all N models using the runABQfile function and returns the global error of each iteration to the PSO function. A brief view of the structure of my code is shown below (I left out the parts that are not relevant).
def errFunction(param2Calibrate,otherProps,InputFiles,experimentData,otherArgs):
    maxNpts = otherArgs[0]
    nAnalysis = otherArgs[1]

    # Run Each Abaqus Simulation
    inpFile = [0] * nAnalysis
    abqDisp = [[0 for x in range(maxNpts)] for y in range(nAnalysis)]
    abqForce = [[0 for x in range(maxNpts)] for y in range(nAnalysis)]
    iexpForce = [[0 for x in range(maxNpts)] for y in range(nAnalysis)]
    Err = [0] * nAnalysis

    # ***********************************#
    # - Update and Run Each Input File - #
    for r in range(nParallelLoops):   # nParallelLoops is defined elsewhere
        for k in range( r*nAnalysis//nParallelLoops, (r+1)*nAnalysis//nParallelLoops ):
            # - Write and Run Abaqus INP file - #
            inpFile[k] = writeABQfile(param2Calibrate,otherProps[k],InputFiles[k])
            runABQfile(inpFile[k])
            # - Extract from Abaqus ODB - #
            abqDisp_, abqForce_ = extraction(inpFile[k])
            abqDisp[k][0:len(abqDisp_)] = abqDisp_
            abqForce[k][0:len(abqForce_)] = abqForce_
    # ***********************************#

    # - Interpolate Experimental Results to Match Abaqus - #
    for k in range(0,nAnalysis):
        iexpForce_ = interpolate(experimentData[k],abqDisp[k])
        iexpForce[k][0:len(iexpForce_)] = iexpForce_

    # - Get Error - #
    for k in range(0,nAnalysis):
        Err[k] = Error(iexpForce[k],abqDisp[k],abqForce[k])

    return Err
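To illustrate the kind of restructuring I am after, here is a rough sketch (not working code; runOneModel is a hypothetical helper) of the block between the * * * markers handing each model to a thread pool, so that several ABAQUS jobs run at the same time and errFunction only continues once all of them have finished. It relies on runABQfile blocking until its job is done, which subprocess.call does:

from multiprocessing.pool import ThreadPool

def runOneModel(k):
    # write the INP file, run ABAQUS (blocking), then read back the ODB results
    inpFile[k] = writeABQfile(param2Calibrate, otherProps[k], InputFiles[k])
    runABQfile(inpFile[k])
    return k, extraction(inpFile[k])

pool = ThreadPool(nParallelLoops)                 # e.g. 4 concurrent ABAQUS jobs
for k, (abqDisp_, abqForce_) in pool.imap_unordered(runOneModel, range(nAnalysis)):
    abqDisp[k][0:len(abqDisp_)] = abqDisp_
    abqForce[k][0:len(abqForce_)] = abqForce_
pool.close()
pool.join()                                       # all N models finished before the error loop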
And runABQfile is set up as follows, where the two processes are run in series:
def runABQfile(inpFile):
    import subprocess
    import os

    # - Run Abaqus - #
    ABQexe = '/opt/abaqus/6.14-1/code/bin/abq6141'
    prcStr1 = (ABQexe+' '+'job='+inpFile+' input='+inpFile+' \
        user=$HOME/mPDFvumatNED.f scratch=/scratch/$USER/$SLURM_JOBID \
        cpus=12 parallel=domain domains=12 mp_mode=mpi memory=60000mb \
        interactive double=both')
    prcStr2 = (ABQexe+' '+'cae noGUI='+inpFile+'_CAE.py')
    process = subprocess.call(prcStr1,stdin=None,stdout=None,stderr=None,shell=True)
    process = subprocess.call(prcStr2,shell=True)
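For debugging on the cluster, the same two commands could also be wrapped so that each job's console output goes to its own log file and the exit codes are returned to the caller (a minimal sketch; runABQfileLogged is a hypothetical name):

import subprocess

def runABQfileLogged(inpFile):
    # same two commands as runABQfile above, but with per-job logging and exit codes
    ABQexe = '/opt/abaqus/6.14-1/code/bin/abq6141'
    solveCmd = (ABQexe + ' job=' + inpFile + ' input=' + inpFile +
                ' user=$HOME/mPDFvumatNED.f scratch=/scratch/$USER/$SLURM_JOBID'
                ' cpus=12 parallel=domain domains=12 mp_mode=mpi memory=60000mb'
                ' interactive double=both')
    postCmd = ABQexe + ' cae noGUI=' + inpFile + '_CAE.py'
    with open(inpFile + '.log', 'w') as log:
        rcSolve = subprocess.call(solveCmd, shell=True, stdout=log, stderr=log)
        rcPost = subprocess.call(postCmd, shell=True, stdout=log, stderr=log)
    return rcSolve, rcPost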
Where the problem seems to be: I have access to a maximum of 2 nodes with 24 CPUs per job (restricted by the number of ABAQUS licenses). If I were to run a single analysis, I would queue the job using SLURM with the following script.
#!/bin/bash
#SBATCH --job-name="abaqus"
#SBATCH --output="abaqus.%j.%N.out"
#SBATCH --partition=debug
#SBATCH --nodes=2
#SBATCH --export=ALL
#SBATCH --ntasks-per-node=24
#SBATCH -L abaqus:25
#SBATCH -t 00:30:00
#Get the env file setup
scontrol show hostname > file-list1
scontrol show hostlist > file-list2
HOST1=`sed -n '1p' file-list1`
HOST2=`sed -n '2p' file-list1`
cat abq_v6.env |sed -e "s/host1/$HOST1/g" > ttt1.env
cat ttt1.env | sed -e "s/host2/$HOST2/g" > abaqus_v6.env
rm ttt*env
#Run the executable remotely
sed "s/DUMMY/$SLURM_JOBID/g" s4b.sh.orig > s4b.sh
chmod u+x s4b.sh
export EXHOST=`/bin/hostname`
ssh $EXHOST $SLURM_SUBMIT_DIR/s4b.sh
where s4b.sh.orig looks like this:
#!/bin/bash -l
cd /share/apps/examples/ABAQUS/s4b_multinode
module purge
module load abaqus/6.14-1
export EXE=abq6141
$EXE job=s4b scratch=/scratch/$USER/DUMMY cpus=48 -verbose 3 \
standard_parallel=all mp_mode=mpi memory=120000mb interactive
This script setup is the only way to submit one ABAQUS job that runs on multiple nodes on that cluster, because of problems with the ABAQUS environment file and SLURM (my guess is that mp_host_list is not being properly assigned or is oversubscribed, but honestly I do not understand what could be going on).
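For reference, one way I could imagine tying everything together (only a sketch, with the directives copied from the single-analysis script above) is to submit the whole optimization as a single job and run main.py inside the allocation, so that errFunction starts the individual ABAQUS runs itself:

#!/bin/bash
#SBATCH --job-name="pso_abaqus"
#SBATCH --output="pso_abaqus.%j.%N.out"
#SBATCH --partition=debug
#SBATCH --nodes=2
#SBATCH --export=ALL
#SBATCH --ntasks-per-node=24
#SBATCH -L abaqus:25
#SBATCH -t 00:30:00

# run the PSO driver inside the allocation; errFunction then launches the
# individual ABAQUS jobs through runABQfile
module purge
module load abaqus/6.14-1
cd $SLURM_SUBMIT_DIR
python main.py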
I modified my runABQfile function to use the same bash construct when calling subprocess.call, to something like this:
prcStr1 = ('sed "s/DUMMY/$SLURM_JOBID/g" s4b.sh.orig > s4b0.sh; \
sed "s/MODEL/inpFile/g" s4b0.sh > s4b1.sh; \
chmod u+x s4b1.sh; \
export EXHOST=`/bin/hostname`; \
ssh $EXHOST $SLURM_SUBMIT_DIR/s4b1.sh' )
process = subprocess.call(prcStr1,stdin=None,stdout=None,stderr=None,shell=True)
But the optimization never starts; it quits right after modifying the first script.
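One thing I notice is that the second sed command replaces MODEL with the literal string inpFile rather than the value of the Python variable, and every call writes the same s4b1.sh, so concurrent calls would overwrite each other. A corrected sketch of that pipeline (hypothetical names, untested) might be:

import subprocess

# interpolate the actual value of inpFile and give each model its own script
script = 's4b_' + inpFile + '.sh'
prcStr1 = ('sed -e "s/DUMMY/$SLURM_JOBID/g" -e "s/MODEL/' + inpFile + '/g" '
           's4b.sh.orig > ' + script + '; '
           'chmod u+x ' + script + '; '
           'ssh `/bin/hostname` $SLURM_SUBMIT_DIR/' + script)
ret = subprocess.call(prcStr1, shell=True)
if ret != 0:
    print('launch failed for ' + inpFile + ' (exit code ' + str(ret) + ')')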
Now the question again is: how do you modify the loop between the * * * markers in errFunction, the runABQfile function (subprocess.call), and the bash script so that I can run this optimization? I would like to use at least 12 processors per ABAQUS model, potentially running 4 jobs at the same time. Keep in mind that all N models need to run and finish before moving to the next iteration.
I would appreciate any help you guys could provide.
Sincerely,
D P.