I am using podman to run a model on Azure. I need to run the model many times, so I am sending these jobs to additional nodes that are managed by SLURM.
The podman command I use is:
podman run --rm --mount "type=bind,src=/inputfiles/,dst=/app/model" --mount "type=bind,src=/outputfiles,dst=/app/model/out" modelImage runscript.sh
This mounts local directories for the input files and the output files. The image is modelImage, and runscript.sh is the script that is passed as the last argument, overriding the CMD in my Dockerfile. The Dockerfile just installs a Linux distribution, libraries, and the model code, then compiles the code.
The last part of my Dockerfile reads:
ENTRYPOINT ["sh"]
CMD ["runmodel.sh"]
So if I run the podman command above without the last argument, runmodel.sh is executed; otherwise it is overridden with runscript.sh.
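To spell out how I understand the override: the command that runs in the container is the ENTRYPOINT followed by the CMD, and any arguments given after the image name replace the CMD. The sketch below only mimics that rule in plain shell; effective_command is a hypothetical helper I made up for illustration, not part of podman.

```shell
#!/bin/sh
# Hypothetical helper (not a podman command): mimics how ENTRYPOINT and CMD
# are combined. Arguments after the entrypoint stand in for the arguments
# given after the image name on the "podman run" command line.
effective_command() {
    entrypoint=$1
    shift
    if [ "$#" -eq 0 ]; then
        # No override given, so fall back to the CMD from my Dockerfile.
        set -- "runmodel.sh"
    fi
    echo "$entrypoint $*"
}

effective_command sh                # prints: sh runmodel.sh
effective_command sh runscript.sh   # prints: sh runscript.sh
```

If that is right, I would expect sh to be the program that gets started in both cases, with the script name passed to it as an argument.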
This all works fine when I run my model on the main node (the controller) of my cluster.
However, if I submit a job to an additional node, I get the following error:
Error: container_linux.go:349: starting container process caused "exec: "runscript.sh": executable file not found in $PATH": OCI runtime command not found error
I am not sure what this means exactly. Does it mean that /bin/bash can't be found in $PATH on the node? How should I be thinking about troubleshooting/resolving this?
It seems like the environment on the new nodes is different from that of the controller. Thanks
EDIT: Both runscript.sh and runmodel.sh are provided by the input bind mount. They do not exist in the image. The version of podman I am using is 1.6.4 (bundled with CentOS 7). I am trying to get 4.5.0 installed to determine whether this is the reason.
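Since both scripts come from the bind mount, one comparison I can make between the controller and a compute node is whether the mount source actually contains the script there, and whether podman and the image look the same. Below is a minimal sketch of such a check. check_script and report_podman are hypothetical helper names of my own; /inputfiles and modelImage are the values from my command above. I am assuming that the image inspect output exposes Config.Entrypoint and Config.Cmd on the podman version in use.

```shell
#!/bin/sh
# Hypothetical helper: report whether a script is present in the directory
# that is used as the bind-mount source on this node.
check_script() {
    dir=$1
    script=$2
    if [ ! -d "$dir" ]; then
        echo "missing-dir"
    elif [ ! -f "$dir/$script" ]; then
        echo "missing-script"
    else
        echo "ok"
    fi
}

# Hypothetical helper: print the podman version and the image's
# ENTRYPOINT/CMD so the output can be compared between nodes.
report_podman() {
    image=$1
    if ! command -v podman >/dev/null 2>&1; then
        echo "podman: not installed"
        return 0
    fi
    podman --version
    # Assumption: these fields exist in the image inspect data.
    podman inspect --type image \
        --format 'entrypoint={{.Config.Entrypoint}} cmd={{.Config.Cmd}}' "$image"
}

# Example calls, to be run on the controller and inside a SLURM job:
#   check_script /inputfiles runscript.sh
#   report_podman modelImage
```

Running the two example calls on the controller and then through a SLURM job on a compute node should show whether the difference is in the mounted files, the podman version, or the image itself.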