
I'm running distributed training on a platform using MPI. During training I see a flood of messages like:

Read -1, expected 5017600, errno = 1
Read -1, expected 5017600, errno = 1
Read -1, expected 5017600, errno = 1
Read -1, expected 5017600, errno = 1
Read -1, expected 5017600, errno = 1
...

After some investigation, I found that this is caused by Docker's default seccomp profile (errno 1 is EPERM, i.e. the read is not permitted). If I run the container with `--cap-add=SYS_PTRACE`, the messages go away.
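
To reproduce the failure outside MPI, here is a minimal sketch. It assumes Linux with a libc that exports `process_vm_readv` (glibc 2.15 or later) and is written in Python with ctypes. It forks a child and uses `process_vm_readv(2)`, the Cross Memory Attach (CMA) syscall behind vader's CMA single-copy path, to copy a buffer out of the child. Under Docker's default seccomp profile without `CAP_SYS_PTRACE`, the call is expected to fail with errno 1 (EPERM), which would match the messages above. The helper name `probe_cma` and the payload are hypothetical. A parent reading its own child is also the most permissive case for the kernel's ptrace access checks (e.g. Yama), so success here does not by itself guarantee that sibling MPI ranks can read each other.

```python
import ctypes
import errno
import os


class IOVec(ctypes.Structure):
    """struct iovec from <sys/uio.h>: base address plus length."""
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]


# Assumes Linux with a libc that exports process_vm_readv (glibc >= 2.15).
libc = ctypes.CDLL(None, use_errno=True)
libc.process_vm_readv.restype = ctypes.c_ssize_t
libc.process_vm_readv.argtypes = [
    ctypes.c_int,                           # pid of the process to read from
    ctypes.POINTER(IOVec), ctypes.c_ulong,  # local iovec array and its count
    ctypes.POINTER(IOVec), ctypes.c_ulong,  # remote iovec array and its count
    ctypes.c_ulong,                         # flags, must be 0
]

PAYLOAD = b"single-copy test payload"  # hypothetical test data


def probe_cma():
    """Fork a child and copy PAYLOAD out of it with process_vm_readv.

    Returns (bytes_read, errno_value, data_read)."""
    src = ctypes.create_string_buffer(PAYLOAD, len(PAYLOAD))
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: after fork, `src` sits at the same virtual address as in the parent.
        try:
            os.close(w)
            os.read(r, 1)  # block until the parent closes its end of the pipe
        finally:
            os._exit(0)
    os.close(r)
    dst = ctypes.create_string_buffer(len(PAYLOAD))
    local = IOVec(ctypes.addressof(dst), len(PAYLOAD))
    remote = IOVec(ctypes.addressof(src), len(PAYLOAD))
    n = libc.process_vm_readv(pid, ctypes.byref(local), 1, ctypes.byref(remote), 1, 0)
    err = ctypes.get_errno() if n == -1 else 0
    os.close(w)         # the child's read() now sees EOF and the child exits
    os.waitpid(pid, 0)  # reap the child
    return n, err, dst.raw[:max(n, 0)]


if __name__ == "__main__":
    n, err, data = probe_cma()
    if n == -1:
        # Same shape as the Open MPI message, plus the symbolic errno name.
        print(f"Read {n}, expected {len(PAYLOAD)}, errno = {err} "
              f"({errno.errorcode.get(err, '?')})")
    else:
        print(f"process_vm_readv copied {n} bytes: {data!r}")
```

Run it inside the platform's container. If it prints `errno = 1 (EPERM)`, the syscall itself is being refused, independent of MPI.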

However, I cannot add flags to `docker run`, because I don't control how the containers are launched; the platform launches them. Is there a way to change the ptrace setting from the Dockerfile or from inside the running container?
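
As the first comment suggests, this is likely not possible. The capability bounding set and the seccomp filter are fixed when the platform starts the container. A process inside cannot regain a capability that was dropped from its bounding set, and seccomp filters can be added but never removed. What you can do from inside is confirm what you were given. Here is a minimal sketch, assuming the Linux `/proc/self/status` fields `CapEff` (hex bitmask of effective capabilities) and `Seccomp` (0 disabled, 1 strict, 2 filter) described in proc(5). `ptrace_status` is a hypothetical helper name.

```python
CAP_SYS_PTRACE = 19  # capability bit number from <linux/capability.h>
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def ptrace_status(status_text=None):
    """Report whether CAP_SYS_PTRACE is effective and which seccomp mode is active.

    Parses /proc/self/status unless `status_text` is supplied."""
    if status_text is None:
        with open("/proc/self/status") as f:
            status_text = f.read()
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    cap_eff = int(fields.get("CapEff", "0"), 16)  # effective capabilities (hex)
    seccomp = int(fields.get("Seccomp", "0"))     # a missing field is treated as 0
    return {
        "cap_sys_ptrace": bool((cap_eff >> CAP_SYS_PTRACE) & 1),
        "seccomp_mode": SECCOMP_MODES.get(seccomp, str(seccomp)),
    }


if __name__ == "__main__":
    print(ptrace_status())
```

For example, a default Docker container running as root typically reports `CapEff: 00000000a80425fb`. Bit 19 is clear in that mask, so `cap_sys_ptrace` comes out `False`, while `Seccomp: 2` shows the default filter is active.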

I also found that running MPI with `btl_vader_single_copy_mechanism` set to `none` suppresses these messages, but it hurts performance, so that is not an option.
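
For A/B benchmarking the two settings, as discussed in the comments, the MCA parameter can be passed either as `mpirun --mca btl_vader_single_copy_mechanism none ...` or through an `OMPI_MCA_<name>` environment variable, which Open MPI also reads. Here is a minimal sketch of the environment-variable form. `no_cma_launch` is a hypothetical helper, and the launch arguments are placeholders.

```python
import os


def no_cma_launch(mpi_args, base_env=None):
    """Hypothetical helper: build argv and env for an mpirun launch with
    the vader single-copy mechanism turned off."""
    env = dict(os.environ if base_env is None else base_env)
    # Open MPI reads MCA parameters from OMPI_MCA_<name> environment variables;
    # this is equivalent to `mpirun --mca btl_vader_single_copy_mechanism none`.
    env["OMPI_MCA_btl_vader_single_copy_mechanism"] = "none"
    return ["mpirun", *mpi_args], env
```

Pass the result to `subprocess.run(argv, env=env, check=True)` for the "none" run. For the default run, launch without the variable.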

Any help would be much appreciated!

  • This is unlikely, since that could be a security flaw. That being said, did you benchmark your app without the single copy mechanism? If you are trying to squeeze out all the performance you can, then you should consider running on bare metal instead of in a Docker container. If you are simply looking for a container solution, `singularity` is a much better fit for HPC than Docker (and I am pretty sure you would not have to disable the single copy mechanism). – Gilles Gouaillardet May 27 '20 at 07:41
  • Premature optimisation is the root of all evil. Have you measured the actual impact of disabling the CMA mechanism for `vader`, or are you purely speculating? – Hristo Iliev May 27 '20 at 08:59
  • @HristoIliev Yes, I benchmarked the performance with `btl_vader_single_copy_mechanism none`, and it is much slower. – user3391299 May 27 '20 at 16:49
  • It would seem then that Docker is not your friend. – Hristo Iliev May 27 '20 at 17:53
