5

I am attempting to open multiple sub-processes, each running the same pre-compiled binary but operating on files in unique directories, under Python 2.7 using subprocess32.Popen(). Most of the time things work fine, but all too often I get an OSError: [Errno 14] Bad address. Here is the code:

self.gld_stdout_file = open('stdout', 'w+')
self.gld_stderr_file = open('stderr', 'w+')
...
subprocess.Popen(" ".join(gld_open_str), shell=True, stderr=self.gld_stderr_file,
                 stdout=self.gld_stdout_file, bufsize=-1, close_fds=ON_POSIX,
                 env={'TEMP':temp_path})

This error occurs in about 5-10% of the attempts to use Popen(), while other Popen() calls in the same loop work just fine. Looking around, it seems this could come from an error in lower-level socket calls that I am not calling directly (e.g. here or here).

Any ideas on why I am getting this error?

And more importantly:

How might I fix it?

For reference, we are using subprocess32, which supposedly offers improved stability with multiple subprocess calls. Also, if relevant, the entire scheme is wrapped up into a larger MPI-based HPC parallel call, such that multiple compute nodes are attempting to do the same thing at the same time. Fearing there might be some conflict or filesystem challenge with multiple attempts to execute the same file, we are already copying the binary to each of these nodes before execution.

Also, I see the same problem using shell=False as in:

subprocess.Popen(gld_open_list, shell=False, stderr=self.gld_stderr_file,
                 stdout=self.gld_stdout_file, bufsize=-1, close_fds=ON_POSIX,
                 env={'TEMP':temp_path})
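
For illustration, the launch pattern described above (one process per unique directory, with stdout and stderr redirected to files) could be sketched roughly as follows. This is only a sketch under assumptions: `launch_in_dir` and `run_dir` are hypothetical names, the standard `subprocess` module stands in for subprocess32, and the child receives a copy of the parent environment plus TEMP rather than an environment containing only TEMP.

```python
import os
import subprocess
import sys
import tempfile


def launch_in_dir(cmd, run_dir, temp_path):
    """Start cmd (a list of arguments) with run_dir as its working directory.

    stdout and stderr go to files inside run_dir. Returns the Popen object
    and the two open file objects so the caller can close them after wait().
    """
    out = open(os.path.join(run_dir, 'stdout'), 'w+')
    err = open(os.path.join(run_dir, 'stderr'), 'w+')
    # Assumption: copy the parent's environment and add TEMP, instead of
    # passing an environment that contains only TEMP.
    env = dict(os.environ, TEMP=temp_path)
    proc = subprocess.Popen(cmd, shell=False, cwd=run_dir,
                            stdout=out, stderr=err, env=env)
    return proc, out, err


if __name__ == '__main__':
    # Placeholder command standing in for the pre-compiled binary.
    cmd = [sys.executable, '-c', 'print("hello")']
    run_dir = tempfile.mkdtemp()
    proc, out, err = launch_in_dir(cmd, run_dir, run_dir)
    proc.wait()
    out.close()
    err.close()
```

Opening the output files by absolute path inside each run directory keeps the processes from sharing a single 'stdout' or 'stderr' file in the current working directory.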
Bryan P

3 Answers

1

That's a bug in Python 2.6 that was fixed in 2.7.

In 2.6, an IOError is raised when a read system call returns EINTR from within the file methods read(), readline(), and readlines().

see: https://github.com/python/cpython/commit/736ca00270db72fefa5fb278982c96e5e7139d72

and

https://github.com/python/cpython/blob/2.6/Objects/fileobject.c#L1362.

Upgrade your Python and everything should be fine.
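
If upgrading is not an option, the interrupted-call case can in principle be handled by retrying the call when it fails with EINTR. The following is a minimal sketch, not the fix from the linked commit; `retry_on_eintr` is a hypothetical helper name. Note that EINTR is a different error code from the EFAULT (errno 14) reported in the question, so this only covers the interrupted-call case.

```python
import errno


def retry_on_eintr(func, *args, **kwargs):
    """Call func(*args, **kwargs), repeating the call while it fails with EINTR."""
    while True:
        try:
            return func(*args, **kwargs)
        except (IOError, OSError) as exc:
            # Only an interrupted system call is retried; any other error propagates.
            if exc.errno != errno.EINTR:
                raise
```

A typical use would be `data = retry_on_eintr(os.read, fd, 4096)` on a raw file descriptor. Retrying a buffered file method such as readline() is less safe, since data consumed before the interruption may be lost.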

Mohanson
1

I was having the same issue and somehow managed to get rid of it. I identified that the problem happens only when the code runs on different nodes in the cluster. If all the ranks are on the same node, everything is fine. Here is what I tried on my system:

  • Use shell=True in Popen. With this the problem didn't go away, but the occurrence was much lower on my HPC.
  • Stop using Anaconda. I was using mpi4py and Python from there.
  • Use Open MPI 4.0.1. I was using version 3 before.
  • Make a Python virtual environment and install all the required tools (mpi4py, NumPy, etc.) in there with pip.

I don't know which of these steps was the useful one, but the issue seems to be gone now.

mdgm
0

It seems to happen on Windows. No problems on macOS.

I have found a hacky but practical solution:

import subprocess

while True:
    try:
        proc = subprocess.Popen(cmd,
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE, bufsize=1)
        (output, err) = proc.communicate()
        break
    except OSError:  # narrower than a bare except, so Ctrl-C still interrupts the loop
        log_msg("Exception in spawning subprocess. Retrying ...")
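
A retry loop like this never gives up if the failure is permanent. A bounded variant might look like the sketch below. The names `run_with_retry`, `max_attempts`, `delay` and the injectable `popen` parameter are illustrative assumptions, not part of the snippet above.

```python
import subprocess
import time


def run_with_retry(cmd, max_attempts=5, delay=0.5, popen=subprocess.Popen):
    """Run cmd and return (output, err), retrying when spawning raises OSError.

    max_attempts, delay and the popen parameter are illustrative choices.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            proc = popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            return proc.communicate()
        except OSError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            time.sleep(delay * attempt)  # wait a little longer each time
```

Re-raising after the last attempt keeps a persistent failure visible instead of looping forever, and the growing delay gives a transient condition some time to clear.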
fieres