
I am using the openmdao 0.13 python module in a project. The module is distributed only as a virtualenv. When I activate the virtualenv, it appears to take effect on only one node. What could account for this strange behavior? Why are processes on the non-primary node unable to import from the virtualenv?

$ mpirun --version
mpirun (Open MPI) 1.7.3
$ qsub --version
Version: 5.1.1.2
$ qsub -V -I -l nodes=2:ppn=24
$ cd openmdao-0.10.3.2/
$ . bin/activate
$ mpirun -np 24 python -c "import openmdao"
# no errors
$ mpirun -np 27 python -c "import openmdao"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named openmdao
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named openmdao
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named openmdao

All 27 processes appear to resolve python to the virtualenv interpreter. (Since the job requests nodes=2:ppn=24, the first 24 ranks fit on the primary node, so the three tracebacks above presumably come from the ranks placed on the second node.)

$ mpirun -np 27 which python
/home/jquick/here_it_is/openmdao-0.10.3.2/bin/python
/home/jquick/here_it_is/openmdao-0.10.3.2/bin/python
/home/jquick/here_it_is/openmdao-0.10.3.2/bin/python
... (the same virtualenv path is printed for all 27 ranks)

I don't understand what could possibly be causing this import error. What could `bin/activate` be affecting on my primary node but not on the secondary node?

kilojoules
  • Which batch system is this? Which MPI implementation? Does your python code actually use MPI? Try `mpirun -np 30 bash -c '. ./bin/activate && python -c "..."'`. – Zulan Feb 26 '16 at 08:37
  • I tried this and it didn't work. I am running this through qsub's interactive options. I also tried `mpirun -np 30 ./bin/python -c "import openmdao"`, which did not work either. – kilojoules Feb 27 '16 at 16:15
  • How did it not work, does it give the same error? In general it seems your environment gets lost. You need to figure out if so and where, and either preserve it or restore it. You should be able to restore it with a wrapper script containing `. ./bin/activate` (see the wrapper sketch after these comments). You can debug what the environment is at each stage, e.g. by printing `sys.path` within the python processes or by running `mpirun env | grep PATH` etc. – Zulan Feb 27 '16 at 16:27
  • Yes, it gives the same error. `mpirun -np 27 echo $PATH` shows the same output on all processors. Similarly, `mpirun -np 27 python -c "import sys; print sys.path"` shows the same series of paths printed 27 times. I am curious about the wrapper script - could you expand on that please? – kilojoules Feb 27 '16 at 16:40
  • `mpirun -np 27 echo $PATH` does not work as you might think. The local shell expands `$PATH` once, and then `echo /bin:/...` is executed on all nodes (see the environment-check sketch after these comments). – Zulan Feb 27 '16 at 16:43
  • If your environment is intact, maybe the home file system is not shared / not available on the second node? – Zulan Feb 27 '16 at 16:44
  • Thank you for all of these great ideas! The user's guide for this system recommends running qsub jobs in a /scratch folder. I built the openmdao environment there, but I am still getting the same ImportError. – kilojoules Feb 27 '16 at 17:17
  • A last resort would be to use something like `mpirun -np 27 strace -o debug.log -ff -e trace=file python...`. This records all file operations into a log so you can debug (a) where python actually looks for the module and (b) what goes wrong when it tries to open the right file. Preferably do this with a wrapper script that adds the `hostname` to the log file name (see the strace wrapper sketch after these comments). – Zulan Feb 28 '16 at 09:23
  • @Zulan The unsuccessful run log files have two of these lines near the end of the log: `open("", O_RDONLY) = -1 ENOENT (No such file or directory)`, which is a line the successful log files don't have. Could this be the problem? – kilojoules Feb 29 '16 at 17:00
  • Yes, that looks suspicious indeed. However, usually I would expect to find a line in the *successful* logs that is *not present* in the unsuccessful logs. – Zulan Feb 29 '16 at 17:50
  • @Zulan The successful logs are 4.7 MB each while the unsuccessful logs are 140 KB each. It's hard to know what to look for. – kilojoules Feb 29 '16 at 17:53
  • Have you looked at something along the lines of `open(*openmdao*)`? – Zulan Feb 29 '16 at 18:02
  • @Zulan I looked through all the files like this: `grep "open(*o" *` and didn't find anything – kilojoules Mar 01 '16 at 16:56
  • It is getting a little misty, but here are some more ideas: 1) Run the `python` import with `python -v ...`. 2) `diff` a pair of successful vs unsuccessful logs and see where they diverge. You may have to filter; look especially for `open`. – Zulan Mar 03 '16 at 17:00
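
A minimal version of the wrapper script suggested in the comments might look like the following (the script name `pywrap.sh` and the absolute virtualenv path are assumptions based on the paths shown in the question):

$ cat pywrap.sh
#!/bin/bash
# Hypothetical wrapper: re-activate the virtualenv on every rank,
# then replace this shell with python, passing through all arguments.
. /home/jquick/here_it_is/openmdao-0.10.3.2/bin/activate
exec python "$@"
$ chmod +x pywrap.sh
$ mpirun -np 27 ./pywrap.sh -c "import openmdao"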
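
To actually compare `PATH` across ranks, the variable has to be expanded on the remote side rather than by the local shell, e.g. (a sketch, not from the original thread):

$ mpirun -np 27 printenv PATH        # each rank reports its own PATH
$ mpirun -np 27 sh -c 'echo $PATH'   # single quotes defer expansion to the remote shell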
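
The strace wrapper could be sketched along these lines (the script name `trace.sh` is an assumption; with `-ff`, strace appends each traced process id to the output file name, so every rank writes its own log tagged with its node's hostname):

$ cat trace.sh
#!/bin/bash
# Hypothetical wrapper: trace file operations of the python process,
# writing one log per process named debug.<hostname>.<pid>
exec strace -o "debug.$(hostname)" -ff -e trace=file python "$@"
$ chmod +x trace.sh
$ mpirun -np 27 ./trace.sh -c "import openmdao"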

1 Answer


Activating the virtualenv before starting the qsub job fixed this for me. Presumably `qsub -V` copies the environment as it exists at submission time into the job on every allocated node, whereas sourcing `bin/activate` inside the interactive session only modifies the shell on the primary node.
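
In session form, the working order of operations would be (a sketch based on the commands in the question; the `(openmdao-0.10.3.2)` prompt prefix is how virtualenv normally marks an active environment):

$ cd openmdao-0.10.3.2/
$ . bin/activate
(openmdao-0.10.3.2)$ qsub -V -I -l nodes=2:ppn=24
$ mpirun -np 27 python -c "import openmdao"
# no errors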

kilojoules