
I have a project structured as follows:

- topmodule/
   - childmodule1/
      -  my_func1.py
   - childmodule2/
      -  my_func2.py
   - common.py
   - __init__.py

From my Jupyter notebook on an edge node of a Dask cluster, I am doing the following:

from topmodule.childmodule1.my_func1 import MyFuncClass1
from topmodule.childmodule2.my_func2 import MyFuncClass2

Then I am creating a distributed client and submitting work as follows:

client = Client(YarnCluster())
client.submit(MyFuncClass1.execute)

This errors out, because the workers do not have the files of topmodule:

"/mnt1/yarn/usercache/hadoop/appcache/application_1572459480364_0007/container_1572459480364_0007_01_000003/environment/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 59, in loads return pickle.loads(x) ModuleNotFoundError: No module named 'topmodule'

So what I tried to do is upload every single file under "topmodule". The files directly under "topmodule" seem to get uploaded, but the nested ones do not. Below is what I am talking about:

Code:

from pathlib import Path

for filename in Path('topmodule').rglob('*.py'):
    print(filename)
    client.upload_file(filename)

Console output:

topmodule/common.py # processes fine 
topmodule/__init__.py # processes fine 
topmodule/childmodule1/my_func1.py # throws error

Traceback:


---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-13-dbf487d43120> in <module>
      3 for filename in Path('nodes').rglob('*.py'):
      4     print(filename)
----> 5     client.upload_file(filename)

~/miniconda/lib/python3.7/site-packages/distributed/client.py in upload_file(self, filename, **kwargs)
   2929         )
   2930         if isinstance(result, Exception):
-> 2931             raise result
   2932         else:
   2933             return result

ModuleNotFoundError: No module named 'topmodule'

My question is: how can I upload an entire module and its files to workers? Our module is big, so I want to avoid restructuring it just for this issue, unless the way we're structuring the module is fundamentally flawed.

Or is there a better way to have all Dask workers understand the modules, perhaps from a git repository?

Jenna Kwon
  • We had similar problems - this is probably because the `yarn client` is not able to load/access the nested module [see this github issue](https://github.com/dask/dask-yarn/issues/86). – skibee Oct 31 '19 at 09:26
  • @JosephBerry I see... I think pip install from a specific git repo is a good idea. Our repo is hosted on Amazon AWS - I will try and see if that works. – Jenna Kwon Oct 31 '19 at 12:19
  • @JennaKwon how did you solve this? – ps0604 Jul 04 '21 at 15:31

1 Answer


When you call upload_file on every file individually, you lose the directory structure of your module.

If you want to upload a more comprehensive module, you can package your module into a zip or egg file and upload that.

https://docs.dask.org/en/latest/futures.html#distributed.Client.upload_file
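A minimal sketch of the zip approach, using `shutil.make_archive` so the nested directory layout is preserved inside the archive. The snippet rebuilds a stand-in for the `topmodule` layout from the question in a temporary directory so it is self-contained; with a real project you would zip the actual package root instead. Note that the child directories also need an `__init__.py` to be importable as subpackages (the layout in the question shows one only at the top level).

```python
import shutil
import tempfile
from pathlib import Path

# Recreate the layout from the question in a temp dir (a stand-in
# for the real project, just to make this snippet runnable).
root = Path(tempfile.mkdtemp())
for sub in ("childmodule1", "childmodule2"):
    (root / "topmodule" / sub).mkdir(parents=True)
    (root / "topmodule" / sub / "__init__.py").write_text("")
(root / "topmodule" / "__init__.py").write_text("")
(root / "topmodule" / "childmodule1" / "my_func1.py").write_text(
    "class MyFuncClass1:\n    pass\n"
)

# Zip the whole package directory; base_dir keeps the "topmodule/..."
# prefix, so childmodule1/ and childmodule2/ stay nested in the archive.
archive = shutil.make_archive(
    str(root / "topmodule"), "zip", root_dir=root, base_dir="topmodule"
)
print(archive)  # .../topmodule.zip

# On the cluster you would then upload the single archive instead of
# individual files; upload_file ships it to every worker and makes it
# importable there:
# client.upload_file(archive)
```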

MRocklin
    I did try the zip approach. The import statement didn't work, though. I logged on to a worker node and validated that the zipped directory was under a dask-worker-space location. Is there somewhere else the zipped directory should go? I am using a YarnCluster (Amazon EMR). – Jenna Kwon Nov 04 '19 at 06:07