I have a project structured as follows:
- topmodule/
  - childmodule1/
    - my_func1.py
  - childmodule2/
    - my_func2.py
  - common.py
  - __init__.py
From a Jupyter notebook on an edge node of a Dask cluster, I run the following imports:
from topmodule.childmodule1.my_func1 import MyFuncClass1
from topmodule.childmodule2.my_func2 import MyFuncClass2
Then I create a distributed client and submit work as follows:
from dask.distributed import Client
from dask_yarn import YarnCluster

client = Client(YarnCluster())
client.submit(MyFuncClass1.execute)
This fails because the workers do not have the topmodule files:
"/mnt1/yarn/usercache/hadoop/appcache/application_1572459480364_0007/container_1572459480364_0007_01_000003/environment/lib/python3.7/site-packages/distributed/protocol/pickle.py", line 59, in loads
    return pickle.loads(x)
ModuleNotFoundError: No module named 'topmodule'
I then tried uploading every file under "topmodule" individually. The files directly under "topmodule" seem to upload fine, but the nested ones do not, as shown below:
Code:
from pathlib import Path
for filename in Path('topmodule').rglob('*.py'):
    print(filename)
    client.upload_file(filename)
Console output:
topmodule/common.py # processes fine
topmodule/__init__.py # processes fine
topmodule/childmodule1/my_func1.py # throws error
Traceback:
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-13-dbf487d43120> in <module>
      3 for filename in Path('nodes').rglob('*.py'):
      4     print(filename)
----> 5     client.upload_file(filename)

~/miniconda/lib/python3.7/site-packages/distributed/client.py in upload_file(self, filename, **kwargs)
   2929         )
   2930         if isinstance(result, Exception):
-> 2931             raise result
   2932         else:
   2933             return result

ModuleNotFoundError: No module named 'topmodule'
My question: how can I upload an entire module, including its nested files, to the workers? The module is large, so I would like to avoid restructuring it just for this issue, unless the way we structure it is fundamentally flawed.
Alternatively, is there a better way to make the module available to all Dask workers, perhaps by installing it from a git repository?
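One option that may avoid restructuring is to ship the package as a single archive instead of file by file: the Dask documentation for Client.upload_file says that .py, .egg, and .zip files are placed on the workers' import path. The sketch below is only an illustration under that assumption. zip_package and upload_package are hypothetical helper names, not part of Dask, and the sketch assumes every sub-folder is a regular package with its own __init__.py (the layout above only shows one at the top level), since folders without one may not import reliably from a zip.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir, zip_path):
    """Write every .py file under package_dir into zip_path.

    Archive names keep the package's top-level folder (e.g. topmodule/...),
    so `import topmodule.childmodule1.my_func1` can still resolve.
    """
    package_dir = Path(package_dir).resolve()
    root = package_dir.parent  # archive names are relative to the parent folder
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(package_dir.rglob("*.py")):
            zf.write(path, path.relative_to(root).as_posix())
    return str(zip_path)


def upload_package(client, package_dir="topmodule", zip_path="topmodule.zip"):
    """Hypothetical helper: zip the package and send the one archive to the workers."""
    client.upload_file(zip_package(package_dir, zip_path))
```

With the client created earlier, upload_package(client) would be called once before client.submit(...). Whether this is sufficient on a particular YARN deployment has not been verified here.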